
ABSTRACT

MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.)

This work studies the problem of reconstructing a signal from measurements obtained by a sensing system, where the measurement model that characterizes the sensing system may be linear or nonlinear.

We first consider linear measurement models. In particular, we study the popular low-complexity iterative linear inverse algorithm, approximate message passing (AMP), in a probabilistic setting, meaning that the signal is assumed to be generated from some probability distribution, though the distribution may be unknown to the algorithm. The existing rigorous performance analysis of AMP only allows a separable or block-wise separable estimation function at each iteration of AMP, and therefore cannot capture sophisticated dependency structures in the signal. This work studies the case when the signal has a Markov random field (MRF) prior, which is commonly used in image applications. We provide a rigorous performance analysis of AMP with a class of non-separable sliding-window estimation functions, which is suitable for capturing local dependencies in an MRF prior.

In addition, we design AMP-based algorithms with non-separable estimation functions for hyperspectral imaging and universal compressed sensing (imaging), and compare our algorithms to state-of-the-art algorithms with extensive numerical examples. For fast computation in large-scale problems, we study a multiprocessor implementation of AMP and provide its performance analysis. Additionally, we propose a two-part reconstruction scheme where Part 1 detects zero-valued entries in the signal using a simple and fast algorithm, and Part 2 solves for the remaining entries using a high-fidelity algorithm. Such a two-part scheme naturally leads to a trade-off analysis of speed and reconstruction quality.

Finally, we study diffractive imaging, where the electric permittivity distribution of an object is reconstructed from scattered wave measurements. When the object is strongly scattering, a nonlinear measurement model is needed to characterize the relationship between the permittivity and the scattered wave. We propose an inverse method for nonlinear diffractive imaging. Our method is based on a nonconvex optimization formulation. The nonconvex solver used in the proposed method is our new variant of the popular convex solver, the fast iterative shrinkage/thresholding algorithm (FISTA). We provide a fast and memory-efficient implementation of our new FISTA variant and prove that it reliably converges for our nonconvex optimization problem. Hence, our new FISTA variant may be of interest on its own as a general nonconvex solver. In addition, we systematically compare our method to state-of-the-art methods on simulated as well as experimentally measured data in both 2D and 3D (vectorial field) settings.


© Copyright 2017 by Yanting Ma

All Rights Reserved


Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization

by
Yanting Ma

A dissertation submitted to the Graduate Faculty of
North Carolina State University
in partial fulfillment of the
requirements for the Degree of

Doctor of Philosophy

Electrical Engineering

Raleigh, North Carolina

2017

APPROVED BY:

Jack Silverstein, Minor Member

Brian Hughes

Cranos Williams

Deanna Needell, External Member

Ahmad Beirami, External Member

Dror Baron, Chair of Advisory Committee


BIOGRAPHY

Yanting Ma received the B.S. degree in communication engineering from Wuhan University, China, and started the graduate program at the Department of Electrical and Computer Engineering of North Carolina State University in 2012. She was a research intern at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, in the summers of 2016 and 2017. Her research interests are mainly in low-complexity algorithm design and analysis for large-scale linear and nonlinear inverse problems using message passing and optimization techniques, especially with applications in computational imaging. Additionally, she is generally interested in applied probability, stochastic processes, convex analysis, and partial differential equations.


ACKNOWLEDGMENTS

I would like to first thank my advisor Prof. Dror Baron for providing me with the opportunity

to pursue research topics that I am truly enthusiastic about. I especially thank him for helping

me build connections to the great people whom I would like to thank in the following.

My deepest thanks go to Prof. Min Kang. From the two probability courses that I took with

her, I started to appreciate the beauty of math. Building the good habit of learning rigorously

and understanding materials thoroughly has been extremely helpful in my later studies. I also

thank Prof. Kang for her kindness and encouragement during hard times.

Thanks to Prof. Jack Silverstein for being on my committee and for, together with Prof.

Kang, helping me learn convex analysis in a completely rigorous way. Our study on convex

optimization created the opportunity for my second internship at Mitsubishi Electric Research

Laboratories (MERL). Without their help in improving my mathematical skills, I would not be

confident in pursuing the interesting research topics in the later stage of my PhD journey.

Thanks also to my other primary collaborators: Prof. Cynthia Rush, Prof. Ulugbek Kamilov,

Prof. Yue Lu, Prof. Deanna Needell, Dr. Ahmad Beirami, Dr. Jin Tan, and Dr. Junan Zhu.

I would especially like to thank Cindy for working closely with me on the state evolution analysis of

approximate message passing, which is presented in Chapter 2 and Appendix A. (You may

notice that this work occupies almost half of this dissertation.) Ulugbek opened the door to the

fascinating field of computational imaging for me and our work is presented in Chapter 5 and

Appendix D. I thank Ulugbek for hosting me at MERL and sharing his creative ideas. Thanks

to Jin for her help with my research in the early stage of the program and being a considerate

friend. What I gained from my collaborators is much more than just publications, of course.

Their way of thinking and approaching problems would definitely influence me in the future.

I thank Prof. Cranos Williams and Prof. Brian Hughes for being on my committee and pro-

viding valuable comments. Thanks to my other collaborators at MERL, Dr. Petros Boufounos,

Dr. Dehong Liu, Dr. Hassan Mansour, Dr. Yuichi Taguchi, and Dr. Anthony Vetro. The two

summer internships at MERL were invaluable experiences for me. There I found an open-minded

research culture and exciting multidisciplinary research topics. Also from MERL, thanks to Dr.

Teng-Yok Lee and Dr. Chungwei Lin for insightful discussions and for being good friends.

Thanks to Prof. Ramji Venkataramanan for his generous help with my application for the

Newton International Fellowships, for which he devoted a large amount of time discussing re-

search directions with me and revising my research proposal. Although the application was not

awarded, I hope to collaborate with Ramji and pursue the proposed research in the future. Also

for this application, I thank Prof. Daniel Stancil, Prof. Wei Dai, Prof. Yue Lu, and Prof. Dror

Baron for writing reference letters for me.

Last but not least, I thank my family and friends for their unconditional love and support.


TABLE OF CONTENTS

LIST OF FIGURES

Chapter 1 Introduction
    1.1 Approximate Message Passing for Linear Inverse Problems
    1.2 Nonlinear Diffractive Imaging via Optimization
    1.3 Dissertation Organization
    1.4 Notation

Chapter 2 State Evolution Analysis of Approximate Message Passing with Non-Separable Denoisers
    2.1 Definition of the Algorithm
    2.2 Performance Analysis
        2.2.1 Definitions and Assumptions
        2.2.2 Main Result
        2.2.3 Numerical Examples
    2.3 Proof of Theorem 2.2.1
        2.3.1 Proof Notation
        2.3.2 Concentrating Constants
        2.3.3 Conditional Distribution Lemma
        2.3.4 Main Concentration Lemma
        2.3.5 Proof of Theorem 2.2.1
    2.4 Proof of Lemma 2.3.4
        2.4.1 Step 2: Showing that H1 holds
        2.4.2 Step 4: Showing that Ht+1 holds
    2.5 Additional Result for 1D Signals with Markov Chain Priors
        2.5.1 Definitions and Assumptions
        2.5.2 Performance Guarantee
        2.5.3 Proof of Theorem 2.5.1
    2.6 Conclusion

Chapter 3 Application of Approximate Message Passing with Non-Separable Denoisers
    3.1 Approximate Message Passing with Universal Denoiser
        3.1.1 Related Work
        3.1.2 Proposed Method
        3.1.3 Numerical Results
    3.2 Approximate Message Passing with Adaptive Wiener Filter
        3.2.1 Problem Formulation
        3.2.2 Proposed Method
        3.2.3 Numerical Results
    3.3 Conclusion

Chapter 4 Fast Computation in Linear Inverse Problems
    4.1 Multiprocessor Approximate Message Passing with Column-Wise Partitioning
        4.1.1 Definition of the Algorithm
        4.1.2 Performance Analysis
        4.1.3 Proof of Theorem 4.1.1
        4.1.4 Numerical Examples
    4.2 Two-Part Reconstruction Framework
        4.2.1 The Noisy-Sudocodes Algorithm
        4.2.2 Performance Analysis
        4.2.3 Trade-Off between Runtime and Reconstruction Quality
        4.2.4 Application to 1-Bit Compressed Sensing
    4.3 Conclusion

Chapter 5 Nonlinear Diffractive Imaging via Optimization
    5.1 Related Work
    5.2 Problem Formulation
        5.2.1 Scalar Field Setting
        5.2.2 Vectorial Field Setting
        5.2.3 Nonconvex Optimization Formulation
    5.3 Proposed Method
    5.4 Experimental Results
    5.5 Conclusion

Chapter 6 Discussion

BIBLIOGRAPHY

APPENDICES
    Appendix A Chapter 2 Appendix
        A.1 Concentration Lemmas
        A.2 Other Useful Lemmas
        A.3 Concentration with Dependencies for Theorem 2.2.1
        A.4 Concentration with Dependencies for Theorem 2.5.1
    Appendix B Chapter 3 Appendix
        B.1 Derivation of (3.4)
    Appendix C Chapter 4 Appendix
        C.1 Proof of Lemma 4.1.2
        C.2 Proof of Lemma 4.2.1
        C.3 Proof of Lemma 4.2.2
    Appendix D Chapter 5 Appendix
        D.1 Proof of Proposition 5.3.1
        D.2 Proof of Proposition 5.3.1
        D.3 Convergence Analysis
            D.3.1 Definitions and Standard Results
            D.3.2 Convergence of Relaxed FISTA


LIST OF FIGURES

Figure 2.1  For Λ of size 3×3, the denoiser ηt : R^Λ → R may only process the pixels in gray (the center and the four adjacent pixels).

Figure 2.2  Illustration of the definition of “missing” entries in a sliding window in Z². The matrix v ∈ R^{4×4}. The half-window size is k = 1, thus Λ = [3]×[3]. For the window Λ_{(1,1)} centered at coordinate (1,1), the “existing” entries in the window are v_{1,1}, v_{1,2}, v_{2,1}, v_{2,2}, as shown in dark gray. Five entries, which are in light gray, are missing, hence we define their value to be the average of the existing ones, v̄ = (v_{1,1} + v_{1,2} + v_{2,1} + v_{2,2})/4.

Figure 2.3  Numerical example. From left to right: ground-truth image generated by the MRF described in Section 2.2.3.1, image reconstructed by AMP with a separable Bayesian denoiser (computed from the incorrect assumption that the signal is generated from an i.i.d. Bernoulli distribution), and image reconstructed by AMP with a Bayesian sliding-window denoiser with k = 1, hence Λ = [3]×[3]. (Γ = [128]×[128], δ = 0.5, SNR = 17 dB.)

Figure 2.4  Numerical verification that the empirical MSE achieved by AMP with sliding-window denoisers is tracked by state evolution. The empirical MSE is averaged over 50 realizations of the MRF (as described in Section 2.2.3.1), measurement matrix, and measurement noise. (Γ = [128]×[128], δ = 0.5, SNR = 17 dB.)

Figure 2.5  Reconstruction of texture images using AMP with different denoisers. From left to right: original gray level images, binary ground-truth images, images reconstructed by AMP with a total variation denoiser [8], non-separable Bayesian sliding-window denoiser (MRF prior, k = 1), and separable Bayesian denoiser (Bernoulli prior), respectively. From top to bottom: images of cloud, leaf, and wood, respectively. (Γ = [128]×[128], δ = 0.3, SNR = 20 dB.)

Figure 3.1  Flow chart of AMP-UD. AMP decouples the linear inverse problem into denoising problems. In the t-th iteration, the universal denoiser η_{univ,t}(·) converts stationary ergodic signal denoising into i.i.d. signal denoising. Each i.i.d. denoiser η_{iid,t}(s^{t,(l)}) generates the denoised signal x^{t+1,(l)} and the derivative of the denoiser η′_{iid,t}(s^{t,(l)}) for l ∈ [L]. The algorithm stops when the iteration index t reaches the predefined maximum tMax, and outputs x^{tMax} as the final result.

Figure 3.2  Comparison of the reconstruction results obtained by the two AMP-UD implementations to those by SLA-MCMC and EM-GM-AMP-MOS for simulated i.i.d. sparse Laplace signals. Note that the SDR curves for the two AMP-UD implementations and EM-GM-AMP-MOS overlap the MMSE.

Figure 3.3  Comparison of the reconstruction results obtained by the two AMP-UD implementations to those by SLA-MCMC and turboGAMP for simulated stationary ergodic signals.


Figure 3.4  Comparison of the reconstruction results obtained by the two AMP-UD implementations to those by SLA-MCMC and EM-GM-AMP-MOS for real-world signals.

Figure 3.5  Comparison of the reconstruction results obtained by AMP-UD2 to those by AMP-BM3D for images of size 128×128 from noiseless measurements. From top to bottom: ground-truth images, images reconstructed by AMP-UD2, and images reconstructed by AMP-BM3D. From left to right: the first four columns are natural images (δ = 0.3), and the last column is a realization of the MRF defined in Section 2.2.3.1 (δ = 0.5).

Figure 3.6  The matrix H is presented for K = 2, I = J = 8, and L = 4. The circled diagonal patterns that repeat horizontally correspond to the coded aperture pattern used in the first FPA shot. The second coded aperture pattern determines the next set of diagonals.

Figure 3.7  The Lego scene. (The target object presented in the experimental results was not endorsed by the trademark owners and it is used here as fair use to illustrate the quality of reconstruction of compressive spectral image measurements. LEGO is a trademark of the LEGO Group, which does not sponsor, authorize or endorse the images in this work. The LEGO Group. All Rights Reserved. http://aboutus.lego.com/en-us/legal-notice/fair-play/.)

Figure 3.8  Comparison of AMP-3D-Wiener, GPSR, and TwIST for the Lego image cube. Cube size is I = J = 256, and L = 24. The measurements are captured with K = 2 shots using complementary random coded apertures, and the number of measurements is n = 143,872. Random Gaussian noise is added to the measurements such that the SNR is 20 dB.

Figure 4.1  C-MP-AMP for Gaussian matrices.

Figure 4.2  C-MP-AMP for non-Gaussian matrices.

Figure 4.3  Top: Relative error between the empirical and theoretical probability of missed detection. Bottom: Relative error between the empirical and theoretical probability of false alarm. (The theoretical probabilities rely on the asymptotic independence result of Lemma 4.2.1.)

Figure 4.4  Numerical verification of approximations made in the analysis of Part 2.

Figure 4.5  Trade-offs between reconstruction quality, measurement rate δ, and runtime of Noisy-Sudocodes with AMP in Part 2.

Figure 4.6  Numerical verification of the prediction for SDR (3.10) (top) and runtime (bottom) of Noisy-Sudocodes with AMP in Part 2.

Figure 4.7  Numerical results of Noisy-Sudocodes with BIHT in Part 2 in a noisy 1-bit CS setting. In both figures, Top: SDR as a function of measurement rate δ. Bottom: SDR as a function of runtime. (n₁/N = 0.1, n₂ = n − n₁, N = 10,000, s = 0.005, d = 0.8.)


Figure 5.1  Visual representation of the measurement scenario considered in this work. An object with a real permittivity contrast χ(r) is illuminated with an input wave u_in(r), which interacts with the object and results in the scattered wave u_sc at the sensor domain Γ ⊂ R². The complex scattered wave is captured at the sensor and the algorithm proposed here is used for estimating the contrast χ.

Figure 5.2  The measurement scenario for the 3D case considered in this work. The object is placed within a bounded image domain Ω. The transmitter antennas (Tx) are placed on a sphere and are linearly polarized. The arrows in the figure define the polarization direction. The receiver antennas (Rx) are placed in the sensor domain Γ within the x-y (azimuth) plane, and are linearly polarized along the z direction.

Figure 5.3  Empirical convergence speed for relaxed FISTA with various α values tested on experimentally measured data.

Figure 5.4  Comparison of different reconstruction methods for various contrast levels tested on simulated data.

Figure 5.5  From top to bottom: Reconstructed images obtained by FB, IL, and CISOR. Each column represents one contrast value as indicated at the bottom of the images in the third row. CISOR is stable for all tested contrast values, whereas FB and IL fail for large contrast.

Figure 5.6  Images reconstructed by different algorithms from experimentally measured data for 2D objects. The first and second rows use the FoamDielExtTM and FoamDielIntTM objects, respectively. From left to right: ground truth, images reconstructed by CISOR, SEAGLE, IL, CSI, and FB. The color map for FB is different from the rest, because FB significantly underestimated the contrast value. The size of the reconstructed objects is 128×128 pixels.

Figure 5.7  Images reconstructed by CISOR, IL, and CSI from experimentally measured data for the TwoSpheres object. From left to right: ground truth, image slices reconstructed by CISOR, IL, and CSI, and the reconstructed contrast distribution along the dashed lines shown in the image slices in the first column. From top to bottom: image slices parallel to the x-y plane with z = 0 mm, parallel to the x-z plane with y = 0 mm, and parallel to the y-z plane with x = −25 mm. The size of the reconstructed objects is 32×32×32 pixels for a 150×150×150 mm cube centered at (0, 0, 0).

Figure 5.8  Images reconstructed by CISOR, IL, and CSI from experimentally measured data for the TwoCubes object. From left to right: ground truth, image slices reconstructed by CISOR, IL, and CSI, and the reconstructed contrast distribution along the dashed lines shown in the image slices in the first column. From top to bottom: image slices parallel to the x-y plane with z = 33 mm, parallel to the x-z plane with y = −17 mm, and parallel to the y-z plane with x = 17 mm. The size of the reconstructed objects is 32×32×32 pixels for a 100×100×100 mm cube centered at (0, 0, 50) mm.


Chapter 1

Introduction

Computational sensing aims to utilize advanced computational inverse methods to improve the

signal reconstruction quality while enabling reduction in data acquisition time and cost. For re-

liable reconstruction, it is important to use an accurate mathematical model for the relationship

between the measurements acquired by the sensing system and the signal to be reconstructed.

In many cases, linear formulations are adequate, whereas in other cases, nonlinear formulations

need to be considered. In addition, incorporating prior information about the signal is also

crucial to improve the reconstruction quality, especially when the number of measurements is

limited and the problem is therefore ill-posed. While many conventional algorithms and their analy-

ses rely on the assumption of independence among signal entries, there is increased interest in

exploiting local and global dependencies.

A mathematical model for inverse problems can be expressed as follows. Let x ∈ C^Γ denote the unknown signal to be reconstructed, where the index set Γ ⊂ Z^p with p = 1, 2, 3 is a finite rectangular lattice. Moreover, let y ∈ C^n be the measurements acquired by a sensing system and A : C^Γ → C^n be an operator modeling the relationship between measurements and unknowns. Then we have

$$y = A(x) + w, \qquad (1.1)$$

where w ∈ C^n is noise. When A is a linear operator, we use a matrix representation A ∈ C^{n×|Γ|} for A, where |Γ| denotes the cardinality of Γ. Let V (script V stands for “vectorization”) be an invertible operator that rearranges elements of its argument into a vector, hence V(x) ∈ C^{|Γ|}. The linear system is then

$$y = A\,\mathcal{V}(x) + w. \qquad (1.2)$$
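To make the linear model (1.2) concrete, the following Python sketch generates a small synthetic instance of it. The signal, sampling ratio, and noise level are illustrative choices made only for this example; they are not taken from the dissertation.

```python
import numpy as np

# A minimal sketch of the linear measurement model (1.2): y = A V(x) + w.
rng = np.random.default_rng(0)

N = 64                                                   # side length, so Gamma = [N] x [N]
x = rng.binomial(1, 0.1, size=(N, N)).astype(float)      # a hypothetical sparse 2D signal

vec = lambda arr: arr.reshape(-1)                        # the vectorization operator V
n = int(0.5 * N * N)                                     # number of measurements (delta = 0.5)
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N * N))   # i.i.d. N(0, 1/n) entries
w = rng.normal(0.0, 0.05, size=n)                        # i.i.d. measurement noise

y = A @ vec(x) + w                                       # the measurements in (1.2)
print(y.shape)                                           # (n,)
```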

This dissertation studies the problem of estimating x given y, A, and possible statistical


information about w. For the linear case, we study the algorithmic and theoretical aspects

of the class of approximate message passing (AMP) algorithms [39]. For the nonlinear case,

we study multiple scattering inversion for nonlinear diffractive imaging based on nonconvex

optimization.

1.1 Approximate Message Passing for Linear Inverse Problems

Recently, a low-complexity iterative algorithm called approximate message passing (AMP) [39]

has received considerable attention for large-scale linear inverse problems. The performance of

AMP depends on a sequence of estimation functions {ηt}t≥0 used to generate a sequence of estimates {xt}t≥0 from auxiliary observations {st}t≥0 at every iteration t of the algorithm. The

function ηt is said to be separable if it acts on st coordinate-wise. When separable functions

are applied in AMP, for linear systems where the matrix A has independent and identically

distributed (i.i.d.) Gaussian entries and the empirical distribution of x converges to some prob-

ability measure px on R, Bayati and Montanari [7] proved that for any fixed t, the performance

of AMP, such as the normalized ℓ2 error (1/|Γ|)‖V(x^t − x)‖₂², converges almost surely to a deterministic value predicted by a scalar recursion called state evolution as n, |Γ| → ∞ with the ratio n/|Γ| → δ ∈ (0, ∞). In cases where x has i.i.d. sub-Gaussian entries, Rush and Venkatara-

manan [104] proved a finite-sample result stating that the probability of ε-deviation of various

performance measures from the state evolution prediction falls exponentially in |Γ|. The requirement that ηt be separable has limited the ability of AMP to incorporate prior information about x that may have dependencies among entries. While block-separable functions have been considered

in some specific cases [58, 103], more general estimation functions are needed to capture more

sophisticated prior information. This work provides performance analysis of AMP when a class

of non-separable sliding-window estimation functions is applied and x has a Markov random

field prior. Markov random fields are widely used in many image processing problems, especially

for texture images [30, 40].

The aforementioned theoretical results [7, 58, 103, 104] are achieved by characterizing the

probability distribution of st when applied as an argument of the estimation function ηt com-

posed with the a loss function such as the quadratic function for squared error. Specifically, st

is shown to be close in distribution to the true signal x plus i.i.d. Gaussian noise, in a sense that

will be made clear in Chapter 2. For this reason, the estimation functions {ηt}t≥0 are referred

to as denoisers. The availability of such statistical characterization of st, which is the key differ-

ence between AMP and other inverse methods, has inspired AMP-based algorithms in various

applications such as compressive imaging [110], hyperspectral unmixing [122], and sparse su-


perposition codes decoding [103]. This work proposes AMP-based algorithms in hyperspectral

imaging, universal compressed sensing (imaging), and multiprocessor computing.

1.2 Nonlinear Diffractive Imaging via Optimization

In diffractive imaging, an object is illuminated by some incident wave (light), and the wave

is scattered when it passes through the object. The goal is to reconstruct the object, more

precisely, the electric permittivity of the object, from the scattered wave measurements. Con-

ventional methods usually rely on linearizing the relationship between the permittivity and

the measurements. For example, the first Born [16] and the Rytov [34] approximations are

commonly adopted in diffraction tomography [21, 68, 112]. However, linear models are highly

inaccurate when the physical size of the object is large or the permittivity contrast of the object

compared to the background is high [25]. Therefore, in order to image strongly scattering ob-

jects such as human tissue [92], nonlinear formulations that can model multiple scattering need

to be considered. The challenge is then to develop fast, memory-efficient, and reliable inverse

methods that can account for the nonlinearity. This work proposes an inverse algorithm for fast

and memory-efficient nonlinear diffractive imaging with rigorous convergence analysis.

A standard way of solving the inverse scattering problem is via optimization, where a sequence

of estimates is generated by minimizing a cost function. For ill-posed problems with an additive

measurement noise model, a cost function usually consists of a quadratic data-fidelity term and

a regularization term, which incorporates prior information such as transform-domain sparsity.

The challenge of such a formulation for nonlinear diffractive imaging is that the data-fidelity

term is nonconvex due to the nonlinearity and that sparsity-promoting regularizers are usually

nondifferentiable. For such nonsmooth and nonconvex problems, the proximal gradient method,

also known as the iterative shrinkage/thresholding algorithm (ISTA) [10, 32, 43], is a natural choice and enjoys convergence guarantees. However, it usually converges slowly. The fast iterative shrinkage/thresholding algorithm (FISTA) [8] is an accelerated variant of ISTA that has been proven to converge quickly for convex problems. Unfortunately, its convergence analysis for nonconvex problems has not been established. This work proposes a relaxed variant of FISTA for such nonsmooth and nonconvex problems and provides its convergence guarantee.
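For reference, the sketch below shows the standard ISTA/FISTA proximal-gradient updates for the convex prototype problem min_x ½‖y − Ax‖² + λ‖x‖₁. It is only a generic illustration of the algorithm family discussed above: it is not the relaxed FISTA variant proposed in this dissertation, and it does not use the nonlinear scattering model.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, y, lam, L, num_iters=200):
    """Standard FISTA for min_x 0.5*||y - Ax||^2 + lam*||x||_1.

    L is (an upper bound on) the Lipschitz constant of the gradient of the
    smooth term, e.g. the largest eigenvalue of A^T A.
    """
    x = np.zeros(A.shape[1])
    s, q = x.copy(), 1.0
    for _ in range(num_iters):
        grad = A.T @ (A @ s - y)                          # gradient of the smooth term at s
        x_new = soft_threshold(s - grad / L, lam / L)     # proximal gradient (ISTA) step
        q_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * q ** 2))
        s = x_new + ((q - 1.0) / q_new) * (x_new - x)     # momentum (acceleration) step
        x, q = x_new, q_new
    return x
```

The soft-thresholding step is the proximal operator of the ℓ1 regularizer, and the momentum sequence q is what distinguishes FISTA from plain ISTA.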

1.3 Dissertation Organization

The rest of the dissertation is organized as follows. Chapter 2 provides a state evolution anal-

ysis of AMP with a class of non-separable sliding-window denoisers ηt when the matrix A has


i.i.d. Gaussian entries and the unknown signal x has a Markov random field prior. Chapter 3

presents the application of AMP with non-separable denoisers to hyperspectral imaging and

universal compressed sensing (imaging), where we study the empirical performance of the pro-

posed AMP-based algorithms by comparing them to several state-of-the-art algorithms with

extensive numerical examples. Chapter 4 introduces two methods for fast computing. The first

method is a multiprocessor implementation of AMP with column-wise partitioning of the matrix

A. We provide a state evolution analysis for our column-wise multiprocessor AMP algorithm.

The second method is a two-part framework, where Part 1 uses a sparse sensing matrix for

fast detection of zero-valued entries in x, and Part 2 uses a dense sensing matrix and applies

standard linear inverse algorithms such as AMP to reconstruct the remaining entries. Chapter

5 proposes a nonlinear inverse method for diffractive imaging based on a nonconvex optimiza-

tion formulation. The nonconvex solver used in the proposed method is our relaxed variant

of FISTA. We provide a fast and memory-efficient implementation and rigorous convergence

analysis of the proposed method. Finally, Chapter 6 concludes the dissertation and discusses

future work.

1.4 Notation

For an array x ∈ R^Γ for some Γ ⊂ Z^p with p = 1, 2, 3,

$$\|x\| := \sqrt{\sum_{i\in\Gamma} x_i^2}.$$

Hence, if x is a matrix, ‖x‖ denotes the Frobenius norm; the operator norm of a matrix x is denoted by ‖x‖op.

For a vector x ∈ C^N, diag(x) ∈ C^{N×N} is a diagonal matrix with x on the diagonal.

A set of successive integers {1, ..., N} is denoted by [N].

We use both e^x and exp(x) to denote the natural exponential function.

Throughout the dissertation, we consider the probability space (Ω, F, P). For a random variable X defined on (Ω, F, P), E[X] denotes the expected value of X.

A Gaussian distribution with mean µ and variance σ² is denoted by N(µ, σ²).

A random variable X with distribution measure px is denoted by X ∼ px.
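As a quick sanity check of the norm conventions above, the following snippet compares the Frobenius norm ‖x‖ with the operator norm ‖x‖op for a small hypothetical matrix.

```python
import numpy as np

# Hypothetical 2D array used only to illustrate the two norm conventions.
x = np.arange(6.0).reshape(2, 3)
frob = np.sqrt((x ** 2).sum())           # ||x|| as defined above (Frobenius norm)
op = np.linalg.norm(x, ord=2)            # ||x||_op (largest singular value)
print(frob, np.linalg.norm(x, 'fro'), op)
```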


Chapter 2

State Evolution Analysis of Approximate Message Passing with Non-Separable Denoisers

The approximate message passing (AMP) algorithm¹ was initially proposed [39] and analyzed [7, 104] in the context of compressed sensing [38] to estimate an unknown vector x ∈ R^N from linear measurements y ∈ R^n obtained from (1.2) using separable denoisers {ηt}t≥0 : R → R that act coordinate-wise when applied to a vector. Starting with x⁰ = 0, an all-zero vector, for iteration index t ≥ 0, AMP proceeds as follows:

$$z^t = y - Ax^t + \frac{z^{t-1}}{n}\sum_{i=1}^{N} \eta'_{t-1}\big([A^* z^{t-1} + x^{t-1}]_i\big), \qquad (2.1)$$

$$x_i^{t+1} = \eta_t\big([A^* z^t + x^t]_i\big), \quad \forall i \in [N], \qquad (2.2)$$

where η′t denotes the derivative of ηt, A* denotes the transpose of A, and quantities with negative iteration indices are set to zero. Under the assumption that A has i.i.d. Gaussian entries, x has i.i.d. sub-Gaussian entries according to a probability distribution px, and w has i.i.d. sub-Gaussian entries with zero-valued mean and variance σw², Rush and Venkataramanan [104] established the following performance guarantee for the above AMP algorithm with separable denoisers, which implies an earlier asymptotic result proved by Bayati and Montanari [7]. For any (order-2) pseudo-Lipschitz function² φ : R² → R, ε ∈ (0, 1), and t ≥ 0,

$$P\left(\left|\frac{1}{N}\sum_{i=1}^{N} \phi\big(x_i^{t+1}, x_i\big) - \mathbb{E}\big[\phi(\eta_t(X + \tau_t Z), X)\big]\right| \geq \epsilon\right) \leq K_t e^{-\kappa_t N \epsilon^2},$$

where δ = n/N, Kt, κt > 0 are constants that do not depend on N or ε, but may depend on t, X ∼ px is independent of Z ∼ N(0, 1), and τt is defined recursively as follows. Let τ₀² = σw² + (1/δ)E[X²] and, for t ≥ 0, define

$$\tau_{t+1}^2 = \sigma_w^2 + \frac{1}{\delta}\,\mathbb{E}\left[\big(\eta_t(X + \tau_t Z) - X\big)^2\right]. \qquad (2.3)$$

¹The work in this chapter was joint with Cynthia Rush and Dror Baron [81, 82]; it was funded by the National Science Foundation under grants CCF-1217749 and ECCS-1611112.
²A function f : R^m → R is said to be (order-2) pseudo-Lipschitz if there exists a constant L ∈ (0, ∞) such that |f(x) − f(y)| ≤ L(1 + ‖x‖ + ‖y‖)‖x − y‖ for all x, y ∈ R^m.
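The following Python sketch illustrates the separable-denoiser AMP iteration (2.1)–(2.2) with a simple soft-thresholding denoiser. The threshold rule and all parameter choices are illustrative assumptions made for this example; they are not the denoisers analyzed in this dissertation.

```python
import numpy as np

def soft(u, tau):
    """Soft-thresholding denoiser eta_t(u) with threshold tau."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def amp_soft_threshold(A, y, num_iters=30, alpha=1.5):
    """Sketch of AMP (2.1)-(2.2) with a separable soft-thresholding denoiser.

    The threshold alpha*tau_t uses the effective noise level tau_t estimated
    empirically from the residual; alpha is a tuning parameter chosen only
    for illustration.
    """
    n, N = A.shape
    x, z = np.zeros(N), y.copy()
    for _ in range(num_iters):
        tau = np.sqrt(np.mean(z ** 2))                   # empirical estimate of tau_t
        s = A.T @ z + x                                  # effective observation
        x_new = soft(s, alpha * tau)                     # denoising step (2.2)
        onsager = (z / n) * np.count_nonzero(x_new)      # sum of eta'_t = number of nonzeros
        z = y - A @ x_new + onsager                      # residual update (2.1)
        x = x_new
    return x
```

Here the Onsager correction in (2.1) uses the fact that the derivative of the soft-thresholding function equals 1 exactly where its output is nonzero.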

If the unknown signal x has a prior distribution assuming i.i.d. coordinates, restricting

consideration to only separable denoisers causes no loss in performance. However, in many real-

world applications, the unknown signal x contains dependencies between entries and therefore

a coordinate-wise independence structure is not a good approximation for the prior of x. For

example, when the signals are images [88, 118], non-separable denoisers outperform reconstruc-

tion techniques based on over-simplified i.i.d. models. In such cases, a more appropriate model

might be a finite memory model, well-approximated with a Markov random field prior. In this

work, we extend the previous performance guarantees for AMP to a class of non-separable

sliding-window denoisers when the unknown signal has a Markov random field prior. Sliding-

window schemes have been studied for denoising signals with dependencies among entries by,

for example, Sivaramakrishnan and Weissman [107, 108].

2.1 Definition of the Algorithm

Notation: Before introducing the algorithm, we provide some notation that is used to define

the sliding window in the sliding-window denoiser. Without loss of generality, we let the index

set Γ ⊂ Zp, on which the input signal x in (1.2) is defined, be

$$\Gamma = \begin{cases} [N], & \text{if } p = 1, \\ [N] \times [N], & \text{if } p = 2, \\ [N] \times [N] \times [N], & \text{if } p = 3. \end{cases} \qquad (2.4)$$


Similarly, let Λ be a p-dimensional cube in Zp with length (2k + 1) in each dimension, namely,

$$\Lambda := \begin{cases} [2k+1], & \text{if } p = 1, \\ [2k+1] \times [2k+1], & \text{if } p = 2, \\ [2k+1] \times [2k+1] \times [2k+1], & \text{if } p = 3, \end{cases} \qquad (2.5)$$

where 2k + 1 ≤ N . We call k the half-window size.

AMP with sliding-window denoisers: The AMP algorithm for estimating x from y and

A in (1.2) generates a sequence of estimates {xt}t≥0, where xt ∈ R^Γ, t is the iteration index,

and the initialization x0 := 0 is an all-zero array with the same dimension as the input signal

x. For t ≥ 0, the algorithm proceeds as follows:

$$z^t = y - A\,\mathcal{V}(x^t) + \frac{z^{t-1}}{n}\sum_{i\in\Gamma} \eta'_{t-1}\Big(\big[\mathcal{V}^{-1}(A^* z^{t-1}) + x^{t-1}\big]_{\Lambda_i}\Big), \qquad (2.6)$$

$$x_i^{t+1} = \eta_t\Big(\big[\mathcal{V}^{-1}(A^* z^t) + x^t\big]_{\Lambda_i}\Big), \quad \text{for all } i \in \Gamma, \qquad (2.7)$$

where the denoisers {ηt}t≥0 : R^Λ → R now act on windows of the signal, η′t−1 is the partial derivative w.r.t. the center coordinate of the argument, and Λi for each i ∈ Γ is the p-dimensional cube Λ translated to be centered at location i. The translated p-dimensional cubes {Λi}i∈Γ are referred to as sliding windows, which will be used to subset a vector, a matrix, or a 3D array. The effective

observation at iteration t is V⁻¹(A*z^t) + x^t ∈ R^Γ, which can be approximated as the true signal x

plus i.i.d. Gaussian noise (in a sense that will be made clear in the statement of our main result,

Theorem 2.2.1). Note that the sliding windows {Λi}i∈Γ and the sliding-window denoiser ηt are

defined on multidimensional signals, hence we use the inverse of the vectorization operator,

V−1, to rearrange elements of vectors into arrays before applying the sliding-window denoiser

ηt. It should also be noted that the denoiser ηt may only process part of the signal elements in

Λ. For example, in the 2D case, if Λ is defined as a 3× 3 window, then ηt may only process the

center and the four adjacent pixel values in the window (see Figure 2.1) and ignore the four

corners. To simplify notation, we will write ηt : RΛ → R throughout the chapter, and interpret

this notation to mean that any processing of neighboring signal values is allowed, including the

possibility of ignoring some of their values.
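As a concrete (and purely illustrative) instance of such a non-separable denoiser, the sketch below applies a toy plus-shaped window rule, which uses only the center pixel and its four neighbors as in Figure 2.1, to every interior location of a 2D effective observation. The weights are made up for this example and are not the Bayesian sliding-window denoisers studied later in this work.

```python
import numpy as np

# Toy non-separable sliding-window denoiser eta_t: R^Lambda -> R for p = 2, k = 1.
# It looks only at the center pixel and its four neighbors (the gray pixels of
# Figure 2.1); the weights are illustrative assumptions.
def eta_plus(window):
    c = window[1, 1]
    nbrs = window[0, 1] + window[2, 1] + window[1, 0] + window[1, 2]
    return 0.5 * c + 0.125 * nbrs

def denoise_interior(s, eta, k=1):
    """Apply eta to the window Lambda_i at every interior location i (Gamma_mid)."""
    out = s.copy()
    N1, N2 = s.shape
    for i in range(k, N1 - k):
        for j in range(k, N2 - k):
            out[i, j] = eta(s[i - k:i + k + 1, j - k:j + k + 1])
    return out
```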

Edge cases: Notice that when the center coordinate i is near the edges of Γ, some of the

elements in Λi may fall outside Γ, meaning that Λi ∩ Γ^c ≠ ∅, where Γ^c is the complement of Γ

with respect to (w.r.t.) Zp. Based on whether Λi has elements outside Γ, we partition the index


Figure 2.1 For Λ of size 3 × 3, the denoiser ηt : R^Λ → R may only process the pixels in gray (the center and the four adjacent pixels).

set Γ into two sets Γmid and Γedge defined as:

$$\Gamma_{\mathrm{mid}} := \{i \in \Gamma \,|\, \Lambda_i \cap \Gamma^c = \emptyset\}, \quad \text{and} \quad \Gamma_{\mathrm{edge}} := \{i \in \Gamma \,|\, \Lambda_i \cap \Gamma^c \neq \emptyset\}. \qquad (2.8)$$

That is, for i ∈ Γmid , all elements in Λi are inside Γ, whereas for i ∈ Γedge , some of the elements

in Λi fall outside Γ.

For any v ∈ RΓ, let vΛi be a subset of the elements of v with indices in Λi. Notice that for

i ∈ Γmid , all entries of vΛi are well-defined. For i ∈ Γedge , vΛi has undefined entries, namely,

for all j ∈ Λi ∩ Γc, vj is not defined. We now define the value of those “missing” entries to be

the average of the entries of vΛi with indices in Γ. Formally,

$$v_j := \frac{1}{|\Lambda_i \cap \Gamma|}\sum_{\ell \in \Lambda_i \cap \Gamma} v_\ell, \quad \forall j \in \Lambda_i \cap \Gamma^c. \qquad (2.9)$$

Notice that vΛi for all i ∈ Γ are now defined by the entries in the original v ∈ RΓ. To emphasize

this point, which will be useful in the proof for our main result, we define a set of operators {Ti}i∈Γ with Ti : R^{Λi∩Γ} → R^Λ defined as

$$T_i(v_{\Lambda_i\cap\Gamma}) := v_{\Lambda_i}, \quad \forall i \in \Gamma, \qquad (2.10)$$

where vΛi follows our definition above for all i ∈ Γ. That is, Ti is identity for i ∈ Γmid , whereas

for i ∈ Γedge , Ti extends a smaller array vΛi∩Γ to a larger one vΛi with the extended entries

defined by (2.9).

Examples for defining “missing” entries: To illustrate the notation defined above, we present an example for the p = 1 case (hence v ∈ R^N is a vector) below. As defined above in (2.4) and (2.5), we have Γ = {1, ..., N}, Λ = {1, ..., 2k+1}, and Λi = (i−k, ..., i−1, i, i+1, ..., i+k) for each i ∈ [N]. Moreover, Γmid = {k+1, k+2, ..., N−k} and Γedge = {1, 2, ..., k} ∪ {N−k+1, N−k+2, ..., N} as defined in (2.8). Therefore, for all i ∈ Γmid, we have

$$v_{\Lambda_i} = T_i(v_{i-k}, v_{i-k+1}, \ldots, v_{i+k}) = (v_{i-k}, v_{i-k+1}, \ldots, v_{i+k}) \in \mathbb{R}^{2k+1}.$$

For i ∈ Γedge, the vector vΛi is still of length 2k+1, and we set the values of the non-positive indices, i.e., 1−k, 2−k, ..., −1, 0, or indices above N, i.e., N+1, N+2, ..., N+k, to be the average value of the vector vΛi with indices in Λi ∩ [N]. For example, let i = 3 and k = 5; then Λ₃ = (−2, −1, 0, 1, ..., 8). Following (2.9), define

$$\bar{v} = \frac{1}{8}\sum_{j=1}^{8} v_j, \quad \text{and set} \quad v_{-2} = v_{-1} = v_0 = \bar{v},$$

hence vΛ₃ = T₃(v₁, ..., v₈) := (v̄, v̄, v̄, v₁, ..., v₈) ∈ R^{11}. An example for the p = 2 case (hence v ∈ R^{N×N} is a matrix) is shown in Figure 2.2.

Figure 2.2 Illustration of the definition of “missing” entries in a sliding window in Z². The matrix v ∈ R^{4×4}. The half-window size is k = 1, thus Λ = [3] × [3]. For the window Λ_{(1,1)} centered at coordinate (1, 1), the “existing” entries in the window are v_{1,1}, v_{1,2}, v_{2,1}, v_{2,2}, shown in dark gray. The five entries in light gray are missing, hence we define their value to be the average of the existing ones, v̄ = (v_{1,1} + v_{1,2} + v_{2,1} + v_{2,2})/4.
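A small sketch of the edge handling in (2.9)–(2.10) for the 1D case, using 0-based array indices; the test vector is hypothetical, and the printed window reproduces the (v̄, v̄, v̄, v₁, ..., v₈) pattern from the example above.

```python
import numpy as np

def extend_window(v, i, k):
    """Sketch of the operator T_i in (2.10) for p = 1 (0-based indexing).

    Entries of the window Lambda_i that fall outside the vector are set to the
    average of the entries that fall inside, as in (2.9).
    """
    N = len(v)
    inside = v[max(i - k, 0):min(i + k + 1, N)]
    fill = inside.mean()
    window = np.full(2 * k + 1, fill)
    for offset in range(-k, k + 1):
        if 0 <= i + offset < N:
            window[offset + k] = v[i + offset]
    return window

# The example from the text: i = 3, k = 5 (0-based index 2), so the window covers
# indices -2..8 and the three missing entries get the average of v_1, ..., v_8.
v = np.arange(1.0, 13.0)              # hypothetical vector v_1, ..., v_12
print(extend_window(v, 2, 5))         # (v_bar, v_bar, v_bar, v_1, ..., v_8)
```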


2.2 Performance Analysis

2.2.1 Definitions and Assumptions

First we include some definitions relating to Markov random fields (MRFs) that will be used to

state our assumptions on the unknown signal x. These definitions can be found in standard

textbooks such as [46]; we include them here for convenience.

Definition 2.2.1. Let (Ω,F , P ) be a probability space. A random field is a collection of random

variables X = {Xi}i∈Γ defined on (Ω, F, P) having spatial dependencies, where Xi : Ω → E for some measurable state space (E, ℰ) and Γ ⊂ Z^p is a non-empty and finite subset of the

infinite lattice Zp. We think of Γ as a collection of spatial locations. Denote the qth-order

neighborhood of location i ∈ Γ by $\mathcal{N}_i^q$; that is, $\mathcal{N}_i^q \subset \Gamma$ is the collection of location indices at a distance less than or equal to q from i, but not including i. Formally,

$$\mathcal{N}_i^q = \big\{\, j \in \Gamma \setminus \{i\} \;\big|\; \|i - j\|_2 \leq q \,\big\}.$$

Following these definitions, X is said to be a qth-order MRF if, for all i ∈ Γ (writing i = (i₁, ..., i_p)) and for all measurable subsets B ∈ ℰ, we have

$$P\big(X_i \in B \,\big|\, X_j,\ j \in \Gamma \setminus \{i\}\big) = P\big(X_i \in B \,\big|\, X_j,\ j \in \mathcal{N}_i^q\big),$$

and for all B ∈ ℰ^Γ we have P(X ∈ B) > 0. The second condition, positivity, ensures that the joint distribution of an MRF is a Gibbs distribution by the Hammersley-Clifford theorem [50].

Let µ denote the distribution measure of X, namely for all B ∈ ℰ^Γ we have P(X ∈ B) = µ(B), and let µΛ denote the distribution measure of XΛ := {Xi}i∈Λ for Λ ⊂ Γ. For any i ∈ Γ, define the set i + Λ := {i + j | j ∈ Λ}. Then the random field is said to be stationary if for all i ∈ Γ such that i + Λ ⊂ Γ, it is true that µΛ = µ_{i+Λ}.

Next we introduce the Dobrushin uniqueness condition, which ensures that the random

field mixes sufficiently fast and leads to a unique stationary Gibbs distribution. Define the Do-

brushin interdependence matrix (C_{i,j})_{i,j∈Γ} for the distribution measure µ of the random field X to be

$$C_{i,j} := \sup_{\substack{\xi,\, \xi' \in E^\Gamma \\ \xi_{j^c} = \xi'_{j^c}}} \big\|\mu_i(\cdot \,|\, \xi) - \mu_i(\cdot \,|\, \xi')\big\|_{tv}. \qquad (2.11)$$

In the above, the index set j^c := Γ \ {j}, and the total variation distance ‖·‖tv between two probability measures ρ₁ and ρ₂ on (E, ℰ) is defined as

$$\|\rho_1(\cdot) - \rho_2(\cdot)\|_{tv} := \max_{A \in \mathcal{E}} |\rho_1(A) - \rho_2(A)|.$$

Note that if E is countable, then

$$\|\rho_1(\cdot) - \rho_2(\cdot)\|_{tv} = \frac{1}{2}\sum_{x \in E} |\rho_1(x) - \rho_2(x)|. \qquad (2.12)$$

The measure µ is said to satisfy the Dobrushin uniqueness condition if

$$c := \sup_{i\in\Gamma} \sum_{j\in\Gamma} C_{i,j} < 1.$$

The Dobrushin contraction coefficient, c, is a quantity that estimates the magnitude of

change of the single site conditional expectations, as they appear in (2.11), when the field values

at the other sites vary. Similarly, we define the transposed Dobrushin contraction condition as

$$c^* := \sup_{j\in\Gamma} \sum_{i\in\Gamma} C_{i,j} < 1.$$
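To make these quantities concrete, the sketch below computes the Dobrushin interdependence entries (2.11) for a stationary binary Markov chain, the simplest 1D MRF, using the countable-state total variation formula (2.12). The transition matrix is an illustrative choice and is not a model used in this dissertation.

```python
import numpy as np
from itertools import product

# Illustrative binary Markov chain (a first-order 1D MRF); P[a, b] = P(X_{i+1}=b | X_i=a).
P = np.array([[0.7, 0.3],
              [0.3, 0.7]])

def cond(a, b):
    """P(X_i = . | X_{i-1} = a, X_{i+1} = b), proportional to P[a, x] * P[x, b]."""
    p = np.array([P[a, 0] * P[0, b], P[a, 1] * P[1, b]])
    return p / p.sum()

def tv(p, q):
    """Total variation distance (2.12) for a countable (here binary) state space."""
    return 0.5 * np.abs(p - q).sum()

# Worst-case change in the conditional at site i when only one neighbor changes, as in (2.11).
# For a first-order chain, C_{i,j} = 0 whenever j is not a neighbor of i.
C_right = max(tv(cond(a, b), cond(a, b2)) for a, b, b2 in product(range(2), repeat=3))
C_left = max(tv(cond(a, b), cond(a2, b)) for b, a, a2 in product(range(2), repeat=3))

c = C_left + C_right   # sup_i sum_j C_{i,j} for an interior site
print(C_left, C_right, c, "uniqueness condition holds" if c < 1 else "condition fails")
```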

We can now state our assumptions on the signal x, the matrix A, and the noise w in the

linear system (1.2), as well as the denoiser function ηt used in the algorithm (2.6) and (2.7).

Signal: Let E ⊂ R be a bounded state space (countable or uncountable). Let x = {xi}i∈Γ

be a stationary MRF with Gibbs distribution measure µ on EΓ, where Γ ⊂ Zp with p = 1, 2, 3 is

a finite and nonempty rectangular lattice. We assume that µ satisfies the Dobrushin uniqueness

condition and the transposed Dobrushin contraction condition as defined in Section 2.2.1. The class of

finite state space stationary MRFs, which is widely used for image analysis [73], is one example

that satisfies our assumption.

Denoisers: The denoisers ηt : RΛ → R used in (2.7) are assumed to be Lipschitz3 for each

t > 0 and are, therefore, weakly differentiable with bounded (weak) partial derivatives. We

further assume that the partial derivative w.r.t. the center coordinate of Λ, which is denoted

by η′t : RΛ → R, is itself differentiable with bounded partial derivatives. Note that this implies

η′t is Lipschitz. (It is possible to weaken this condition to allow η′t to have a finite number of

discontinuities, if needed, as in [104].)

Matrix: The entries of the matrix A are i.i.d. with distribution N (0, 1/n).

3A function f : Rm → R is Lipschitz if there exists a constant L > 0 such that for all x,y ∈ Rm,|f(x)− f(y)| ≤ L ‖x− y‖.


Noise: The entries of the measurement noise vector w are i.i.d. according to some sub-

Gaussian distribution pw with mean 0 and finite variance σ2w. The sub-Gaussian assumption

implies [17] that for all ε > 0, there exist some constants K,κ > 0 such that

$$P\left(\left|\frac{1}{n}\|w\|^2 - \sigma_w^2\right| \geq \epsilon\right) \leq K e^{-\kappa n \epsilon^2}. \qquad (2.13)$$
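The numerical examples later in this chapter use images generated from a finite-state MRF. The following Gibbs-sampling sketch shows how one such binary, first-order MRF sample can be drawn; the Ising-type potentials, toroidal boundary, and all parameter values are illustrative assumptions and are not the exact model of Section 2.2.3.1.

```python
import numpy as np

def gibbs_sample_ising(N=64, beta=0.35, num_sweeps=200, seed=0):
    """Draw an approximate sample from a first-order binary MRF (Ising-type) on an
    N x N lattice via Gibbs sampling. beta, the boundary handling, and the sweep
    count are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(N, N)) * 2 - 1          # spins in {-1, +1}
    for _ in range(num_sweeps):
        for i in range(N):
            for j in range(N):
                nbr = (x[(i - 1) % N, j] + x[(i + 1) % N, j]
                       + x[i, (j - 1) % N] + x[i, (j + 1) % N])
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * nbr))   # P(x_ij = +1 | neighbors)
                x[i, j] = 1 if rng.random() < p_plus else -1
    return (x + 1) // 2                                   # return a {0, 1}-valued image

# Example: a small 0/1 texture-like image that could play the role of the unknown signal x.
img = gibbs_sample_ising(N=32, num_sweeps=50)
```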

2.2.2 Main Result

As noted in Section 1.1, the behavior of the AMP algorithm is predicted by a deterministic

scalar recursion referred to as state evolution, which is now formally introduced here. More

specifically, the state evolution sequences {τt²}t≥0 and {σt²}t≥0 defined below in (2.14) will be used in Theorem 2.2.1 to characterize the estimation error of the estimates produced by AMP. Let the probability measure µ define the (stationary) prior distribution for the unknown signal x in (1.2). Then by our assumption of stationarity, we have xi ∼ µ₁ for all i ∈ Γ and xΛi ∼ µΛ for all i ∈ Γmid with Γmid defined in (2.8), where µ₁ and µΛ denote the one-dimensional marginal and Λ-dimensional marginal of µ, respectively. Define σx² = E[x₁²] > 0 and σ₀² = σx²/δ. Iteratively define {τt²}t≥0 and {σt²}t≥1 as follows,

$$\tau_t^2 = \sigma_w^2 + \sigma_t^2 \quad \text{and} \quad \sigma_t^2 = \frac{1}{\delta|\Gamma|}\sum_{i\in\Gamma} \mathbb{E}\left[\big(\eta_{t-1}(x_{\Lambda_i} + \tau_{t-1} Z_{\Lambda_i}) - x_i\big)^2\right], \qquad (2.14)$$

where ηt : R^Λ → R is the sliding-window denoiser and Z = {Zi}i∈Γ has i.i.d. N(0, 1) entries and is independent of x. We notice that for all i ∈ Γmid, xΛi is equal in distribution to x′, where x′ = {x′i}i∈Λ ∼ µΛ, and [Z]Λi is equal in distribution to Z′, where Z′ = {Z′i}i∈Λ has i.i.d. N(0, 1) entries. Therefore, for all i ∈ Γmid, the expectations in (2.14) satisfy

$$\mathbb{E}\left[\big(\eta_{t-1}(x_{\Lambda_i} + \tau_{t-1} Z_{\Lambda_i}) - x_i\big)^2\right] = \mathbb{E}\left[\big(\eta_{t-1}(x' + \tau_{t-1} Z') - x'_c\big)^2\right],$$

where x′c is the center coordinate of x′. For i ∈ Γedge with Γedge defined in (2.8), it is not necessarily true that xΛi is equal in distribution to x′, because following the definition in (2.9) some of the entries of xΛi are defined as the average of other entries.
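Since the expectation in (2.14) rarely has a closed form for a non-separable denoiser, one way to evaluate a state evolution step in practice is by Monte Carlo, drawing windows x′ from the signal prior and i.i.d. Gaussian noise Z′. The sketch below does this for interior windows only; the way the prior patches are obtained, the denoiser (e.g. the toy plus-shaped rule from the earlier sketch), and all parameters are illustrative assumptions.

```python
import numpy as np

def se_step_monte_carlo(patches, eta, tau_prev, sigma_w2, delta, num_draws=2000, seed=1):
    """Monte Carlo estimate of one state evolution step (2.14), interior windows only:

        sigma_t^2 ~ (1/delta) * E[(eta(x' + tau_{t-1} Z') - x'_c)^2],
        tau_t^2   = sigma_w^2 + sigma_t^2.

    `patches` is an array of sample windows x' drawn from the signal prior
    (for instance, (2k+1) x (2k+1) patches cut from an MRF realization), and
    `eta` maps a window to an estimate of its center entry.
    """
    rng = np.random.default_rng(seed)
    k = patches.shape[1] // 2
    idx = rng.integers(0, len(patches), size=num_draws)
    errs = []
    for p in patches[idx]:
        noisy = p + tau_prev * rng.standard_normal(p.shape)   # x' + tau_{t-1} Z'
        errs.append((eta(noisy) - p[k, k]) ** 2)              # squared error at the center
    sigma_t2 = np.mean(errs) / delta
    return sigma_w2 + sigma_t2                                # tau_t^2
```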

The explicit expression for σt² in (2.14) is different when considering Γ ⊂ Z^p for different values of p, because the size and the patterns of edges and corners of the set Λi for i ∈ Γedge depend on the dimension. In the following, we provide explicit expressions for σt² for the cases p = 1, 2; however, in the proof we will use the general expression given in (2.14) for brevity. We emphasize that the definition of the state evolution sequence in (2.14) only uses

the Λ-dimensional marginal measure µΛ instead of the joint measure µ, as demonstrated in the

two examples below in (2.15) and (2.16).

Let x′c be the center coordinate of x′ and Λc the window Λ ⊂ Zp translated with center

c ∈ Zp. Recall that Λ is the p-dimensional cube with length (2k+1) in each of the p dimensions.

Then we have x′ = x′_{Λc}, and when we consider shifts x′_{Λc+ℓ} for ℓ ∈ {−k, −k+1, ..., k−1, k} we, analogously to the definition in (2.9), define “missing” entries to be replaced by the average of the existing entries. (Note that x′ is exactly of size Λ, thus for any ℓ ≠ 0 there will be “missing” entries.) For example, when p = 1,

$$x'_{\Lambda_c} = (x'_1, x'_2, \ldots, x'_{2k+1}), \quad \text{while} \quad x'_{\Lambda_c-2} = (\bar{x}, \bar{x}, x'_1, x'_2, \ldots, x'_{2k-1}),$$

where $\bar{x} = \frac{1}{2k-1}\sum_{i=1}^{2k-1} x'_i$. Generalizing, we have the random vector x′_{Λc+ℓ} of length 2k+1 defined as

$$x'_{\Lambda_c+\ell} = \begin{cases} \Big(\tfrac{1}{2k+1+\ell}\sum_{i=1}^{2k+1+\ell} x'_i,\ \ldots,\ \tfrac{1}{2k+1+\ell}\sum_{i=1}^{2k+1+\ell} x'_i,\ x'_1,\ x'_2,\ \ldots,\ x'_{2k+1+\ell}\Big) & \text{if } \ell < 0, \\[4pt] \big(x'_1,\ x'_2,\ \ldots,\ x'_{2k+1}\big) & \text{if } \ell = 0, \\[4pt] \Big(x'_{1+\ell},\ x'_{2+\ell},\ \ldots,\ x'_{2k+1},\ \tfrac{1}{2k+1-\ell}\sum_{i=1+\ell}^{2k+1} x'_i,\ \ldots,\ \tfrac{1}{2k+1-\ell}\sum_{i=1+\ell}^{2k+1} x'_i\Big) & \text{if } \ell > 0. \end{cases}$$

The same idea can be extended easily when p = 2 or p = 3.

For the case p = 1, we note that Γmid = k + 1, k + 2, . . . , N − k − 1 and Γedge =

1, 2, . . . , k ∪ N − k,N − k + 1, . . . , N, hence |Γmid | = N − 2k and |Γedge | = 2k. Therefore,

we have

σ2t =

(N − 2k)

δNE[(ηt−1(x′ + τt−1Z

′)− x′c)2]

+1

δN

∑`∈−k,...,k\0

E[(ηt−1([x′ + τt−1Z]′Λc+`)− x

′c+`

)2], (2.15)

where −k, . . . , k \ 0 = −k, . . . ,−1 ∪ 1, . . . , k. In the above the first term correspond to

the N − 2k middle indices, while the second term sums over 2k terms, which correspond to all

the possible edge cases.

For the case p = 2, we note that Γmid = (i, j) | k + 1 ≤ i, j ≤ N − k + 1, hence |Γmid | =(N − 2k)2. Here we denote ` = (`1, `2) ∈ −k,−k+ 1, . . . , k− 1, k×−k,−k+ 1, . . . , k− 1, k.

13

Page 24: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Therefore,

σ2t =

(N − 2k)2

δN2E[(ηt−1(x′ + τt−1Z

′)− x′c)2]

+1

δN2

∑`1,`2∈−k,...,k\0

E[(ηt−1([x′ + τt−1Z

′]Λc+`)− x′c+`

)2]+

(N − 2k)

δN2

∑`1∈−k,...,k\0

`2=0

E[(ηt−1([x′ + τt−1Z

′]Λc+`)− x′c+`

)2]

+(N − 2k)

δN2

∑`2∈−k,...,k\0

`1=0

E[(ηt−1([x′ + τt−1Z

′]Λc+`)− x′c+`

)2], (2.16)

where we notice that there are (2k)2 terms in the second summand and 2k terms in the third

and fourth summands, and that (N−2k)2

N2 + (2k)2

N2 + 2k(N−2k)N2 + 2k(N−2k)

N2 = 1. Again, in the above

the first term sums over all the middle indices. In this case, the second term corresponds to the

corner edge cases, while the third and fourth terms correspond to the edge cases in one dimension

only. Note that σ2t is a function of N , but we do not explicitly represent this relationship to

simplify the notation. Note also that for fixed k, the terms (2k)2

N2 , 2k(N−2k)N2 , and 2k(N−2k)

N2 vanish

as N goes to infinity. Therefore, we have limN→∞ σ2t (N) = 1

δE[(ηt−1(x′ + τt−1Z

′)− x′c)2].

Similar to [104], our performance guarantee, Theorem 2.2.1, is a concentration inequality

for PL(2) loss functions at any fixed iteration t < T ∗, where T ∗ is the first iteration when

either (σ⊥t )2 or (τ⊥t )2 defined in (2.37) is smaller than a predefined quantity ε. The precise

definition of (σ⊥t )2 and (τ⊥t )2 is deferred to Section 2.3.2. For now, we can understand (σ⊥t )2

(respectively, (τ⊥t )2) as a number that quantifies (in a probability sense) how close an estimate

xt (respectively, a residual zt) is in the subspace spanned by the previous estimates xss<t(respectively, the previous residuals zss<t).

Theorem 2.2.1. Under the assumptions stated in Section 2.2.1, and for fixed half window-size

k > 0, then for any (order-2) pseudo-Lipschitz function φ : R2 → R, ε ∈ (0, 1), and 0 ≤ t < T ∗,

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ(xt+1

i , xi)− E [φ(ηt(xΛi + τtZΛi), xi)])∣∣∣∣∣ ≥ ε

)≤ Kk,te

−κk,tnε2 , (2.17)

where x = xii∈Γ is an MRF with distribution measure µ on EΓ, Z = Zii∈Γ has i.i.d.

N (0, 1) entries and is independent of x, and the deterministic quantity τt is defined in (2.14).

Constants Kk,t, κk,t > 0 do not depend on n or ε, but do depend on k and t. Their values are

14

Page 25: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

not explicitly specified.

Proof. See Section 2.3.

Remarks:

(1) The probability in (2.17) is w.r.t. the product measure on the space of the matrix A,

signal x, and noise w.

(2) By choosing pseudo-Lipschitz loss function to be φ(a, b) = (a− b)2, Theorem 2.2.1 gives

the following concentration result for the mean squared error of the estimates. For any t ≥ 0,

P

(∣∣∣∣ 1

|Γ|‖xt+1 − x‖2 − δσ2

t+1

∣∣∣∣ ≥ ε) ≤ Kk,te−κk,tnε2 ,

with σ2t+1 defined in (2.14).

2.2.3 Numerical Examples

Before moving to the proof of Theorem 2.2.1, we first demonstrate the effectiveness of the AMP

algorithm with sliding-window denoisers when used to reconstruct an image x from its linear

measurements acquired according to (1.2). We verify that state evolution accurately tracks the

normalized estimation error of AMP, as is guaranteed by Theorem 2.2.1. We use squared error as

the error metric in our examples, which corresponds to the case where the PL(2) loss function φ

in Theorem 2.2.1 is defined as φ(a, b) := (a−b)2. We remind the reader that Theorem 2.2.1 also

supports other PL(2) loss functions. Moreover, we apply AMP with sliding-window denoisers

to reconstruct texture images, which are known to be well-modeled by MRFs in many cases

[30, 40].

2.2.3.1 Verification of state evolution

We consider a class of stationary MRFs on Z2 whose neighborhood is defined as the eight-

nearest neighbors, meaning this is a 2nd-order MRF according to the definition in Section 2.2.1.

The joint distribution of such an MRF on any finite M × N rectangular lattice in Z2 has the

following expression [23]:

µ(a) = P (x = a) =

∏M−1m=1

∏N−1n=1

[am,n am,n+1

am+1,n am+1,n+1

]∏M−1m=2

∏N−1n=2

[am,n

]∏M−1m=2

∏N−1n=1

[am,n am,n+1

]∏M−1m=1

∏N−1n=2

[am,n

am+1,n

] ,

15

Page 26: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Figure 2.3 Numerical example. From left to right: ground-truth image generated by the MRF de-scribed in Section 2.2.3.1, image reconstructed by AMP with a separable Bayesian denoiser (computedfrom the incorrect assumption that the signal is generated from an i.i.d. Bernoulli distribution), andimage reconstructed by AMP with a Bayesian sliding-window denoiser with k = 1, hence Λ = [3]× [3].(Γ = [128]× [128], δ = 0.5, SNR = 17 dB.)

where we follow the notation in [23] for the generic measure

[am,n am,n+1

am+1,n am+1,n+1

]defined as

[am,n am,n+1

am+1,n am+1,n+1

]:= P (xm,n = am,n, xm,n+1 = am,n+1, xm+1,n = am+1,n, xM+1,n+1 = am+1,n+1),

and the conditional distribution of the element in the box given the element(s) not in the box:[am,n am,n+1

am+1,n am+1,n+1

]:= P (xm+1,n+1 = am+1,n+1|xm,n = am,n, xm+1,n = xm+1,n, xm,n+1 = am,n+1).

The generic measure needs to satisfy some consistency conditions to ensure the Markovian prop-

erty and stationarity of the MRF on a finite grid; details can be found in [23]. For convenience

in simulations, we use a Π+ Binary MRF as defined in [23, Definition 7], for which the generic

measure is conveniently parameterized by four parameters, namely,

[1 0 ] = p, [0 1 ] = q,

[0 0

1 0

]= r,

[1 1

1 0

]= s.

In the simulations, we set p = 0.4, q = 0.5, r = 0.01, s = 0.4. Using (2.11) and (2.12), it

can be checked that the distribution measure of this MRF satisfies the Dobrushin uniqueness

condition.

As mentioned previously, an attractive property of AMP, which is formally stated in The-

orem 2.2.1, is the following: for large n, |Γ| and for i ∈ Γ, the observation vector [A∗zt + xt]Λi

16

Page 27: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

0 5 10 15 20 25Iteration

10-4

10-2

100

MS

Estate evolution k=0state evolution k=1empirical

Figure 2.4 Numerical verification that the empirical MSE achieved by AMP with sliding-window de-noisers is tracked by state evolution. The empirical MSE is averaged over 50 realizations of the MRF(as described in Section 2.2.3.1), measurement matrix, and measurement noise. (Γ = [128] × [128],δ = 0.5, SNR = 17 dB.)

used as an input to the estimation function in (2.7) is approximately distributed as x′ + τtZ′,

where x′ ∼ µΛ, Z′ has i.i.d. standard normal entries and is independent of x′, and τt is defined

in (2.14). With this property in mind, a natural choice of denoisers ηtt≥0 are those that cal-

culate the conditional expectation of the signal given the value of the input argument, which

we refer to as Bayesian sliding-window denoisers. Let v ∈ RΛ, for each t ≥ 0 we define

ηt(v) := E[x′c∣∣x′ + τtZ

′ = v], (2.18)

where x′c denotes the center coordinate of x′. Figure 2.4 shows that the MSE achieved by AMP

with the non-separable sliding-window denoiser defined above is tracked by state evolution at

every iteration.

Notice that when k = 0, the denoisers ηtt≥0 are separable and the empirical distribution

of any realization of x converges to the µ1 on E ⊂ R. For this case, the state evolution analysis

for AMP with separable denoisers (k = 0) was justified by Bayati and Montanari [7]. However,

it can be seen in Figures 2.3 and 2.4 that the MSE achieved by the separable denoiser (k = 0)

is significantly higher (worse) than that achieved by the non-separable denoiser (k = 1).

2.2.3.2 Texture Image Reconstruction

We now use the Bayesian sliding-window denoiser defined in (2.18) to reconstruct binary texture

images. The MRF prior is the same type as before, namely the Π+ Binary MRF, but we set the

parameters p = 0.18, q = 0.16, r = 0.034, s = 0.01. Note that while we may learn an MRF

17

Page 28: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Figure 2.5 Reconstruction of texture images using AMP with different denoisers. From left to right:original gray level images, binary ground-truth images, images reconstructed by AMP with a totalvariation denoiser [8], non-separable Bayesian sliding-window denoiser (MRF prior, k = 1), and sepa-rable Bayesian denoiser (Bernoulli prior), respectively. From top to bottom: images of cloud, leaf, andwood, respectively. (Γ = [128]× [128], δ = 0.3, SNR = 20 dB.)

model for each of these images using well-established MRF learning algorithms, we do not

include this procedure in our simulations, since the study of texture image modeling is beyond

the scope of this work and the reconstruction results obtained using the simple MRF defined

above are sufficiently satisfactory even though the prior may be inaccurate. In Figure 2.5, we

take nature images of cloud, leaf, and wood (1st column) and use thresholding to generate the

binary testing images (2nd column). In addition to presenting the reconstructed images obtained

by AMP with the Bayesian sliding-window denoisers with k = 1 (4th column) and k = 0 (5th

column), respectively, we also present those obtained by AMP with a total variation denoiser

[8] as a baseline approach (3th column).

2.3 Proof of Theorem 2.2.1

The proof of Theorem 2.2.1 follows the work of Rush and Venkataramanan [104], with modifi-

cations for the dependent structure of the unknown vector x in (1.2). For this reason, we use

much of the same notation. We prove Theorem 2.2.1 using a technical lemma (Lemma 2.3.4),

which corresponds to [104, Lemma 4.5]. Before stating the lemma, we cover some preliminary

results and establish notation to be used in its proof.

18

Page 29: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

2.3.1 Proof Notation

As in the work by Rush and Venkataramanan [104], as well as by Bayati and Montanari [7], the

technical lemma is proved for a more general recursion, with AMP being a specific example of

the general recursion. The connection between AMP and the general recursion is explained in

(2.26) and (2.27).

Fix the half-window size 0 ≤ k ≤ (N − 1)/2, an integer. Let ftt≥0 : RΛ×Λ → R and

gtt≥0 : R2 → R be sequences of Lipschitz functions. Specifically, the arguments of ft are

two variables in RΛ, for example, for x,y ∈ RΛ, we write ft(x,y) and refer to x as the first

argument of ft. Given the noise w and the unknown signal x, define R|Γ|-valued random vectors

ht+1,qt+1 and Rn-valued random vectors bt,mt, as well as RΓ-valued random fields ht, qt,

where ht+1 = V(ht+1) and qt+1 = V(qt+1), for t ≥ 0 recursively as follows. Starting with initial

condition q0 ∈ RΓ:

ht+1 := A∗mt − ξtqt, qt := V(qt),

ht+1 := V−1(ht+1), qti := ft

(htΛi , xΛi

), for all i ∈ Γ,

bt := Aqt − λtmt−1, mti := gt(b

ti, wi), for all i ∈ [n],

(2.19)

with the scalars ξt, λt defined as

ξt :=1

n

n∑i=1

g′t(bti, wi), λt :=

1

n

∑i∈Γ

f ′t

(htΛi ,xΛi

), (2.20)

where the derivative of gt is w.r.t. the first argument, and the derivative of ft is w.r.t. the

center coordinate of the first argument. In the context of AMP, as made explicit in (2.26), the

terms ht+1 and qt measure the error in the observation V−1(A∗zt) + xt and the estimate xt

at iteration t, respectively, (the error w.r.t. the true x). The term mt measures the residual at

iteration t and the term bt is the difference between the noise and residual at iteration t.

Recall that the unknown signal x is assumed to have a stationary MRF prior with distri-

bution measure µ on EΓ. Let 0 ∈ RΛ be an all-zero array. Define

σ2x := E[x2

1], (2.21)

σ20 :=

1

δ|Γ|∑i∈Γ

E[f20 (0,xΛi)]. (2.22)

19

Page 30: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Further, for all i ∈ Γ let

q0i := f0(0,xΛi), and q0 := V(q0), (2.23)

and assume that there exist constants K,κ > 0 such that

P

(∣∣∣∣ 1n ∥∥q0∥∥2 − σ2

0

∣∣∣∣ ≥ ε) ≤ Ke−κnε2 . (2.24)

Define the state evolution scalars τ2t t≥0 and σ2

t t≥1 for the general recursion as follows.

τ2t := E[(gt(σtZ,W ))2], σ2

t :=1

δ|Γ|∑i∈Γ

E[(ft(τt−1ZΛi ,xΛi))2], (2.25)

where random variables W ∼ pw and Z ∼ N (0, 1) are independent, and x = Xii∈Γ ∼ µ and

Z = Zii∈Γ with i.i.d. N (0, 1) entries are also independent. We assume that both σ20 and τ2

0

are strictly positive. The technical lemma will show that ht+1 can be approximated as i.i.d.

N (0, τ2t ) in functions of interest for the problem, namely when used as an input to pseudo-

Lipschitz functions, and bt can be approximated as i.i.d. N (0, σ2t ) in PL functions. Moreover, it

will be shown that the probability of the deviations of the quantities 1n‖m

t‖2 and 1n‖q

t‖2 from

τ2t and σ2

t , respectively, decay exponentially in n.

We note that the AMP algorithm introduced in (2.6) and (2.7) is a special case of the

general recursion of (2.19) and (2.20). Indeed, define the following vectors recursively for t ≥ 0,

starting with x0 = 0 and z0 = y.

ht+1 = x− (V−1(A∗zt) + x0), qt = xt − x,

bt = w − zt, mt = −zt.(2.26)

It can be verified that these vectors satisfy (2.19) and (2.20) using Lipschitz functions

ft(a,xΛi) = ηt−1(xΛi − a)− xi, and gt(b, wi) = b− wi, (2.27)

where a ∈ RΛ and b ∈ R. Using the choice of ft, gt given in (2.27) also yields the expressions

for σ2t , τ

2t given in (2.14). In the remaining analysis, the general recursion given in (2.19) and

(2.20) is used. Note that in AMP, q0 = −x and σ20 = (1/δ)σ2

x, hence, assumption (2.24) for

AMP requires

P

(∣∣∣∣ 1

|Γ|‖x‖2 − σ2

x

∣∣∣∣ ≥ δε) ≤ Ke−κnδ2ε2 . (2.28)

20

Page 31: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Under our assumptions for x as stated in Section 2.2.1, we see that (2.28) is satisfied using

Lemma A.3.2 (Appendix A.3), since the function f(x) = x2 is pseudo-Lipschitz. Finally, note

that if we assume σ2x > 0 and δ <∞, then the condition of strict positivity of σ2

0 and τ20 defined

in (2.25) is satisfied.

Let [c1 | c2 | . . . | ck] denote a matrix with columns c1, . . . , ck. For t ≥ 1, define matrices

Mt := [m0 | . . . |mt−1], Qt := [q0 | . . . | qt−1], Bt := [b0| . . . |bt−1], Ht := [h1| . . . |ht]. (2.29)

Moreover, M0, Q0, B0, H0 are defined to be the all-zero vector.

The values mt‖ and qt‖ are projections of mt and qt onto the column space of Mt and Qt,

with mt⊥ := mt−mt

‖, and qt⊥ := qt−qt‖ being the projections onto the orthogonal complements

of Mt and Qt. Finally, define the vectors

αt := (αt0, . . . , αtt−1)∗, γt := (γt0, . . . , γ

tt−1)∗, (2.30)

to be the coefficient vectors of the parallel projections, i.e.,

mt‖ :=

t−1∑i=0

αtimi, qt‖ :=

t−1∑i=0

γtiqi. (2.31)

The technical lemma, Lemma 2.3.4, shows that for large n, the entries of the vectors αt and γt

concentrate to constant values which are defined in the following section.

2.3.2 Concentrating Constants

Recall that x is the unknown vector to be recovered and w is the measurement noise in the linear

model (1.2). In this section we introduce the concentrating values for various inner products of

pairs of the vectors ht,mt,qt,bt that are used in Lemma 2.3.4.

Let Ztt≥0 be a sequence of zero-mean jointly Gaussian R-valued random variables , and

let Ztt≥0 be a sequence of zero-mean jointly Gaussian RΓ-valued random variables. The

covariance of the two random sequences is defined recursively as follows. For r, t ≥ 0, i, j ∈ Γ,

E[ZrZt] =Er,tσrσt

, E[[Zr]i[Zt]j

]=

Er,tτrτt

, if i = j

0, if i 6= j, (2.32)

21

Page 32: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

where

Er,t := E[gr(σrZr,W )gt(σtZt,W )

],

Er,t :=1

δ|Γ|∑i∈Γ

E[fr(τr−1[Zr−1]Λi ,xΛi)ft(τt−1[Zt−1]Λi ,xΛi)

]. (2.33)

Note that both terms of the above (2.33) are scalar values and we take f0(·,xΛi) := f0(0,xΛi),

the initial condition. Moreover, Et,t = σ2t and Et,t = τ2

t , as can be seen by comparing (2.25)

and (2.33), thus for all i ∈ Γ, we have E[[Zt]

2i

]= E[Z2

t ] = 1. Therefore, Zt has i.i.d. N (0, 1)

entries.

Next, we define matrices Ct, Ct ∈ Rt×t and vectors Et, Et ∈ Rt whose entries are Er,tr,t≥0

and Er,tr,t≥0 defined in (2.33). For 0 ≤ i, j ≤ t− 1, define

Cti+1,j+1 := Ei,j , Cti+1,j+1 := Ei,j , (2.34)

and

Et := (E0,t . . . , Et−1,t)∗, Et := (E0,t . . . , Et−1,t)

∗. (2.35)

Lemma 2.3.1 below shows that Ct and Ct are invertible. Therefore, we can define the concen-

trating values for γt and αt defined in (2.30) as

γt := (Ct)−1Et and αt := (Ct)−1Et, (2.36)

as well as the values of (σ⊥t )2 and (τ⊥t )2 for t > 0:

(σ⊥t )2 := σ2t − (γt)∗Et = Et,t − E∗t (C

t)−1Et,

(τ⊥t )2 := τ2t − (αt)∗Et = Et,t − E∗t (C

t)−1Et.(2.37)

For t = 0, we let (σ⊥0 )2 := σ20 and (τ⊥0 )2 := τ2

0 . Finally, define the concentrating values for λt+1

and ξt defined in (2.20) as

ξt := E[g′t(σtZt,W )

], λt+1 :=

1

δ|Γ|∑i∈Γ

E[f ′t(τt[Zt]Λi ,xΛi)

]. (2.38)

Lemma 2.3.1. If (σ⊥k )2 and (τ⊥k )2 are bounded below by some positive constants for k ≤ t,

then the matrices Ck+1 and Ck+1 defined in (2.34) are invertible for k ≤ t.

22

Page 33: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Proof. The proof can be found in [104].

2.3.3 Conditional Distribution Lemma

As mentioned before, the proof of Theorem 2.2.1 relies on a technical lemma, Lemma 2.3.4,

which will be stated in Section 2.3.4 and proved in Section 2.4. Lemma 2.3.4 uses the conditional

distribution of the vector ht+1 given the matrices in (2.29) as well as x,w. Two forms of the

conditional distribution of ht+1 will be provided in Lemmas 2.3.2 and 2.3.3, which correspond

to [104, Lemma 4.3] and [104, Lemma 4.4], respectively. Lemma 2.3.3 explicitly shows that the

conditional distribution of ht+1 can be represented as the sum of a standard Gaussian vector

and a deviation term, where the explicit expression of the deviation term is provided in Lemma

2.3.2. Then Lemma 2.3.4 shows that the deviation term is small, meaning that its normalized

Euclidean norm concentrates on zero, and also provides concentration results for various inner

products involving the other terms in recursion (2.19), namely ht+1,qt,bt,mt.The following notation is used. Considering two random variables X,Y and a sigma-algebra

S , we denote the relationship that Y andX given S are equivalent in distribution byX|Sd= Y .

We represent a t × t identity matrix as It, dropping the subscript t when it is clear from the

context. For a matrix A with full column rank, P‖A := A(A∗A)−1A∗ is the orthogonal projection

matrix onto the column space of A, and P⊥A := I − P‖A. Define St1,t2 to be the sigma-algebra

generated by the terms

b0, ...,bt1−1,m0, ...,mt1−1,h1, ...,ht2 ,q0, ...,qt2 , and x,w.

Lemma 2.3.2. [104, Lemma 4.3] For the vector ht+1 defined in (2.19), the following conditional

distribution holds for t ≥ 0:

h1|S1,0

d= τ0Z0 + ∆1,0 and ht+1|St+1,t

d=

t−1∑r=0

αtrhr+1 + τ⊥t Zt + ∆t+1,t, (2.39)

where Z0,Zt are R|Γ|-valued random variables with i.i.d. N (0, 1) entries that are independent

of the corresponding conditioning sigma algebras. The term αti for i = 0, ..., t − 1 is defined in

(2.36) and the term (τ⊥t )2 in (2.37). The deviation terms are

∆1,0 =

[(∥∥m0∥∥

√n− τ0

)IN −

∥∥m0∥∥

√n

P‖q0

]Z0 + q0

(∥∥q0∥∥2

n

)−1((b0)∗m0

n− ξ0

∥∥q0∥∥2

n

), (2.40)

23

Page 34: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

and for t > 0,

∆t+1,t =

t−1∑r=0

(αtr − αtr)hr+1 +

[(∥∥mt⊥∥∥

√n− τ⊥t

)IN −

∥∥mt⊥∥∥

√n

P‖Qt+1

]Zt

+ Qt+1

(Q∗t+1Qt+1

n

)−1(

B∗t+1mt⊥

n−

Q∗t+1

n

[ξtq

t −t−1∑i=0

ξiαtiqi

]). (2.41)

Proof. The proof can be found in [104].

Note that Lemma 2.3.2 holds only when Q∗t+1Qt+1 is invertible. The following lemma pro-

vides an alternative representation of the conditional distribution of ht+1|St+1,t for t ≥ 0, and it

explicitly shows that ht+1|St+1,t is distributed as an i.i.d. Gaussian random vector with N (0, τ2t )

entries plus a deviation term.

Lemma 2.3.3. For t ≥ 0, let Zt ∈ R|Γ| be i.i.d. standard normal random vectors. Let h1pure :=

τ0Z0. For t ≥ 1, recursively define

ht+1pure =

t−1∑r=0

αtrhr+1pure + τ⊥t Zt (2.42)

and a set of scalars dti0≤i≤t with d00 = 1,

dti =t−1∑r=i

dri αtr for 0 ≤ i ≤ (t− 1), and dtt = 1. (2.43)

Let ht+1pure = V−1(ht+1

pure) ∈ RΓ. Then for all t ≥ 0 we have

(h1pure, . . . , h

t+1pure)

d= (τ0Z0, . . . , τtZt), (2.44)

where Ztt≥0 are jointly Gaussian with correlation structure defined in (2.32). Moreover,

ht+1|St+1,t

d= ht+1

pure +t∑

r=0

dtr∆r+1,r. (2.45)

Proof. First, we prove (2.44) by induction. For t = 1, h1pure = τ0V−1(Z0)

d= τ0Z0. As the

inductive hypothesis, assume (h1pure, . . . , h

tpure)

d= (τ0Z0, . . . , τt−1Zt−1). By (2.42), term ht+1

pure

24

Page 35: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

is equal in distribution to∑t−1

r=0 αtrτrZr + τ⊥t Z, where Z ∈ RΓ is independent of Zr for all

r = 0, . . . , t− 1. In what follows, we show(τ0Z0, . . . , τt−1Zt−1,

t−1∑r=0

αtrτrZr + τ⊥t Z

)d= (τ0Z0, . . . , τt−1Zt−1, τtZt).

Note that Z0, . . . , Zt−1,Z are all zero-mean Gaussian, and therefore so is the sum. We now

study the variance and covariance of∑t−1

r=0 αtrτrZr + τ⊥t Z by demonstrating the following two

results:

(1) For all i, j ∈ Γ,

E

[(t−1∑r=0

αtrτr[Zr]i + τ⊥t Zi

)(t−1∑r=0

αtrτr[Zr]j + τ⊥t Zj

)]= τ2

t E[[Zt]i, [Zt]j

]=

τ2t if i = j,

0 otherwise.

(2) For 0 ≤ s ≤ (t− 1) and all i, j ∈ Γ,

E

[τs[Zs]i

(t−1∑r=0

αtrτr[Zr]j + τ⊥t Zj

)]= τsτtE

[[Zs]i, [Zt]j

]=

Es,t if i = j,

0 otherwise.

First, consider (1). We note that

E

[(t−1∑r=0

αtrτr[Zr]i+τ⊥t Zi

)(t−1∑r=0

αtrτr[Zr]j + τ⊥t Zj

)](a)=

t−1∑r=0

t−1∑s=0

αtrαtsτrτsE

[[Zr]i[Zs]j

]+(τ⊥t )2E [ZiZj ]

(b)=

∑t−1

r=0

∑t−1s=0 α

trα

tsEr,s + (τ⊥t )2 (c)

= τ2t , if i = j,

0, otherwise.

In the above, step (a) follows from the fact that Z is independent of Z0, . . . , Zt−1, step (b) from

the covariance definition (2.32) and the fact that the elements of Z are i.i.d. standard Gaussian,

and step (c) from

t−1∑r=0

t−1∑l=0

αtrαtlEr,l = (αt)∗Ctαt = [E∗t (C

t)−1](Ct)−1[(Ct)−1Et] = E∗t (Ct)−1Et = Et,t − (τ⊥t )2.

25

Page 36: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Next, consider (2). We see that

E

[τs[Zs]i

(t−1∑r=0

αtrτr[Zr]j + τ⊥t Zj

)](a)=

t−1∑r=0

αtrτsτrE[[Zs]i[Zr]j

](b)=

∑t−1

r=0 Es,rαtr, if i = j,

0, otherwise.

In the above, step (a) follows, since Z is independent of Zs and step (b) from (2.32). Finally,

notice that∑t−1

r=0 Es,rαtr = [Ctαt]s+1 = Es,t, where the first equality holds, since the sum equals

the inner product of the (s + 1)th row of Ct with αt and the second equality by definition of

αt in (2.36).

Next, we prove (2.45), also by induction. For t = 0, by (2.39) we have ht+1|St+1,t

d= τ0Z0 +

∆1,0d= h1

pure + ∆1,0. Assume that hr+1|St+1,t

d= hr+1

pure +∑r

i=0 dri∆i+1,i holds for r = 0, . . . , t− 1

as the inductive hypothesis. Then,

ht+1|St+1,t

d=

t−1∑r=0

αtrhr+1 + τ⊥t Zt + ∆t+1,t

d=

t−1∑r=0

αtr

(hr+1pure +

r∑i=0

dri∆i+1,i

)+ τ⊥t Zt + ∆t+1,t

=t−1∑r=0

αtrhr+1pure + τ⊥t Zt +

t−1∑r=0

r∑i=0

αtrdri∆i+1,i + ∆t+1,t = ht+1

pure +t∑i=0

dti∆i+1,i.

In the above, the first equality uses (2.39) and the second the inductive hypothesis. The last

equality follows by noticing that∑t−1

r=0

∑ri=0 vr,i =

∑t−1i=0

∑t−1r=i vr,i for (vi,r)0≤i,r≤t−1 and using

(2.43).

2.3.4 Main Concentration Lemma

Lemma 2.3.4. We use the shorthand Xn.= c to denote the concentration inequality P (|Xn − c| ≥

ε) ≤ Kk,te−κk,tnε2, where Kk,t, κk,t denote constants depending on the iteration index tthe fixed

half-window size k, but not on n or ε. The following statements hold for 0 ≤ t < T ∗ and

ε ∈ (0, 1).

(a) For ∆t+1,t defined in (2.40) and (2.41),

P

(1

|Γ|‖∆t+1,t‖2 ≥ ε

)≤ Kk,te

−κk,tnε. (2.46)

26

Page 37: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

(b) For pseudo-Lipschitz functions φh : R(t+2)|Λ| → R

1

|Γ|∑i∈Γ

φh

(h1

Λi , . . . , ht+1Λi

,xΛi

).=

1

|Γ|∑i∈Γ

E[φh

(τ0[Z0]Λi , . . . , τt[Zt]Λi ,xΛi

)]. (2.47)

The RΓ-valued random variables Z0 = [Z0]ii∈Γ, . . . , Zt = [Zt]ii∈Γ are jointly Gaussian

with zero mean entries, which are independent of the other entries in the same vector with

covariance across iterations given by (2.32), and are independent of x ∼ µ.

(c) Recall that the operator V rearranges elements of an array into a vector,

(ht+1)∗q0

n

.= 0,

(ht+1)∗V(x)

n

.= 0, (2.48)

(bt)∗w

n

.= 0. (2.49)

(d) For all 0 ≤ r ≤ t,

(hr+1)∗ht+1

|Γ|.= Er,t, (2.50)

(br)∗bt

n

.= Er,t. (2.51)

(e) For all 0 ≤ r ≤ t,

(q0)∗qt+1

n

.= E0,t+1,

(qr+1)∗qt+1

n

.= Er+1,t+1, (2.52)

(mr)∗mt

n

.= Er,t. (2.53)

(f) For all 0 ≤ r ≤ t,

λt.= λt,

(ht+1)∗qr+1

n

.= λr+1Er,t,

(hr+1)∗qt+1

n

.= λt+1Er,t, (2.54)

ξt.= ξt,

(br)∗mt

n

.= ξtEr,t,

(bt)∗mr

n

.= ξrEr,t. (2.55)

(g) For Qt+1 = 1nQ∗t+1Qt+1 and Mt = 1

nM∗tMt, when the inverses exist, for all 0 ≤ i, j ≤ t and

27

Page 38: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

0 ≤ i′, j′ ≤ t− 1:

[Q−1t+1

]i+1,j+1

.= [(Ct+1)−1]i+1,j+1, γt+1

i.= γt+1

i , (2.56)[M−1t

]i′+1,j′+1

.= [(Ct)−1]i′+1,j′+1, αti′

.= αti′ , t ≥ 1, (2.57)

where γt+1i and αti′ are defined in (2.36),

(h) With σ⊥t+1, τ⊥t defined in (2.37),

1

n

∥∥qt+1⊥∥∥2 .

= (σ⊥t+1)2, (2.58)

1

n

∥∥mt⊥∥∥2 .

= (τ⊥t )2. (2.59)

2.3.5 Proof of Theorem 2.2.1

Proof. Applying part (b) of Lemma 2.3.4 to a pseudo-Lipschitz function φh : R2|Λ| → R,

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh(ht+1

Λi,xΛi)− E [φh(τtZΛi ,xΛi)]

)∣∣∣∣∣ ≥ ε)≤ Kk,te

−κk,tnε2 ,

where x = xii∈Γ with distribution measure µ on EΓ is independent of Z = Zii∈Γ with i.i.d.

N (0, 1) entries. Now for i ∈ Γ let

φh(ht+1Λi

,xΛi) := φ(ηt(xΛi − ht+1Λi

), xi), (2.60)

where φ : R2 → R is the pseudo-Lipschitz function in the statement of the theorem. The function

φh(ht+1Λi

,xΛi) in (2.60) is pseudo-Lipschitz since φ is pseudo-Lipschitz and ηt is Lipschitz. We

therefore obtain

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ(ηt(xΛi − ht+1

Λi), xi)− E [φ(ηt(xΛi − τtZΛi), xi)]

)∣∣∣∣∣ ≥ ε)≤ Kk,te

−κk,tnε2 .

The proof is completed by noting from (2.7) and (2.26) that xt+1i = ηt([V−1

(A∗zt

)+ xt]Λi) =

ηt(xΛi − ht+1Λi

).

28

Page 39: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

2.4 Proof of Lemma 2.3.4

We make use of concentration results listed in Appendices A.1, A.2, and A.3, where Appendix

A.3 contains concentration results for dependent variables that were needed to provide the new

results in this paper. Note that the lemmas that are stated in Appendices are labeled by capital

letters with numbers (e.g., Lemma A.1.1), whereas the lemmas that are stated in the body are

labeled by numbers (e.g., Lemma 2.3.1).

The proof for Lemma 2.3.4 proceeds by induction on t. We label as Ht+1 the results (2.46),

(2.47), (2.48), (2.50), (2.52), (2.54), (2.56), (2.58) and similarly as Bt the results (2.49), (2.51),

(2.53), (2.55), (2.57), (2.59). The proof consists of four steps: (1) proving that B0 holds, (2)

proving that H1 holds, (3) assuming that Br and Hs hold for all r < t and s ≤ t, then proving

that Bt holds, and (4) assuming that Br and Hs hold for all r ≤ t and s ≤ t, then proving

that Ht+1 holds. The proof for steps (1) and (3) – the B steps – follow as in [104] and are not

repeated here. In what follows we show only steps (2) and (4).

For each step, in parts (a)–(h) of the proof, we use K and κ to label universal constants,

meaning that they do not depend on n or ε, but may depend on t and k, in the concentration

upper bounds. Moreover, we use the acronym PL(2) for (order-2) pseudo-Lipschitz.

2.4.1 Step 2: Showing that H1 holds

Throughout the proof we will make use of a function S : RΛ → R that selects the center

coordinate of its argument. For example, for v ∈ RΓ,

S([v]Λi) = vi. (2.61)

We will only use S in cases where such a “center point” is well-defined. Notice that S is Lipschitz,

since |S(x)−S(x′)| = |xc − x′c| ≤ ‖x− x′‖, for all x,x′ ∈ RΛ, where xc (respectively, x′c) is the

center coordinate of x (respectively, x′). Moreover, if a function f : RΛ×Λ → R is defined as

f(x,y) := S(x) with arbitrary but fixed Λ, then f is Lipschitz, because |f(x,y)− f(x′,y′)| =|S(x)− S(x′)| ≤ ‖x− x′‖ ≤ ‖(x,y)− (x′,y′)‖. We are now ready to prove H1(a)−H1(h).

(a) The proof of H1(a) follows as the corresponding proof in [104].

(b) Let Z0 := V−1(Z0) and ∆1,0 := V−1(∆1,0), hence Z0 and ∆1,0 are RΓ-valued. For t = 0,

29

Page 40: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

the left-hand side (LHS) of (2.47) can be bounded as

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh([h1]Λi ,xΛi)− E[φh(τ0[Z0]Λi ,xΛi)]

)∣∣∣∣∣ ≥ ε)

(a)= P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh([τ0Z0 + ∆1,0]Λi ,xΛi)− E[φh(τ0[Z0]Λi ,xΛi)]

)∣∣∣∣∣ ≥ ε)

(b)

≤ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(EZ0

[φh(τ0[Z0]Λi ,xΛi)]− EZ0,x

[φh(τ0[Z0]Λi ,xΛi)])∣∣∣∣∣ ≥ ε

3

)

+ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh(τ0[Z0]Λi ,xΛi)− E

Z0[φh(τ0[Z0]Λi ,xΛi)]

)∣∣∣∣∣ ≥ ε

3

)

+ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh([τ0Z0 + ∆1,0]Λi ,xΛi)− φh(τ0[Z0]Λi ,xΛi)

)∣∣∣∣∣ ≥ ε

3

).

(2.62)

Step (a) follows from the conditional distribution of h1 given in Lemma 2.3.2 (2.39) and since

h1 = V−1(h1) and τ0Z0 + ∆1,0 = V−1(τ0Z0 + ∆1,0). Step (b) follows from Lemma A.1.1. Label

the terms on the right-hand side (RHS) of (2.62) as T1− T3. We show that each of these terms

is bounded above by Ke−κnε2.

First, consider T1. Recall the definition of the functions Ti for i ∈ Γ in (2.10), which extends

an array in RΛi∩Γ to an array in RΛ by defining the extended entries to be the average of the

entries in the original array. For arbitrary but fixed s ∈ RΛ, the function φh,i : RΛi∩Γ×Λ → Rdefined as φh,i(v, s) := φh(Ti(v), s) is PL(2) by Lemma A.2.4. Then it follows from Lemma

A.2.3 that the function φ1,i : RΛ → R defined as φ1,i(s) := EZ0

[φh,i(τ0[Z0]Λi∩Γ, s)

]is PL(2),

since [Z0]Λi∩Γ is an array of i.i.d. standard norm random variables for all i ∈ Γ. Notice that

EZ0

[φh,i(τ0[Z0]Λi∩Γ, s)

]= E

Z0

[φh(τ0[Z0]Λi , s)

]by the definition of φh,i and Ti. Therefore,

T1(a)= P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ1,i(Ti(xΛi∩Γ))− E[φ1,i(Ti(xΛi∩Γ))])

∣∣∣∣∣ ≥ ε)

(b)

≤ Ke−κnε2,

where in step (a) we use the definition of Ti in (2.10) and step (b) follows from Lemma A.3.2

by noticing from Lemma A.2.4 that the function φ1,i(Ti(·)) is PL(2) for all i ∈ Γ.

Next, consider T2. Notice that

T2 = Ex

[P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh(τ0[Z0]Λi ,xΛi)− E

Z0[φh(τ0[Z0]Λi ,xΛi)]

)∣∣∣∣∣ ≥ ε∣∣∣∣∣ x

)].

30

Page 41: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Define a function φ : EΓ → [0, 1] as

φ(a) := P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh(τ0[Z0]Λi ,xΛi)− E

Z0[φh(τ0[Z0]Λi ,xΛi)]

)∣∣∣∣∣ ≥ ε∣∣∣∣∣ x = a

)

= P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh(τ0[Z0]Λi ,aΛi)− E

Z0[φh(τ0[Z0]Λi ,aΛi)]

)∣∣∣∣∣ ≥ ε), ∀a ∈ EΓ.

Then T2 = E[φ(x)

]. For arbitrary but fixed a ∈ EΓ, let a function φ2,i : RΛ → R be defined as

φ2,i(s) := φh(s,aΛi) for each i ∈ Γ. Denoting the pseudo-Lipschitz constant of φh by L and the

bound for E by M , namely |x| ≤M, ∀x ∈ E, we have that φ2,i is PL(2) with pseudo-Lipschitz

constant independent of aΛi :

|φ2,i(s1)− φ2,i(s2)| = |φh(s1,aΛi)− φh(s2,aΛi)|≤L (1 + ‖s1‖+ ‖aΛi‖+ ‖s2‖+ ‖aΛi‖) ‖s1 − s2‖

≤ (1 + 2‖aΛi‖)L(1 + ‖s1‖+ ‖s2‖)‖s1 − s2‖

≤ (1 + 2√|Λ|M)L(1 + ‖s1‖+ ‖s2‖)‖s1 − s2‖.

Using (1 + 2√|Λ|M)L as the pseudo-Lipschitz constant for φ2,i for all i ∈ Γ, it follows that

φ(a)(a)= P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ2,i(Ti(τ0[Z0]Λi∩Γ))− E

[φ2,i(Ti(τ0[Z0]Λi∩Γ))

])∣∣∣∣∣ ≥ ε)

(b)

≤ Ke−κnε2.

In the above, step (a) uses the definition of Ti in (2.10) and the fact that [Z0]Λi∩Γd= [Z0]Λi∩Γ for

all i ∈ Γ, and therefore, E[φ2,i(Ti(τ0[Z0]Λi∩Γ))

]= E

[φ2,i(Ti(τ0[Z0]Λi∩Γ))

]for all i ∈ Γ. Step (b)

follows by Lemma A.3.4 by noticing from Lemma A.2.4 that the function φ2,i(Ti(·)) is PL(2).

Note that the constants K and κ do not depend on a. Therefore, T2 = E[φ(x)

]≤ Ke−κnε2 .

Finally consider T3, the third term on the RHS of (2.62).

T3

(a)

≤ P

(1

|Γ|∑i∈Γ

L(

1 +∥∥∥[τ0Z0 + ∆1,0]Λi

∥∥∥+∥∥∥τ0[Z0]Λi

∥∥∥)∥∥∥[∆1,0]Λi

∥∥∥ ≥ ε

3

)(b)

≤ P

(1√|Γ|

∥∥∥∆1,0

∥∥∥ ·(1 +

√2d

|Γ|

∥∥∥∆1,0

∥∥∥+ 2τ0

√2d

|Γ|

∥∥∥Z0

∥∥∥) ≥ ε

3L√

6d

). (2.63)

Step (a) follows from the fact that φh is PL(2). Step (b) uses∥∥∥[τ0Z0 + ∆1,0]Λi

∥∥∥ ≤ ∥∥∥τ0[Z0]Λi

∥∥∥+∥∥∥[∆1,0]Λi

∥∥∥ by the triangle inequality, the Cauchy-Schwarz inequality, the fact that for a ∈ RΓ,∑i∈Γ ‖aΛi‖

2 ≤ 2d ‖a‖2, where d = |Λ| = (2k + 1)p, and the following application of Lemma

31

Page 42: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

A.2.5: ∑i∈Γ

(1 +

∥∥∥[∆1,0]Λi

∥∥∥+ 2∥∥∥τ0[Z0]Λi

∥∥∥)2≤ 3

(|Γ|+ 2d

∥∥∥∆1,0

∥∥∥2+ 4τ2

0 2d∥∥∥Z0

∥∥∥2).

From (2.63), we have

T3 ≤ P

(1√|Γ|

∥∥∥Z0

∥∥∥ ≥ 2

)+ P

1√|Γ|

∥∥∥∆1,0

∥∥∥ ≥ ε√2d

min

1, 13L√

3

2 + 4τ0

√2d

(a)

≤ e−δn +Ke−κnε2,

where to obtain (a), we use Lemma A.1.4 and H1(a).

(c) We first show concentration for 1n(h1)∗V(x) = 1

n

∑i∈Γ h

1ixi. Let the function φ1 :

R2|Λ| → R be defined as φ1(x,y) := S(x)S(y) for any (x,y) ∈ RΛ×Λ where the operator

S is defined in (2.61). Then, using the fact that φ1(h1Λi,xΛi) = h1

ixi and E[φ1(τ0[Z0]Λi ,xΛi)] =

E[[τ0Z0]i]E[Xi] = 0 for all i ∈ Γ, since [Z0]i has zero-valued mean and is independent of Xi, we

find

P

(∣∣∣∣(h1)∗V(x)

n

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ1(h1

Λi ,xΛi)− E[φ1(τ0[Z0]Λi ,xΛi)])∣∣∣∣∣ ≥ δε

).

Finally, note that φ1 is pseudo-Lipschitz, since S is Lipschitz by Lemma A.2.2, hence, we can

apply H1(b) to give the desired upper bound.

Next, we show concentration for 1n(h1)∗q0 = 1

n

∑i∈Γ h

1i q

0i . Recall that q0

i = f0(0,xΛi), for

all i ∈ Γ. The function φ2 : R2|Λ| → R defined as φ2(x,y) := S(x)f0(0,y) is pseudo-Lipschitz

by Lemma A.2.2, since S and f0 are both Lipschitz. Notice that φ2([h1]Λi ,xΛi) = h1i q

0i and

E[φ2(τ0[Z0]Λi ,xΛi)] = E[τ0[Z0]i]E[f0(0,xΛi)] = 0 for all i ∈ Γ, since [Z0]i has zero-valued mean

and is independent of x. Therefore, using H1(b),

P

(∣∣∣∣(h1)∗q0

n

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ2(h1

Λi ,xΛi)− E[φ2(τ0[Z0]Λi ,xΛi)])∣∣∣∣∣ ≥ δε

)≤ Ke−κnε2 .

(d) The function φ3 : R2|Λ| → R defined as φ3(x,y) := (S(x))2 is pseudo-Lipschitz by

Lemma A.2.2, since the operator S defined in (2.61) is Lipschitz. Notice that 1|Γ|(h

1)∗h1 =1|Γ|∑

i∈Γ φ3(h1Λi,xΛi) and E[φ3(τ0[Z0]Λi ,xΛi)] = τ2

0E[([Z0]i)2] = τ2

0 for all i ∈ Γ, which follows

32

Page 43: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

from the definition of Z0 in (2.32). Therefore, the result follows using H1(b), since

P

(∣∣∣∣ 1

|Γ|∥∥h1

∥∥2 − τ20

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ3(h1

Λi ,xΛi)− E[φ3(τ0[Z0]Λi ,xΛi)])∣∣∣∣∣ ≥ ε

).

(e) We prove concentration for 1n(q0)∗q1 and the result for 1

n(q1)∗q1 follows similarly. The

function φ4 : R2|Λ| → R defined as φ4(x,y) := f0(0,y)f1(x,y) is PL(2) by Lemma A.2.2, since

f0 and f1 are Lipschitz. Notice that 1|Γ|∑

i∈Γ E[f0(0,xΛi)f1(τ0[Z0]Λi ,xΛi)] = δE0,1 by (2.33)

and (q0)∗q1 =∑

i∈Γ φ4(h1Λi,xΛi). Hence we have the desired upper bound using H1(b), since

P

(∣∣∣∣ 1n(q0)∗q1 − E0,1

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ4(h1

Λi ,xΛi)− E[φ4(τ0[Z0]Λi ,xΛi)])∣∣∣∣∣ ≥ δε

).

(f) The concentration of λ0 to λ0 follows from H1(b) applied to the function φh(h1Λi,xΛi) :=

f ′0(h1Λi,xΛi), since f ′0 is assumed to be Lipschitz, hence PL(2).

The only other result to prove is concentration for 1n(h1)∗q1 = 1

n

∑i∈Γ h

1i q

1i . The function

φ5 : R2|Λ| → R defined as φ5(x,y) = S(x)f1(x,y) is pseudo-Lipschitz by Lemma A.2.2. No-

tice that φ5(h1Λi,xΛi) = h1

i q1i . Moreover, let the function fi : R → R be defined as fi(x) :=

E[Z0]Λi\i,xΛi

[f1(R(x, [τ0Z0]Λi),xΛi)], where the function R : R1×Λ → RΛ replaces the center

coordinate of the second argument, which is in RΛ, with the first argument, which is in R. For

example, fi([τ0Z0]i) = E[Z0]Λi\i,xΛi

[f1([τ0Z0]Λi ,xΛi)]. Then we have

∑i∈Γ

E[φ5(τ0[Z0]Λi ,xΛi)] =∑i∈Γ

E[[τ0Z0]if1(τ0[Z0]Λi ,xΛi)] =∑i∈Γ

E[Z0]i

[[τ0Z0]ifi([τ0Z0]i)]

(a)=∑i∈Γ

E[Z0]i

[([τ0Z0]i)2]E

[Z0]i[f ′i([τ0Z0]i)]

(b)= τ2

0

∑i∈Γ

E[Z0]Λi ,xΛi

[f ′1([τ0Z0]Λi ,xΛi)]

(c)= δ|Γ|λ1E0,0.

In the above, step (a) follows by Stein’s Method, Lemma A.2.1, step (b) follows from the

definition of Z0 in (2.32) and the definition of f ′1, which is the partial derivative w.r.t. the

center coordinate of the first arguments, and step (c) follows from the definition of λ1 in (2.38)

and the definition of E0,0 in (2.33). Therefore, using H1(b), we have the desired upper bound,

33

Page 44: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

since

P

(∣∣∣∣ 1n(h1)∗q1 − λ1E0,0

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ5(h1

Λi ,xΛi)− E[φ5(τ0[Z0]Λi ,xΛi)])∣∣∣∣∣ ≥ δε

).

(g), (h) The proofs of H1(g),H1(h) follow as the corresponding proofs in [104].

2.4.2 Step 4: Showing that Ht+1 holds

The probability statements in the lemma and the other parts of Ht+1 are conditioned on the

event that the matrices Q1, . . . ,Qt+1 are invertible, but for the sake of brevity, we do not

explicitly state the conditioning in the probabilities. The following lemma, whose proof is the

same as in [104], will be used to prove Ht+1.

Lemma 2.4.1. [104, Lemma 5.2] Let v := 1nB∗t+1m

t⊥−

1nQ∗t+1(ξtq

t−∑t−1

i=0 αtiξiq

i) and Qt+1 :=1nQ∗t+1Qt+1. Then for j ∈ [t+ 1],

P(∣∣∣[Q−1

t+1v]j

∣∣∣ ≥ ε) ≤ e−κnε2 .We are ready to prove Ht+1(a)−Ht+1(h).

(a) The proof of Ht+1(a) follows as the corresponding proof in [104]

(b) For brevity we use the notation Eφh := 1|Γ|∑

i∈Γ E[φh(τ0[Z0]Λi , ..., τt[Zt]Λi ,xΛi)], and

ai := ([h1pure]i + [∆1,0]i, ..., [h

t+1pure]i +

t∑r=0

dtr[∆r+1,r]i, xi),

ci := ([h1pure]i, ..., [h

t+1pure]i, xi), (2.64)

for i ∈ Γ. Hence a and c are arrays in RΓ with entries ai, ci ∈ R(t+2). We note that by aΛi we

mean for the p-dimensional cube Λi to be applied to each of the (t + 2) elements of a and we

define ‖aΛi‖2 :=

∑j∈Λi‖aj‖2. Moreover, define ∆r+1,r = V−1(∆r+1,r), hence ∆r+1,r ∈ RΓ, for

all r = 0, . . . , t.

Then, using the conditional distribution of ht+1 from Lemma 2.3.3 and Lemma A.1.1, we

34

Page 45: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

have

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

φh(h1Λi , ..., h

t+1Λi

,xΛi)− Eφh

∣∣∣∣∣ ≥ ε)

≤ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φh(aΛi)− φh(cΛi))

∣∣∣∣∣ ≥ ε

2

)+ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

φh(cΛi)− Eφh

∣∣∣∣∣ ≥ ε

2

). (2.65)

Label the terms of (2.65) as T1 and T2. We next show that both terms are bounded by Ke−κnε2.

First, consider term T1. Let d = |Λ|. Notice that

‖a− c‖2 =∑i∈Γ

[∆1,0]2i +

(1∑r=0

d1r [∆r+1,r]i

)2

+ . . .+

(t∑

r=0

dtr[∆r+1,r]i

)2

(a)

≤ ‖∆1,0‖2 +1∑r=0

(d1r)

21∑

r′=0

‖∆r′+1,r′‖2 + . . .+t∑

r=0

(dtr)2

t∑r′=0

‖∆r′+1,r′‖2

(b)=

t∑r′=0

‖∆r′+1,r′‖2t∑

k=r′

k∑r=0

(dkr )2

In the above, step (a) follows from Cauchy-Schwartz and step (b) by collecting the terms in the

sums. Hence,

‖a− c‖2

|Γ|≤

(‖∆1,0‖√|Γ|

t∑k=0

k∑r=0

|dkr |+‖∆2,1‖√|Γ|

t∑k=1

k∑r=0

|dkr |+ . . .+‖∆t+1,t‖√|Γ||

t∑r=0

|dtr|

)2

. (2.66)

Denote the RHS of (2.66) by ∆2total, then using Lemma A.1.1 and H1(a)−Ht+1(a), we have

P

(‖a− c‖√|Γ|

≥ ε

)≤ P (∆total ≥ ε) ≤ Ke−κ|Γ|ε

2. (2.67)

35

Page 46: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Now, using the pseudo-Lipschitz property of φh, we have

T1 ≤ P

(1

|Γ|∑i∈Γ

L(1 + ‖aΛi‖+ ‖cΛi‖)‖[a− c]Λi‖ ≥ε

2

)(a)

≤ P

( 1

|Γ|∑i∈Γ

(1 + ‖aΛi‖+ ‖cΛi‖)2

)1/2(1

|Γ|∑i∈Γ

‖[a− c]Λi‖2)1/2

≥ ε

2L

(b)

≤ P

((1 +√

2d‖a‖√|Γ|

+√

2d‖c‖√|Γ|

)∆total ≥

ε

2√

6dL

)(c)

≤ P

((1 + 2

√2d‖c‖√|Γ|

+√

2d∆total

)∆total ≥

ε

2√

6dL

).

(2.68)

In the above, step (a) follows by Cauchy-Schwartz. Step (b) follows from an application of

Lemma A.2.5:∑i∈Γ

(1 + ‖aΛi‖+ ‖cΛi‖)2 ≤ 3

∑i∈Γ

(1 + ‖aΛi‖2 + ‖cΛi‖2

)≤ 3

(√|Γ|+

√2d‖a‖+

√2d‖c‖

)2,

and∑

i∈Γ ‖[a− c]Λi‖2 ≤ 2d‖a− c‖2 along with (2.66). Step (c) follows by ‖a‖ ≤ ‖a− c‖+‖c‖ ≤√|Γ|∆total + ‖c‖. Notice that

‖c‖2 =t∑

r=0

‖hr+1pure‖2 + ‖x‖2 d

=t∑

r=0

τ2r ‖Zr‖2 + ‖x‖2,

where the last step follows by Lemma 2.3.3. Define Ec :=∑t

r=0 τ2r + σ2

x. Then

P

(∣∣∣∣‖c‖2|Γ| − Ec∣∣∣∣ ≥ ε) ≤ t∑

r=0

P

(∣∣∣∣∣‖Zr‖2|Γ|− 1

∣∣∣∣∣ ≥ ε

(t+ 2)τ2r

)+ P

(∣∣∣∣‖x‖2|Γ| − σ2x

∣∣∣∣ ≥ ε

t+ 2

)≤ Ke−κ|Γ|ε2 , (2.69)

where the last step follows by Lemma A.1.4 and (2.28). Therefore, using the bound in (2.68),

T1 ≤ P

((1 + 2

√2d

(‖c‖√|Γ|− E1/2

c

)+ 2√

2dE1/2c +

√2d∆total

)∆total ≥

ε

2√

6dL

)

≤ P

(∣∣∣∣∣ ‖c‖√|Γ|− E1/2

c

∣∣∣∣∣ ≥ ε√2d

)+ P

(∆total ≥

ε√2d

min1, 12√

3L

4 + 2√

2dE1/2c

)≤ Ke−κ|Γ|ε2 ,

36

Page 47: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

where the last step follows by (2.67) and (2.69).

Next, consider term T2 of (2.65).

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

[φh([h1

pure]Λi , ..., [ht+1pure]Λi ,xΛi)− E[φh(τ0[Z0]Λi , ..., τt[Zt]Λi ,xΛi)]

]∣∣∣∣∣ ≥ ε

2

)

≤ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

[φh([h1

pure]Λi , ..., [ht+1pure]Λi ,xΛi)− E

Z1,...,Zt[φh(τ0[Z0]Λi , ..., τt[Zt]Λi ,xΛi)]

]∣∣∣∣∣ ≥ ε

4

)

+ P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

[EZ1,...,Zt

[φh(τ0[Z0]Λi , ..., τt[Zt]Λi ,xΛi)]− E[φh(τ0[Z0]Λi , ..., τt[Zt]Λi ,xΛi)]]∣∣∣∣∣ ≥ ε

4

).

Label the two terms on the RHS as T2a and T2b. T2a can be bounded in a similar way as T2 in

(2.62) and T2b has the desired bound by Lemma A.3.2, since the function φh : RΛ → R defined

as

φ(s) := EZ1,...,Zt

[φh(τ0[Z0]Λi , ..., τt[Zt]Λi , s)] = EZ1,...,Zt

[φh(τ0Ti([Z0]Λi∩Γ), ..., τtTi([Zt]Λi∩Γ), s)]

is PL(2) by Lemmas A.2.3 and A.2.4.

(c) We first show the concentration of 1n(ht+1)∗V(x) = 1

n

∑i∈Γ h

t+1i xi. Using the PL(2)

function φ1 defined in H1(c), we have that φ1(ht+1Λi

,xΛi) = ht+1i xi and E[φ1(τt[Zt]Λi ,xΛi)] =

E[τt[Zt]i]E[Xi] = 0 for all i ∈ Γ, since [Zt]i has zero-valued mean and is independent of Xi.

Therefore, Ht+1(b) gives the desired upper bound, since

P

(∣∣∣∣(ht+1)∗V(x)

n

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ1(ht+1

Λi,xΛi)− E[φ1(τt[Zt]Λi ,xΛi)]

)∣∣∣∣∣ ≥ δε).

We now show the concentration of 1n(ht+1)∗q0 = 1

n

∑i∈Γ h

t+1i q0

i . Using the PL(2) func-

tion φ2 defined in H1(c), we have that φ2(ht+1Λi

,xΛi) = ht+1i q0

i and E[φ2(τt[Zt]Λi ,xΛi)] =

E[τt[Zt]i]E[f0(0,xΛi)] = 0, since [Zt]i has zero-valued mean and is independent of xΛi for all

i ∈ Γ. Therefore, using Ht+1(b), we have the desired upper bound, since

P

(∣∣∣∣(ht+1)∗q0

n

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ2(ht+1

Λi,xΛi)− E[φ2(τt[Zt]Λi ,xΛi)]

)∣∣∣∣∣ ≥ δε).

(d) Let a function φ3 : R3|Λ| → R be defined as φ3(x,y, z) = S(x)S(y). Since the function

S defined in (2.61) is Lipschitz, φ3 is PL(2) by Lemma A.2.2. Note that φ3(hr+1Λi

, ht+1Λi

,xΛi) =

hr+1i ht+1

i and E[φ3(τr[Zr]Λi , τt[Zt]Λi ,xΛi)] = τrτtE[[Zr]i[Zt]i] = Er,t, where the last equality

37

Page 48: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

follows from the definition in (2.32). Therefore, the result follows by Ht+1(b), since

P

(∣∣∣∣ 1

|Γ|(hr+1)∗ht+1 − Er,t

∣∣∣∣ ≥ ε)= P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

φ3(hr+1Λi

, ht+1Λi

,xΛi)− E[φ3(τr[Zr]Λi , τt[Zt]Λi ,xΛi ]

∣∣∣∣∣ ≥ ε).

(e) We will show the concentration of 1n(q0)∗qt+1 = 1

n

∑i∈Γ q

0i qt+1i and the concentration

of 1n(qr+1)∗qt+1 follows similarly. The function φ4(x,y) : R2|Λ| → R defined as φ4(x,y) :=

f0(0,y)ft+1(x,y) is PL(2) by Lemma A.2.2 and φ4(ht+1Λi

,xΛi) = q0i qt+1i . Moreover,∑

i∈Γ

E[φ4(τt[Zt]Λi ,xΛi)] =∑i∈Γ

E[f0(0,xΛi)ft+1(τt[Zt]Λi ,xΛi)] = δ|Γ|E0,t+1,

by definition (2.33). Therefore, using Ht+1(b), we have the desired result, since

P

(∣∣∣∣(q0)∗qt+1

n− E0,t+1

∣∣∣∣ ≥ ε) = P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ4(ht+1

Λi,xΛi)− E[φ4(τt[Zt]Λi ,xΛi)]

)∣∣∣∣∣ ≥ δε).

(f) The concentration of λt around λt followsHt+1(b) applied to the function φh(ht+1Λi

,xΛi) :=

f ′t+1(ht+1Λi

,xΛi), since f ′t+1 is assumed to be Lipschitz, hence PL(2).

Next, we show concentration for 1n(ht+1)∗qr+1 = 1

n

∑i∈Γ h

t+1i qr+1

i . Let a function φ5 :

R3|Λ| → R be defined as φ5(x,y, z) := S(y)fr+1(x, z) which is PL(2) by Lemma A.2.2. Note

that φ5(hr+1Λi

, ht+1Λi

,xΛi) = ht+1i qr+1

i and∑i∈Γ

E[φ5(τr[Zr]Λi , τt[Zt]Λi ,xΛi)] =∑i∈Γ

E[[τtZt]ifr+1(τr[Zr]Λi ,xΛi)] = |Γ|λr+1Er,t, (2.70)

where the last equality follows using Stein’s Method, Lemma A.2.1, as in H1(f). Therefore,

Ht+1(b) gives the desired result, since

P

(∣∣∣∣(ht+1)∗qr+1

n− λr+1Er,t

∣∣∣∣ ≥ ε)= P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(φ5(hr+1

Λi, ht+1

Λi,xΛi)− E[φ5(τr[Zr]Λi , τt[Zt]Λi ,xΛi)]

)∣∣∣∣∣ ≥ δε).

(g) (h) The proof for Ht+1(g),Ht+1(h) is similar to the proof for Bt(g),Bt(h) in [104].

38

Page 49: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

2.5 Additional Result for 1D Signals with Markov Chain Priors

Recall that our definition of the state evolution sequence σ2t t≥0 in (2.14) is a weighted sum

with weights depending on the half-window size k and the size of the lattice Γ, which defines

the indices of the unknown signal x. In this section, we restrict to the 1D case, namely Γ = [N ]

and x has a Markov chain prior. We present an alternative state evolution analysis for AMP

with sliding-window denoisers where the definition of σ2t t≥0 does not have weights depending

on k or |Γ| as long as k is finite and fixed. The difficulty for extending this result to 2D and 3D

MRF cases is that the number of edge entries, namely the size of Γedge defined in (2.8), grows

as the size of Γ grows, unlike in the 1D case, |Γedge | = 2k, hence is fixed as long as k is fixed.

2.5.1 Definitions and Assumptions

The assumptions on the matrix A, noise w, and denoisers ηtt≥0 are the same as stated in

Section 2.2.1. We now include definitions of properties of Markov chains that will be useful

to clarify our assumptions for the unknown signal x, which is weaker than the Dobrushin

uniqueness condition stated in Section 2.2.1.

Definition 2.5.1. Consider a Markov chain with transition probability measure r(x, dy) and

stationary distribution measure γ defined on a measurable state space (E, E). Denote the set

of all γ-square-integrable functions by L2(γ) := f : R → R :∫E |f(x)|2γ(dx) < ∞. Define

a linear operator R associated with r(x, dy) as Rf(x) :=∫E f(y)r(x, dy) for f ∈ L2(γ). The

chain is said to be geometrically ergodic on L2(γ) if there exists 0 < ρ < 1 such that for

each probability measure ν that satisfies∫E |

dνdγ |

2dγ <∞, there is a constant Cν <∞ such that

supA∈E

∣∣∣∣∫Ern(x,A)ν(dx)− γ(A)

∣∣∣∣ < Cνρn, n ∈ N,

where rn(x, dy) denotes the n-step transition probability measure. In other words, geometrical

ergodicity means the chain converges to its stationary distribution γ geometrically fast. The

chain is said to be reversible if r(x, dy)γ(dx) = r(y, dx)γ(dy). Moreover, a chain is said to

have a spectral gap on L2(γ) if

g2 := 1− supλ ∈ sp(R) : λ 6= 1 > 0,

where sp(R), the spectrum of R, is a set of values for λ such that (λI−R)−1 does not exist as

a bounded linear operator on L2(γ).

39

Page 50: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

It has been proved that a Markov chain has spectral gap on L2(γ) if and only if it is reversible

and geometrically ergodic [100]. We use the existence of a spectral gap to prove concentration

results for pseudo-Lipschitz functions with dependent input, where the dependence is character-

ized by a Markov chain. Such concentration results are crucial for obtaining the main technical

lemma, Lemma 2.3.4, and hence our main result, Theorem 2.2.1. If the spectral gap does not

exist, meaning that g2 = 0, then our proof only bounds the probability of tail events in Lemma

2.3.4 by constant 1, which is useless.

With this definition, we now clarify the assumptions of the unknown signal x under which

our result is proved as follows. Let E ⊂ R be a bounded state space (countable or uncountable).

Fix the half-window size k and let a EN+2k-valued random vector xext = xexti i∈1−k,...,N+k,

where xexti : Ω → E, be a time-homogeneous, reversible, geometrically ergodic Markov chain

starting with its (unique) stationary distribution measure, which is denoted by γ. Note that γ,

which is a probability measure on E, is sub-Gaussian4 since it has bounded support. Therefore,

we have that for all i ∈ [N ] and for all ε > 0,

P( ∣∣xexti − E[xexti ]

∣∣ ≥ ε)≤ Ke−κε2 , (2.71)

for some constants K,κ > 0. We assume that the unknown signal x is a length-N segment of

xext such that xi = xexti for all i ∈ [N ].

The Dobrushin uniqueness condition defined in Section (2.2.1) implies uniform ergodicity,

which is stronger than the geometric ergodicity assumed here.

2.5.2 Performance Guarantee

We now introduce the state evolution sequences that will be used in this new result. Let the

stationary distribution measure γ and the transition probability measure r(x, dy) define the

prior distribution for the unknown vector x in (1.2) with Γ = [N ] and Λ = [2k + 1] per our

definition in (2.4) and (2.5). By our assumption of stationarity, we have xi ∼ γ, for all i ∈ [N ].

Let the E2k+1-valued random vector x′ be distributed as π, where

π(dx) = π((dx1, ..., dx2k+1)) =

2k+1∏i=2

r(xi−1, dxi)γ(dx1) (2.72)

4Recall that a zero-mean random variable X is sub-Gaussian if it satisfies the following property [17]: for all

ε > 0, P (|X − E[X]| ≥ ε) ≤ Ke−κε2

, for some constants K,κ > 0.

40

Page 51: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

is the (2k + 1)-dimension marginal of the distribution for x. Define σ2x = E[x2

1] > 0 and σ20 =

σ2x/δ. Iteratively define the quantities σ2

t t≥1 and τ2t t≥0 as follows,

τ2t = σ2

w + σ2t , σ2

t+1 =1

δE[(ηt(x

′ + τtZ′)− x′k+1

)2], (2.73)

where δ = nN , x′k+1 is the (k + 1)th coordinate of x′, and the R2k+1-valued random vector Z′

has i.i.d. N (0, 1) entries and is independent of x′.

Theorem 2.5.1 provides our main performance guarantee, which is a concentration inequality

for (order-2) pseudo-Lipschitz loss functions.

Theorem 2.5.1. With the assumptions in Section 2.5.1, for any (order-2) pseudo-Lipschitz

function φ : R2 → R, ε ∈ (0, 1), and t ≥ 0,

P

(∣∣∣∣∣ 1

N

N∑i=1

φ(xt+1i , xi)− E[φ(ηt(x

′ + τtZ′), x′k+1)]

∣∣∣∣∣ ≥ ε)≤ Kk,te

−κk,tnε2 . (2.74)

In the expectation in (2.17), x′ = (x′1, . . . , x′2k+1) is a random vector with distribution measure π,

and Z′ = (Z ′1, . . . , Z′2k+1) has i.i.d. N (0, 1) entries and is independent of x′. The deterministic

quantity τt is defined in (2.73) and constants Kk,t, κk,t > 0 depend on the iteration index t and

half-window size k, but not on n or ε.

2.5.3 Proof of Theorem 2.5.1

To prove Theorem 2.5.1, we use the same notations for the general recursion (Section 2.3.1,

where we notice that V(·) is identity in this case), and the concentration constants (Section 2.3.2)

as in the MRF case, but re-define the following quantities to match the result in Theorem 2.5.1.

Note that we still use the same definition for the “missing” entries in a window as in (2.9).

2.5.3.1 Re-Define Proof Notations

The state evolution sequences τ2t t≥0 and σ2

t t≥0 for the general recursion (2.19) is now

defined as the following. Let σ20 := E[f(0,x)2] and for t ≥ 0, iteratively define

τ2t := E[(gt(σtZ,W ))2], σ2

t+1 :=1

δE[(ft(τtZ

′,x′))2], (2.75)

where random variables W ∼ pw and Z ∼ N (0, 1) are independent, and x′ = (x′1, . . . , x′2k+1) ∼

π and Z′ = (Z ′1, . . . , Z′2k+1) with i.i.d. N (0, 1) entries are also independent. We assume that

41

Page 52: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

both σ20 and τ2

0 are strictly positive. Similar to before, this means we assume σx := E[x21] > 0

and δ <∞ in AMP.

To match this new definition of the state evolution sequences, we need to re-define the

jointly Gaussian random vectors Ztt≥0 and the concentration constants Er,t, Er,t, which were

originally defined in (2.32) and (2.33), respectively. We now let Ztt≥0 be a sequence of R2k+1-

valued jointly Gaussian random vectors with correlation structure defined as follows. For all

i, j ∈ [2k + 1], r, t ≥ 0,

E[ZrZt] =Er,tσrσt

, E[[Zr]i[Zt]j

]=

Er,tτrτt

, if i = j

0, if i 6= j, (2.76)

where

Er,t := E[gr(σrZr,W )gt(σtZt,W )

], Er,t :=

1

δE[fr(τr−1Zr−1,x

′)ft(τt−1Zt−1,x′)]. (2.77)

Note again that both terms of the above are scalar values and we take f0(·,x′) := f0(0,x′), the

initial condition. Moreover, Et,t = σ2t and Et,t = τ2

t with the new definition of σ2t , τ

2t in (2.75)

and for all i ∈ [2k + 1], we have E[[Zt]

2i

]= E[Z2

t ] = 1. Therefore, Zt again has i.i.d. N (0, 1)

entries.

Finally, the concentration constants ξt, λt, which were originally defined in (2.20), are re-

defined using the new definition of σ2t t≥0, τ2

t t≥0 in (2.75) and the new sequence of jointly

Gaussian random vectors Ztt≥0 in (2.76) as

ξt := E[g′t(σtZt,W )

], λt+1 :=

1

δE[f ′t(τtZt,x

′)]. (2.78)

2.5.3.2 Main Technical Lemma

Similar to before, the proof of Theorem 2.5.1 is achieved by using induction to prove a technical lemma that is similar in form to Lemma 2.3.4, but with the new definition of the concentration constants given above and a modified Part (b), as stated in the following lemma.

Lemma 2.5.1. For pseudo-Lipschitz functions $\phi_h : \mathbb{R}^{(t+2)|\Lambda|} \to \mathbb{R}$,
\[
\frac{1}{|\Gamma|} \sum_{i \in \Gamma} \phi_h\big(h^1_{\Lambda_i}, \ldots, h^{t+1}_{\Lambda_i}, x_{\Lambda_i}\big) \doteq \mathbb{E}\big[\phi_h\big(\tau_0 Z_0, \ldots, \tau_t Z_t, x'\big)\big]. \quad (2.79)
\]
The $\mathbb{R}^{2k+1}$-valued random vectors $Z_0, \ldots, Z_t$ are jointly Gaussian with zero-mean entries, where each entry is independent of the other entries in the same vector, the covariance across iterations is given by (2.76), and the vectors are independent of $x' = (x'_1, \ldots, x'_{2k+1}) \sim \pi$. The constants $\{\tau_t^2\}_{t\geq 0}$ are defined in (2.75).

In what follows, we prove (2.79) for $t = 0$, which corresponds to H1(b) in the induction proof of Lemma 2.3.4; the rest of the proof needed to obtain Theorem 2.5.1 then follows similarly to the proof of Theorem 2.2.1.

Proof. For $t = 0$, the LHS of (2.79) can be bounded as
\begin{align*}
& P\left( \left| \frac{1}{|\Gamma|} \sum_{i \in \Gamma} \phi_h\big(h^1_{\Lambda_i}, x_{\Lambda_i}\big) - \mathbb{E}\big[\phi_h(\tau_0 Z_0, x')\big] \right| \geq \epsilon \right) \\
& \overset{(a)}{=} P\left( \left| \frac{1}{N} \sum_{i \in \Gamma} \phi_h\big([\tau_0 Z_0 + \Delta_{1,0}]_{\Lambda_i}, x_{\Lambda_i}\big) - \mathbb{E}\big[\phi_h(\tau_0 Z_0, x')\big] \right| \geq \epsilon \right) \\
& \overset{(b)}{\leq} P\left( \left| \frac{1}{N} \sum_{i \in \Gamma} \mathbb{E}_{Z_0}\big[\phi_h(\tau_0 Z_0, x^{ext}_{\Lambda_i})\big] - \mathbb{E}_{Z_0, x'}\big[\phi_h(\tau_0 Z_0, x')\big] \right| \geq \frac{\epsilon}{3} \right) \\
& \quad + P\left( \left| \frac{1}{N} \sum_{i \in \Gamma} \Big[\phi_h\big(\tau_0 [Z^{ext}_0]_{\Lambda_i}, x^{ext}_{\Lambda_i}\big) - \mathbb{E}_{Z_0}\big[\phi_h(\tau_0 Z_0, x^{ext}_{\Lambda_i})\big]\Big] \right| \geq \frac{\epsilon}{3} \right) \\
& \quad + P\left( \left| \frac{1}{N} \sum_{i \in \Gamma} \Big[\phi_h\big([\tau_0 Z_0 + \Delta_{1,0}]_{\Lambda_i}, x_{\Lambda_i}\big) - \phi_h\big(\tau_0 [Z^{ext}_0]_{\Lambda_i}, x^{ext}_{\Lambda_i}\big)\Big] \right| \geq \frac{\epsilon}{3} \right). \quad (2.80)
\end{align*}

Step (a) follows from the conditional distribution of $h^1$ given in Lemma 2.3.2 (2.39). Step (b) follows from Lemma A.1.1, with the $\mathbb{R}^{N+2k}$-valued and $E^{N+2k}$-valued random vectors $Z^{ext}_0$ and $x^{ext}$, respectively, being `extended' versions of $Z_0$ and $x$. Recall that $x^{ext}$ is defined in Section 2.5.1 and that $x_i = x^{ext}_i$ for all $i \in [N]$. Similarly, we define $Z^{ext}_0$ to be a vector of i.i.d. $\mathcal{N}(0,1)$ random variables such that $[Z_0]_i = [Z^{ext}_0]_i$ for all $i \in [N]$.

Label the terms on the RHS of (2.80) as $T_1$-$T_3$ and note that $T_1$ and $T_2$ can be bounded above by $Ke^{-\kappa N \epsilon^2}$ in a similar way to the bounding of $T_1$ and $T_2$ in (2.62). In the following, we consider $T_3$ in (2.80).

\begin{align*}
T_3 &\leq P\left( \left| \frac{1}{N} \sum_{i=1}^{N} \Big[\phi_h\big([\tau_0 Z_0 + \Delta_{1,0}]_{\Lambda_i}, x_{\Lambda_i}\big) - \phi_h\big([\tau_0 Z_0]_{\Lambda_i}, x_{\Lambda_i}\big)\Big] \right| \geq \frac{\epsilon}{9} \right) \\
&\quad + P\left( \left| \frac{1}{N} \sum_{i=1}^{N} \Big[\phi_h\big([\tau_0 Z_0]_{\Lambda_i}, x_{\Lambda_i}\big) - \phi_h\big(\tau_0 [Z^{ext}_0]_{\Lambda_i}, x_{\Lambda_i}\big)\Big] \right| \geq \frac{\epsilon}{9} \right) \\
&\quad + P\left( \left| \frac{1}{N} \sum_{i=1}^{N} \Big[\phi_h\big(\tau_0 [Z^{ext}_0]_{\Lambda_i}, x_{\Lambda_i}\big) - \phi_h\big(\tau_0 [Z^{ext}_0]_{\Lambda_i}, x^{ext}_{\Lambda_i}\big)\Big] \right| \geq \frac{\epsilon}{9} \right). \quad (2.81)
\end{align*}

Label the terms on the right side of the above as $T_{3a}$, $T_{3b}$, and $T_{3c}$. First consider $T_{3a}$:
\begin{align*}
T_{3a} &\overset{(a)}{\leq} P\left( \frac{1}{N} \sum_{i=1}^{N} L\big(1 + \|[\tau_0 Z_0 + \Delta_{1,0}]_{\Lambda_i}\| + \|[\tau_0 Z_0]_{\Lambda_i}\|\big)\, \|[\Delta_{1,0}]_{\Lambda_i}\| \geq \frac{\epsilon}{9} \right) \\
&\overset{(b)}{\leq} P\left( \frac{\|\Delta_{1,0}\|}{\sqrt{N}} \cdot \left(1 + \sqrt{2d}\,\frac{\|\Delta_{1,0}\|}{\sqrt{N}} + 2\tau_0 \sqrt{2d}\,\frac{\|Z_0\|}{\sqrt{N}}\right) \geq \frac{\epsilon}{9L\sqrt{6d}} \right). \quad (2.82)
\end{align*}

Step (a) follows by the pseudo-Lipschitz property of $\phi_h$, and step (b) by $\|[\tau_0 Z_0 + \Delta_{1,0}]_{\Lambda_i}\| \leq \|[\tau_0 Z_0]_{\Lambda_i}\| + \|[\Delta_{1,0}]_{\Lambda_i}\|$, the Cauchy-Schwarz inequality, the following application of Lemma A.2.5:
\[
\big(1 + \|[\Delta_{1,0}]_{\Lambda_i}\| + 2\|[\tau_0 Z_0]_{\Lambda_i}\|\big)^2 \leq 3\big(1 + \|[\Delta_{1,0}]_{\Lambda_i}\|^2 + 4\|[\tau_0 Z_0]_{\Lambda_i}\|^2\big),
\]
and the fact that for $a \in \mathbb{R}^N$, $\sum_{i=1}^{N} \|a_{\Lambda_i}\|^2 \leq 2d\,\|a\|^2$, where $d = |\Lambda|$. From (2.82), we have

\[
T_{3a} \leq P\left( \frac{\|Z_0\|}{\sqrt{N}} \geq 2 \right) + P\left( \frac{\|\Delta_{1,0}\|}{\sqrt{N}} \geq \frac{\epsilon}{\sqrt{2d}} \cdot \frac{\min\big\{1, \tfrac{1}{18L\sqrt{3}}\big\}}{2 + 4\tau_0\sqrt{2d}} \right) \overset{(c)}{\leq} e^{-N} + Ke^{-\kappa N \epsilon^2},
\]
where to obtain (c) we use Lemma A.1.4 and H1(a).

Now consider the term $T_{3b}$. Similar to the way that we obtained (2.82), we have
\[
T_{3b} \leq P\left( \left(1 + \sqrt{2d}\,\frac{\|Z_0\|}{\sqrt{N}} + \tau_0\sqrt{d}\,\frac{\sqrt{N+2k}}{\sqrt{N}}\,\frac{\|Z^{ext}_0\|}{\sqrt{N+2k}}\right) \frac{\sqrt{\sum_{i=1}^{N} \|[Z_0]_{\Lambda_i} - [Z^{ext}_0]_{\Lambda_i}\|^2}}{\sqrt{N}} \geq \frac{\epsilon}{9L\sqrt{3}} \right).
\]


Using the above, as well as our assumption that $2k + 1 < N$, we have
\[
T_{3b} \leq P\left( \frac{\|Z_0\|}{\sqrt{N}} \geq 2 \right) + P\left( \frac{\|[Z^{ext}_0]_{\Gamma}\|}{\sqrt{N}} \geq 2 \right) + P\left( \frac{\sqrt{\sum_{i=1}^{N} \|[Z_0]_{\Lambda_i} - [Z^{ext}_0]_{\Lambda_i}\|^2}}{\sqrt{N}} \geq \frac{\epsilon/(27L\tau_0\sqrt{2d})}{1 + 4\tau_0\sqrt{2d}} \right). \quad (2.83)
\]

Note that the first two terms in the above are upper bounded by $e^{-N}$ using Lemma A.1.4, and so we focus on the third term. Notice that
\[
\sqrt{\sum_{i=1}^{N} \|[Z_0]_{\Lambda_i} - [Z^{ext}_0]_{\Lambda_i}\|^2} \overset{(a)}{=} \sqrt{\sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} \big(\overline{[Z_0]}_{\Lambda_i \cap \Gamma} - [Z^{ext}_0]_j\big)^2} \leq \sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} \Big|\overline{[Z_0]}_{\Lambda_i \cap \Gamma} - [Z^{ext}_0]_j\Big|,
\]
where $\overline{[Z_0]}_{\Lambda_i \cap \Gamma} = \frac{1}{|\Lambda_i \cap \Gamma|} \sum_{j \in \Lambda_i \cap \Gamma} [Z_0]_j$ following the definition of the ``missing entries'' in (2.9), and step (a) holds since $[Z_0]_{\Lambda_i} = [Z^{ext}_0]_{\Lambda_i}$ for $i \in \Gamma \setminus \Gamma_{edge}$. Label $e = \frac{\epsilon/(27L\tau_0\sqrt{2d})}{1 + 4\tau_0\sqrt{2d}}$. Using the above, the third term of (2.83) is bounded as

\begin{align*}
P\left( \frac{\sqrt{\sum_{i=1}^{N} \|[Z_0]_{\Lambda_i} - [Z^{ext}_0]_{\Lambda_i}\|^2}}{\sqrt{N}} \geq e \right) &\leq P\left( \frac{\sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} \big|\overline{[Z_0]}_{\Lambda_i \cap \Gamma} - [Z^{ext}_0]_j\big|}{\sqrt{N}} \geq e \right) \\
&\overset{(b)}{\leq} P\left( \frac{k \sum_{i \in \Gamma_{edge}} \big|\overline{[Z_0]}_{\Lambda_i \cap \Gamma}\big|}{\sqrt{N}} + \frac{\sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} \big|[Z^{ext}_0]_j\big|}{\sqrt{N}} \geq e \right) \\
&\overset{(c)}{\leq} \sum_{i \in \Gamma_{edge}} P\left( \big|\overline{[Z_0]}_{\Lambda_i \cap \Gamma}\big| \geq \frac{\sqrt{N}\, e}{(2k)(2k)} \right) + \sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} P\left( \big|[Z^{ext}_0]_j\big| \geq \frac{\sqrt{N}\, e}{2k(k+1)} \right) \\
&\overset{(d)}{\leq} 4k \exp\left( -\frac{N e^2}{32 k^4} \right) + 2k(k+1) \exp\left( -\frac{N e^2}{8 k^2 (k+1)^2} \right). \quad (2.84)
\end{align*}

In the above, step (b) holds since $\big|\overline{[Z_0]}_{\Lambda_i \cap \Gamma} - [Z^{ext}_0]_j\big| \leq \big|\overline{[Z_0]}_{\Lambda_i \cap \Gamma}\big| + \big|[Z^{ext}_0]_j\big|$ and $|\Lambda_i \cap \Gamma^c| \leq k$ for all $i \in \Gamma_{edge}$; step (c) follows from Lemma A.1.1 and the facts that $|\Gamma_{edge}| = 2k$ and that there are $k(k+1)$ terms in the double sum $\sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c}$; step (d) follows from Lemma A.1.3 by noticing that $\overline{[Z_0]}_{\Lambda_i \cap \Gamma} = \frac{1}{|\Lambda_i \cap \Gamma|} \sum_{j \in \Lambda_i \cap \Gamma} [Z_0]_j \overset{d}{=} Z'_i$, where $Z'_i \sim \mathcal{N}(0,1)$. More specifically, $Z'_1, Z'_2, \ldots, Z'_k$ are dependent random variables that are marginally standard normal.

Finally, we consider $T_{3c}$. Let $m_1 = \mathbb{E}[x_1]$ and $m_2 = \sqrt{\mathbb{E}[x_1^2]}$, where we recall that $x_1 \sim \gamma$, which is a probability measure on the bounded state space $E$, hence is sub-Gaussian. Following a similar method as for $T_{3b}$ above, we have

\begin{align*}
T_{3c} &\leq P\left( \frac{\|x\|}{\sqrt{N}} \geq 1 + m_2 \right) + P\left( \frac{\|x^{ext}\|}{\sqrt{N}} \geq 1 + m_2 \right) \\
&\quad + P\left( \frac{1}{\sqrt{N}} \sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} \Big|\big(\overline{[x]}_{\Lambda_i \cap \Gamma} - m_1\big) - \big(x^{ext}_j - m_1\big)\Big| \geq \frac{\epsilon/(27L\sqrt{2d})}{1 + 2(1+m_2)\sqrt{2d}} \right). \quad (2.85)
\end{align*}

Then by (2.28), the first two terms in the above are upper bounded by $Ke^{-\kappa N}$ using Lemma A.1.2, and so we focus on the third term. Label $e = \frac{\epsilon/(27L\sqrt{2d})}{1 + 2(1+m_2)\sqrt{2d}}$; then, using the above, the third term of (2.85) is bounded in a way similar to (2.84):

\begin{align*}
& P\left( \frac{1}{\sqrt{N}} \sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} \Big|\big(\overline{[x]}_{\Lambda_i \cap \Gamma} - m_1\big) - \big(x^{ext}_j - m_1\big)\Big| \geq e \right) \\
&\leq \sum_{i \in \Gamma_{edge}} P\left( \Big|\overline{[x]}_{\Lambda_i \cap \Gamma} - m_1\Big| \geq \frac{\sqrt{N}\, e}{(2k)(2k)} \right) + \sum_{i \in \Gamma_{edge}} \sum_{j \in \Lambda_i \cap \Gamma^c} P\left( \big|x^{ext}_j - m_1\big| \geq \frac{\sqrt{N}\, e}{2k(k+1)} \right).
\end{align*}

The desired result then follows from the sub-Gaussian tail property stated in (2.71).

2.6 Conclusion

In this chapter, we provided a non-asymptotic state evolution analysis for AMP with non-

separable sliding-window denoisers when the multidimensional input signal is a realization of

an MRF defined on a rectangular lattice with a unique stationary distribution measure on a

bounded state space. More specifically, our result showed that the errors in the estimates generated by the AMP algorithm, quantified by pseudo-Lipschitz loss functions such as the mean squared error, concentrate on the state evolution predictions at a rate exponential in the size of

the unknown signal. We adjusted the definition of the state evolution sequences, which reflects

the fact that the window has “missing” elements when the denoiser is acting on the edges of

the finite lattice where the input signal is defined.

In the special case of 1D signals with time-homogeneous, reversible, and geometrically ergodic Markov chain priors, we notice that when the window size is fixed, the number of edge entries is fixed. Using this fact, we provided an alternative non-asymptotic state evolution analysis with a different definition of the state evolution sequences. The new state evolution sequences are similar in form to the case when the input signal has an i.i.d. sub-Gaussian prior (cf. Rush and Venkataramanan [104]).


Chapter 3

Application of Approximate Message Passing with Non-Separable Denoisers

1 Chapter 2 provided rigorous performance analysis of approximate message passing (AMP)

with sliding-window denoisers in a specific setting, namely, in the linear model (1.2), the matrix

A has i.i.d. Gaussian entries and the unknown signal x has a weakly dependent Markov random

field prior with a bounded state space. In some real-world applications, such conditions may

not be satisfied. For example, the prior for the unknown signal x may be better approximated

with a distribution defined on an unbounded state space, or the i.i.d. Gaussian matrix may

not be a good approximation to the physical sensing system. Nevertheless, as we will show in this chapter, the empirical performance of AMP surpasses that of state-of-the-art methods in several practical settings, even though in these settings the rigorous performance guarantees

for AMP have not been established. In Section 3.1, we consider the case where x has a prior

defined on a possibly unbounded state space. We highlight that the distribution is not known

a priori. In Section 3.2, we consider compressive hyperspectral imaging with coded aperture

snapshot spectral imager (CASSI), where the sensing matrix is sparse and structured, hence

far from i.i.d. Gaussian.

1 The work in this chapter was joint with Dror Baron [83, 84, 114-117], Junan Zhu [83, 84], Jin Tan [114-117], Hoover Rueda [116, 117], and Gonzalo Arce [116, 117]; it was funded in part by the National Science Foundation under grant CCF-1217749 and the U.S. Army Research Office under grants W911NF-14-1-0314 and W911NF-12-1-0380.


3.1 Approximate Message Passing with Universal Denoiser

Consider the linear inverse problem (1.2) where the unknown input signal x is generated by a

stationary and ergodic random process, while the probability distribution is unknown a priori.

Our goal is to design reconstruction algorithms that are universal to the input distribution. We

present a novel algorithmic framework that combines: (i) the AMP linear inverse algorithm,

which solves the linear inverse problem by iterative denoising; (ii) a universal denoising scheme

based on context quantization, which partitions the stationary ergodic signal denoising into

i.i.d. subsequence denoising; and (iii) a density estimation approach that approximates the

probability distribution of an i.i.d. sequence by fitting a Gaussian mixture model. We provide

two implementations of our proposed algorithm with one being faster and the other being more

accurate. The two implementations compare favorably with existing universal reconstruction

algorithms in terms of both reconstruction quality and runtime.

3.1.1 Related Work

When the prior distribution of the unknown signal x is known or well-approximated by some

known distribution, we can use a Bayesian sliding-window denoiser within AMP as demon-

strated in Section 2.2.3. When such prior information is unavailable, which is usually the case

in many practical applications, we may estimate the distribution from data. One possible ap-

proach is to assume a model class, and the density estimation then reduces to parameter es-

timation; Turbo-GAMP [134] is such an example. In some cases, model uncertainty makes it

difficult to select a parametric class, hence inference without model assumption is desirable.

While approaches based on Kolmogorov complexity [36, 37, 56, 57] are theoretically appealing

for universal signal recovery, they are not computable in practice [72]. Several algorithms based

on Markov chain Monte Carlo (MCMC) [5, 132] leverage the fact that for stationary ergodic

signals, both the per-symbol empirical entropy and Kolmogorov complexity converge almost

surely to the entropy rate of the signal [29], and aim to minimize the empirical entropy. The

best existing implementation of the MCMC approach [132] often achieves a mean squared error

(MSE) that is within 3 dB of the minimum mean squared error (MMSE), which resembles a

result by Donoho for universal denoising [37].

Recall that AMP solves linear inverse problems by iteratively applying a denoiser function to estimate the unknown signal from observations corrupted by i.i.d. Gaussian noise; hence we may focus on designing denoisers, to be used within AMP, for solving the following denoising problem:
\[
s = x + v, \quad (3.1)
\]

where the unknown signal $x = \{x_i\}_{i \in \Gamma}$ is stationary ergodic, and the noise $v = \{v_i\}_{i \in \Gamma}$ has i.i.d. Gaussian entries with mean zero and variance $\sigma_v^2$. To simplify the presentation, we introduce our method in the 1D setting, namely $\Gamma = [N]$, but present numerical results for both 1D and 2D (image) settings. Because the state evolution analysis implies that, at each iteration, the noise variance can be estimated by the normalized energy of the residual vector $z^t$ in (2.6), namely $\frac{1}{n}\|z^t\|^2$, we assume that $\sigma_v^2$ in (3.1) is known.

Context quantization for universal denoising: Sivaramakrishnan and Weissman [109]

proposed a context quantization scheme for denoising stationary ergodic signals. The idea is to

quantize the noisy observations s to generate quantized contexts that are used to partition the

unquantized observations into subsequences. Precisely, given the noisy observations $s$, define the context of $s_j$ as the vector $c_j = \big[[s]_{j-k}^{j-1};\, [s]_{j+1}^{j+k}\big] \in \mathbb{R}^{2k}$ for $j = 1+k, \ldots, N-k$, where $[a]_i^j := (a_i, a_{i+1}, \ldots, a_j)$ for $i \leq j$, and $[a_1; a_2] \in \mathbb{R}^{N_1+N_2}$ denotes the concatenation of the vectors $a_1 \in \mathbb{R}^{N_1}$ and $a_2 \in \mathbb{R}^{N_2}$. For $j \leq k$ or $j \geq N-k+1$, the median value of $s$, denoted by $s_{med}$, is used for the missing elements in the contexts. As an example, for $j = k$ we only have $k-1$ elements in $s$ before $s_k$, and so the first element in $c_k$ is missing; we define $c_k := \big[s_{med};\, [s]_1^{k-1};\, [s]_{k+1}^{2k}\big]$. Clustering techniques can then be applied to the context set $\mathcal{C} := \{c_j : j \in [N]\}$, and each $c_j$ is assigned a label $l_j \in [L]$ that represents the cluster to which $c_j$ belongs. Finally, the $L$ subsequences that form a partition of $s$ are obtained as $s^{(l)} = \{s_j : l_j = l\}$ for $l \in [L]$. We denote the subsequences of $x$ and $v$ that correspond to $s^{(l)}$ by $x^{(l)}$ and $v^{(l)}$, respectively.
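The sketch below illustrates this context construction and partitioning step; it is an illustrative Python rendering (the thesis implementation is in Matlab), where the window half-size `k` and number of clusters `L` are free parameters and the k-means call stands in for the clustering discussed above.

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # any k-means implementation would do

def build_contexts(s, k):
    """Context c_j = [s_{j-k},...,s_{j-1}, s_{j+1},...,s_{j+k}], median-padded at the edges."""
    N = len(s)
    s_pad = np.concatenate([np.full(k, np.median(s)), s, np.full(k, np.median(s))])
    ctx = np.empty((N, 2 * k))
    for j in range(N):
        window = s_pad[j : j + 2 * k + 1]      # length 2k+1, centered on s_j
        ctx[j] = np.delete(window, k)          # drop the center entry
    return ctx

def partition_by_context(s, k=6, L=10, seed=0):
    ctx = build_contexts(s, k)
    _, labels = kmeans2(ctx, L, minit="++", seed=seed)
    return [np.where(labels == l)[0] for l in range(L)]  # index sets of the subsequences
```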

The entries in each subsequence s(l) are regarded as approximately conditionally identically

distributed given the common quantized contexts. The rationale underlying this concept is that

a sliding-window denoiser uses information from the contexts to estimate the current entry,

and therefore, entries with similar contexts can be grouped together and denoised using the

same denoiser. Sivaramakrishnan and Weissman [109] propose a second subsequencing step,

which further partitions each subsequence into smaller subsequences such that an element in a

subsequence does not belong to the contexts of any other elements in this subsequence. This

step ensures that the elements within each subsequence are mutually independent. Hence, the

denoising problem (3.1) is partitioned into i.i.d. subsequence denoising problems:

\[
s^{(l)} = x^{(l)} + v^{(l)}, \quad \text{for } l \in [L]. \quad (3.2)
\]

That is, in (3.2), $x^{(l)}$ is assumed to have i.i.d. entries following some probability distribution on $\mathbb{R}$ that has a probability density function (pdf) $f_x^{(l)}$, and $v^{(l)}$ again has i.i.d. Gaussian entries.

In order to estimate $f_x^{(l)}$, Sivaramakrishnan and Weissman [109] first estimate the pdf $f_s^{(l)}$ of $s^{(l)}$ via kernel density estimation. They then quantize the support and the function values of the empirical distribution function $F_x^{(l)}$ of $x^{(l)}$, and optimize over a set of quantized distribution functions such that $\int \big|\big(f_x^{(l)} \star f_v\big)(x) - f_s^{(l)}(x)\big|\, dx$ is minimized, where $f_x^{(l)}$ is the pdf associated with $F_x^{(l)}$ and $\star$ denotes convolution. Once $f_x^{(l)}$ is obtained, the conditional expectation of the entries in the $l$th subsequence can be calculated.

For error metrics that satisfy some mild technical conditions, Sivaramakrishnan and Weiss-

man [109] proved for stationary ergodic signals with bounded state space that their context

quantization based universal denoiser asymptotically achieves the optimal estimation error

among all sliding-window denoising schemes despite not knowing the prior for the signal.

Gaussian mixture learning: The pdf of a Gaussian mixture (GM) model is defined as
\[
f(x) = \sum_{r=1}^{R} \frac{\alpha_r}{\sqrt{2\pi}\,\sigma_r} \exp\left( -\frac{(x - \mu_r)^2}{2\sigma_r^2} \right), \quad (3.3)
\]
where $R$ is the number of Gaussian components and $\sum_{r=1}^{R} \alpha_r = 1$, so that $f(x)$ is a valid

pdf. Figueiredo and Jain [41] propose to learn a GM model from a given data sequence by

starting with some arbitrarily large R, and inferring the structure of the mixture by letting

the mixing probabilities αr of some components be zero. This leads to an unsupervised learning

algorithm that automatically determines the number of Gaussian components from data. This

approach resembles the concept underlying the minimum message length (MML) criterion that

selects the best overall model from the entire model space, which differs from model class

selection based on the best model within each class.2 This criterion can be interpreted as

posing a Dirichlet prior on the mixing probabilities and performing maximum a posteriori (MAP)

estimation [41]. A component-wise expectation-maximization (EM) algorithm that updates

αr, µr, σ2r sequentially in r is used to implement the MML-based approach. The main feature

of the component-wise EM algorithm is that if αr is estimated as 0, then the rth component is

immediately removed, and the expectation is recalculated before moving to the estimation of

the next component.

3.1.2 Proposed Method

Consider a linear inverse problem (1.2), where the input signal x is stationary and ergodic

obeying some probability distribution that is unknown to the inverse algorithm. We propose to

apply AMP with a universal denoiser (UD) to estimate $x$ from $y$ and $A$; we call our method AMP-UD. At each iteration, AMP decouples the linear inverse problem (1.2) into a denoising problem (3.1), where a universal denoiser $\eta_{univ}$ based on context quantization is applied. The context quantization scheme partitions the original denoising problem (3.1) into subsequence denoising problems (3.2), where the input signal is assumed to have i.i.d. entries in each subsequence, and we apply a GM-based i.i.d. denoiser $\eta_{iid}$, which computes the conditional expectation assuming a GM prior, to solve each sub-problem. A flow chart of AMP-UD is shown in Figure 3.1. In the following, we provide details of the two building blocks of our method: our modified universal denoiser $\eta_{univ}$ and the GM-based i.i.d. denoiser $\eta_{iid}$.

Figure 3.1 Flow chart of AMP-UD. AMP decouples the linear inverse problem into denoising problems. In the $t$th iteration, the universal denoiser $\eta_{univ,t}(\cdot)$ converts stationary ergodic signal denoising into i.i.d. signal denoising. Each i.i.d. denoiser $\eta_{iid,t}(s^{t,(l)})$ generates the denoised signal $x^{t+1,(l)}$ and the derivative of the denoiser $\eta'_{iid,t}(s^{t,(l)})$ for $l \in [L]$. The algorithm stops when the iteration index $t$ reaches the predefined maximum $t_{Max}$ and outputs $x^{t_{Max}}$ as the final result.

2 All models with the same number of components belong to one model class, and different models within a model class have different parameters for each component.
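To make the flow of Figure 3.1 concrete, the sketch below outlines one possible rendering of the AMP-UD iteration in Python; it assumes a context-based partitioner such as `partition_by_context` from the earlier sketch and an i.i.d. denoiser `denoise_iid` that returns both the estimate and its derivative. It is a schematic of the loop, not the exact implementation used in the experiments.

```python
import numpy as np

def amp_ud(y, A, denoise_iid, partition_by_context, t_max=30, k=6, L=10):
    """Schematic AMP-UD loop: AMP decoupling + universal denoising via context partitioning.

    denoise_iid(s_sub, sigma2) must return (x_hat_sub, eta_prime_sub) for an i.i.d. block.
    """
    n, N = A.shape
    x, z = np.zeros(N), y.copy()
    eta_prime = np.zeros(N)
    for t in range(t_max):
        # AMP residual with the Onsager correction term
        z = y - A @ x + (z / n) * np.sum(eta_prime)
        s = x + A.T @ z                      # pseudo-data: signal plus effective Gaussian noise
        sigma2 = np.sum(z ** 2) / n          # estimated noise variance of the scalar channel
        x_new = np.empty(N)
        for idx in partition_by_context(s, k=k, L=L):
            x_new[idx], eta_prime[idx] = denoise_iid(s[idx], sigma2)
        x = x_new
    return x
```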

3.1.2.1 Denoising i.i.d. Signals based on GM Learning

Now consider the denoising problem (3.2) where the input signal is assumed to have i.i.d. entries.

We pose a GM prior on x(l) and learn the parameters of the GM model with a modified version

of the GM learning algorithm originally proposed by Figueiredo and Jain [41]. The modification

is needed because we only have access to the noisy samples s(l) of the GM model, whereas the

algorithm in [41] is for clean samples, namely, estimating the GM model directly from x(l)

instead of s(l).

GM learning from noisy samples: In this case, we need to introduce latent variables that

represent the underlying clean data x(l), and estimate the parameters of the GM for the latent

variables. Similar to the original algorithm, a component is removed only when the estimated

mixing probability is non-positive. The formulas for updating the GM parameters for $x^{(l)}$ in the EM algorithm from noisy observations $s^{(l)} \in \mathbb{R}^{N_l}$ are as follows, with detailed derivations presented in Appendix B.1:
\[
\alpha_r(t+1) = \frac{\max\Big\{\sum_{i=1}^{N_l} w_i^{(r)}(t) - 1,\, 0\Big\}}{\sum_{r:\,\alpha_r > 0} \max\Big\{\sum_{i=1}^{N_l} w_i^{(r)}(t) - 1,\, 0\Big\}}, \qquad
\mu_r(t+1) = \frac{\sum_{i=1}^{N_l} w_i^{(r)}(t)\, a_i^{(r)}(t)}{\sum_{i=1}^{N_l} w_i^{(r)}(t)},
\]
\[
\sigma_r^2(t+1) = \frac{\sum_{i=1}^{N_l} w_i^{(r)}(t)\Big( v_i^{(r)}(t) + \big(a_i^{(r)}(t) - \mu_r(t+1)\big)^2 \Big)}{\sum_{i=1}^{N_l} w_i^{(r)}(t)}, \quad (3.4)
\]
where
\[
w_i^{(r)}(t) = \frac{\alpha_r(t)\, \mathcal{N}\big(s_i^{(l)};\, \mu_r(t),\, \sigma_v^2 + \sigma_r^2(t)\big)}{\sum_{m=1}^{R} \alpha_m(t)\, \mathcal{N}\big(s_i^{(l)};\, \mu_m(t),\, \sigma_v^2 + \sigma_m^2(t)\big)}, \qquad
a_i^{(r)}(t) = \frac{\sigma_r^2(t)}{\sigma_r^2(t) + \sigma_v^2}\big(s_i^{(l)} - \mu_r(t)\big) + \mu_r(t), \qquad
v_i^{(r)}(t) = \frac{\sigma_v^2\, \sigma_r^2(t)}{\sigma_v^2 + \sigma_r^2(t)}.
\]
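A minimal sketch of one pass of these updates is given below; it computes the responsibilities, posterior means, and posterior variances exactly as in (3.4), and it omits the component-removal and re-normalization bookkeeping of the full component-wise algorithm.

```python
import numpy as np
from scipy.stats import norm

def em_step_noisy(s, alpha, mu, sigma2, sigma2_v):
    """One EM update of GM parameters (alpha, mu, sigma2) from noisy samples s = x + v,
    following the update equations in (3.4)."""
    R = len(alpha)
    # responsibilities w_i^(r): posterior probability of component r given s_i
    dens = np.array([alpha[r] * norm.pdf(s, mu[r], np.sqrt(sigma2_v + sigma2[r]))
                     for r in range(R)])                     # shape (R, N_l)
    w = dens / dens.sum(axis=0, keepdims=True)
    # posterior mean / variance of the clean latent x_i under component r
    gain = sigma2[:, None] / (sigma2[:, None] + sigma2_v)
    a = gain * (s[None, :] - mu[:, None]) + mu[:, None]
    v = (sigma2_v * sigma2) / (sigma2_v + sigma2)             # shape (R,)
    # parameter updates
    num = np.maximum(w.sum(axis=1) - 1.0, 0.0)
    alpha_new = num / num.sum()
    mu_new = (w * a).sum(axis=1) / w.sum(axis=1)
    sigma2_new = (w * (v[:, None] + (a - mu_new[:, None]) ** 2)).sum(axis=1) / w.sum(axis=1)
    return alpha_new, mu_new, sigma2_new
```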

Notice that the model for the noisy samples s(l) is a GM convolved with Gaussian noise,

which is a new GM with larger component variances. Therefore, an alternative approach for

estimating the GM model for x(l) is to use the original algorithm [41] to first fit a GM to s(l),

and then subtract the noise variance σ2v from each Gaussian component of the estimated GM

model for s(l) to obtain the GM for x(l). During the parameter learning process, if a component

has a variance less than $0.2\sigma_v^2$, we assume that this low-variance component is spurious and remove it from the mixture model. However, if the component variance is between $0.2\sigma_v^2$ and $0.9\sigma_v^2$, then we force the component variance to be $0.9\sigma_v^2$ and let the algorithm keep tracking this component. For component variances greater than $0.9\sigma_v^2$, we do not adjust the algorithm.

The parameters 0.2 and 0.9 are chosen because they provide reasonable MSE performance for a wide range of signals that we tested. These parameters are then fixed for our algorithm to

generate the numerical results in Section 3.1.3. At the end of the parameter learning process,

all remaining components with variances less than $\sigma_v^2$ are set to have variances equal to $\sigma_v^2$.

Therefore, when subtracting the noise variance σ2v from the Gaussian components of fs to obtain

the components of fx, we could have components with zero-valued variance, which yields deltas

in fx. Note that deltas are in general difficult to fit with a limited amount of observations,

and this modification helps the algorithm estimate deltas. We found in our simulation that the

second approach consistently converges faster and leads to lower reconstruction error, especially

for discrete-valued inputs, for which the pdf contains deltas. Therefore, the simulation results

presented in Section 3.1.3 use the second approach.
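The variance-handling rules described above can be summarized in a few lines; the thresholds 0.2 and 0.9 are the values quoted in the text, and `components` is a hypothetical list of (alpha, mu, var) triples fitted to the noisy subsequence.

```python
def adjust_noisy_fit(components, sigma2_v):
    """Post-process GM components fitted to noisy data s^(l), then subtract the noise
    variance to obtain the mixture for x^(l) (deltas appear as zero-variance components)."""
    kept = []
    for alpha, mu, var in components:
        if var < 0.2 * sigma2_v:          # spurious low-variance component: drop it
            continue
        if var < 0.9 * sigma2_v:          # borderline: clamp and keep tracking
            var = 0.9 * sigma2_v
        kept.append((alpha, mu, var))
    # after learning, variances below sigma2_v are raised to sigma2_v, so the
    # subtraction below never goes negative
    return [(a, m, max(v, sigma2_v) - sigma2_v) for a, m, v in kept]
```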

Initialization of EM: The EM algorithm must be initialized for each parameter $\{\alpha_r, \mu_r, \sigma_r^2\}$, $r \in [R]$. One may choose to initialize the Gaussian components with equal mixing probabilities and equal variances, with the initial values of the means randomly sampled from the input

data sequence [41]. However, if the input signal is sparse, meaning most of the entries are zero-

valued, it becomes difficult to correct the initial value if the initialized values are far from the

truth. To see why a poor initialization might be problematic, consider the following scenario: a

sparse binary signal that contains a few ones and is corrupted by Gaussian noise is sent to the

algorithm. If the initialization levels of the µr’s are all around zero, then the algorithm is likely

to fit a Gaussian component with near-zero mean and large variance rather than two narrow

Gaussian components, one of which has mean close to zero while the other has mean close to

one.

To address this issue, we modify the initialization to examine the maximal distance between

each symbol of the input data sequence and the current initialization of the µr’s. If the distance

is greater than 0.1σs, then we add a Gaussian component whose mean is initialized as the value

of the symbol being examined, where σ2s is the empirical variance of the noisy observations

s(l). We found in our simulations that the modified initialization improves the accuracy of the

density estimation, and speeds up the convergence of the EM algorithm; the details of the

simulation are omitted for brevity.

Denoising: Once the parameters in (3.3) are estimated, we define a denoiser for i.i.d. signals as the conditional expectation:
\begin{align*}
\eta_{iid}(s) &= \mathbb{E}[X \,|\, S = s] = \sum_{r=1}^{R} \mathbb{E}[X \,|\, S = s, \text{comp} = r]\, P(\text{comp} = r \,|\, S = s) \\
&= \sum_{r=1}^{R} \left( \frac{\sigma_r^2}{\sigma_r^2 + \sigma_v^2}(s - \mu_r) + \mu_r \right) \frac{\alpha_r\, \mathcal{N}(s;\, \mu_r,\, \sigma_r^2 + \sigma_v^2)}{\sum_{m=1}^{R} \alpha_m\, \mathcal{N}(s;\, \mu_m,\, \sigma_m^2 + \sigma_v^2)}, \quad (3.5)
\end{align*}
where $\text{comp}$ is the component index, and
\[
\mathbb{E}[X \,|\, S = s, \text{comp} = r] = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_v^2}(s - \mu_r) + \mu_r
\]
is the Wiener filter for component $r$.

Recall that the derivative of the denoiser is needed for the AMP algorithm (2.1); hence we need to calculate the derivative of $\eta_{iid}$ in (3.5). For $s \in \mathbb{R}$, denoting
\[
f(s) = \sum_{r=1}^{R} \alpha_r\, \mathcal{N}(s;\, \mu_r,\, \sigma_r^2 + \sigma_v^2) \left( \frac{\sigma_r^2}{\sigma_r^2 + \sigma_v^2}(s - \mu_r) + \mu_r \right), \qquad
g(s) = \sum_{r=1}^{R} \alpha_r\, \mathcal{N}(s;\, \mu_r,\, \sigma_r^2 + \sigma_v^2),
\]
we have that
\[
f'(s) = \sum_{r=1}^{R} \alpha_r\, \mathcal{N}(s;\, \mu_r,\, \sigma_r^2 + \sigma_v^2) \left( \frac{\sigma_r^2 + \mu_r^2 - s\mu_r}{\sigma_r^2 + \sigma_v^2} - \left( \frac{\sigma_r(s - \mu_r)}{\sigma_r^2 + \sigma_v^2} \right)^2 \right), \qquad
g'(s) = \sum_{r=1}^{R} \alpha_r\, \mathcal{N}(s;\, \mu_r,\, \sigma_r^2 + \sigma_v^2) \left( -\frac{s - \mu_r}{\sigma_r^2 + \sigma_v^2} \right).
\]
Therefore,
\[
\eta'_{iid}(s) = \frac{f'(s)\, g(s) - f(s)\, g'(s)}{(g(s))^2}. \quad (3.6)
\]
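For concreteness, a direct implementation of (3.5) and (3.6) might look as follows; `alpha`, `mu`, and `sigma2` are the learned GM parameters (NumPy arrays of length $R$) and `sigma2_v` is the noise variance of the scalar channel.

```python
import numpy as np
from scipy.stats import norm

def eta_iid(s, alpha, mu, sigma2, sigma2_v):
    """GM-based i.i.d. denoiser (3.5) and its derivative (3.6), evaluated entrywise on s."""
    s = np.atleast_1d(s)[:, None]                       # shape (n, 1)
    c = sigma2 + sigma2_v                                # per-component channel variance
    pdf = alpha * norm.pdf(s, mu, np.sqrt(c))            # alpha_r * N(s; mu_r, c_r), shape (n, R)
    wiener = sigma2 / c * (s - mu) + mu                  # per-component Wiener estimate

    g = pdf.sum(axis=1)
    f = (pdf * wiener).sum(axis=1)
    f_prime = (pdf * ((sigma2 + mu**2 - s * mu) / c
                      - (np.sqrt(sigma2) * (s - mu) / c) ** 2)).sum(axis=1)
    g_prime = (pdf * (-(s - mu) / c)).sum(axis=1)

    x_hat = f / g                                        # eta_iid(s), Eq. (3.5)
    deriv = (f_prime * g - f * g_prime) / g**2           # eta'_iid(s), Eq. (3.6)
    return x_hat, deriv
```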

We have verified numerically for several distributions and low to moderate noise levels that

the denoising results obtained by the GM-based i.i.d. denoiser (3.5) approach the minimum

mean squared error (MMSE) within a few hundredths of a dB. For example, the favorable

reconstruction results for i.i.d. sparse Laplace signals in Figure 3.2 show that the GM-based

denoiser approaches the MMSE.


3.1.2.2 Denoising Stationary Ergodic Signals

Our universal denoiser is a modified version of the one proposed by Sivaramakrishnan and

Weissman [109], the key idea of which is to denoise a stationary ergodic signal by (i) grouping

together entries with similar contexts and (ii) applying an i.i.d. denoiser to each group. Such a

scheme is asymptotically optimal. However, their denoiser assumes an input with known bounds,

which is partly due to their density estimation method that needs to quantize the support of

the distribution function. In order to be able to estimate signals that take values from the entire

real line, in step (ii), we apply our GM-based i.i.d. denoiser introduced in Section 3.1.2.1.

We now provide details about a modification made to step (i). The context set C is acquired

in the same way as in [109], described above in Section 3.1.1. Because the elements in the

context cj ∈ C that are closer in index to sj are likely to provide more information about xj

than the ones that are located further away, we add weights to the contexts before clustering.

That is, for each $c_j \in \mathcal{C}$ of length $2k$, the weighted context is defined as
\[
c'_j = \mathrm{diag}(\theta)\, c_j,
\]
where
\[
\theta_{k_i} = \begin{cases} e^{-\beta(k - k_i)}, & k_i = 1, \ldots, k,\\ e^{-\beta(k_i - k - 1)}, & k_i = k+1, \ldots, 2k, \end{cases} \quad (3.7)
\]
for some $\beta \geq 0$, and $\mathrm{diag}(\theta)$ is a diagonal matrix with $\theta$ on its diagonal. It is observed that when the noise level is high, it is necessary to use information from longer contexts, whereas comparatively short contexts can be sufficient when the noise level is low. Therefore, the exponential decay rate $\beta$ is made adaptive to the noise level in such a way that $\beta$ increases with SNR. Specifically, $\beta$ is chosen to be linear in SNR:
\[
\beta = b_1 \log_{10}\!\big( (\|s\|^2/N - \sigma_v^2)/\sigma_v^2 \big) + b_2, \quad (3.8)
\]

where b1 > 0 and b2 can be determined numerically. Specifically, we run the algorithm with

a sufficiently large range of β values for various input signals at various SNR levels and mea-

surement rates. For each setting, we select the β that achieves the best reconstruction quality.

Then, b1 and b2 are obtained using the least squares approach. Note that b1 and b2 are fixed

for all the simulation results presented in Section 3.1.3. If the parameters were tuned for each

individual input signal, then the optimal parameter values might vary for different input signals,

and the reconstruction quality might be improved. The simulation results in Section 3.1.3 with fixed parameters show that when the parameters are slightly off from the individually optimal ones, the reconstruction quality of AMP-UD is still comparable to or better than the prior art. We

choose the linear relation because it is simple and fits well with our empirical optimal values

for $\beta$; other choices for $\beta$ might be possible. The weighted context set $\mathcal{C}' = \{c'_j : j \in [N]\}$ is then sent to a $k$-means algorithm [87], and $s^{(l)}$, $l \in [L]$, are obtained according to the labels

determined via clustering. We can now apply the GM-based i.i.d. denoiser (3.5) to each sub-

problem in (3.2). However, one potential problem is that the GM fitting algorithm might not

provide a good estimate of the model when the number of data points is small. We propose two

approaches to address this small cluster issue.
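A small sketch of the weighting step in (3.7)-(3.8) is shown below; the coefficients `b1` and `b2` are the numerically tuned constants mentioned above and are left as inputs here.

```python
import numpy as np

def context_weights(s, k, sigma2_v, b1, b2):
    """Exponential context weights theta of (3.7) with SNR-adaptive decay rate beta of (3.8)."""
    snr_db = np.log10(max(np.sum(s**2) / len(s) - sigma2_v, 1e-12) / sigma2_v)
    beta = max(b1 * snr_db + b2, 0.0)                              # beta >= 0
    left = np.exp(-beta * (k - np.arange(1, k + 1)))               # weights for s_{j-k},...,s_{j-1}
    right = np.exp(-beta * (np.arange(k + 1, 2 * k + 1) - k - 1))  # weights for s_{j+1},...,s_{j+k}
    return np.concatenate([left, right])

# weighted contexts: each row of `ctx` (from build_contexts) scaled elementwise by theta
# ctx_weighted = ctx * context_weights(s, k, sigma2_v, b1, b2)
```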

Approach 1: Borrow members from nearby clusters. A post-processing step can be

added to ensure that the pdf of s(l) is estimated from no less than T symbols. That is, if the

size of s(l), which is denoted by B, is less than T , then T −B symbols in other clusters whose

contexts are closest to the centroid of the current cluster are included to estimate the empirical

pdf of s(l), while after the pdf is estimated, the extra symbols are removed, and only s(l) is

denoised with the currently estimated pdf. We call UD with Approach 1 “UD1.”3

Approach 2: Merge statistically similar subsequences. An alternative approach is to

merge subsequences iteratively according to their statistical characterizations. The idea is to

find subsequences with pdfs that are close in relative entropy [29], and decide whether merging

them can yield a better model according to the minimum description length (MDL) [6] criterion.

Denote the iteration index for the merging process by $t$. After the $k$-means algorithm, we have obtained a set of subsequences $\{s_t^{(l)} : l \in [L_t]\}$, where $L_t$ is the current number of subsequences. A GM pdf $f_t^{(l)}$ is learned for each subsequence $s_t^{(l)} \in \mathbb{R}^{N_t^{(l)}}$. The MDL cost $c_t^{MDL}$ for the current model is calculated as
\[
c_t^{MDL} = -\sum_{l=1}^{L_t} \sum_{i=1}^{N_t^{(l)}} \log\Big( f_t^{(l)}\big([s_t^{(l)}]_i\big) \Big) + \sum_{l=1}^{L_t} \frac{3\, m_t^{(l)}}{2} \log\big( N_t^{(l)} \big) + 2 L_t + L_0 \sum_{l=1}^{L_t} \frac{n_t^{(l)}}{L_0} \log\left( \frac{L_0}{n_t^{(l)}} \right),
\]

where $m_t^{(l)}$ is the number of Gaussian components in the mixture model for subsequence $s_t^{(l)}$, $L_0$ is the number of subsequences before the merging procedure, and $n_t^{(l)}$ is the number of subsequences in the initial set $\{s_0^{(l)} : l \in [L_0]\}$ that are merged to form the subsequence $s_t^{(l)}$.

The four terms in $c_t^{MDL}$ are interpreted as follows. The first term is the negative log-likelihood of the entire noisy sequence $s$ given the current GM models. The second term is the penalty on the number of parameters used to describe the model, where we have 3 parameters $(\alpha, \mu, \sigma^2)$ for each Gaussian component and $m_t^{(l)}$ components for the subsequence $s_t^{(l)}$. The third term arises from the 2 bits used to encode $m_t^{(l)}$ for $l \in [L_t]$, because our numerical results have shown that the number of Gaussian components rarely exceeds 4. In the fourth term, $\sum_{l=1}^{L_t} \frac{n_t^{(l)}}{L_0} \log\big( L_0/n_t^{(l)} \big)$ is the uncertainty that a subsequence from the initial set is mapped to $s_t^{(l)}$ with probability $n_t^{(l)}/L_0$, for $l \in [L_t]$. Therefore, the fourth term is the coding length for mapping the $L_0$ subsequences from the initial set to the current set.

3 A related approach is $k$-nearest neighbors, where for each symbol in $s$ we find $T$ symbols whose contexts are nearest to that of the current symbol and estimate its pdf from those $T$ symbols. The $k$-nearest neighbors approach requires running the GM learning algorithm [41] $N$ times in each AMP iteration, which significantly slows down the algorithm.
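The sketch below mirrors the MDL cost above for a current set of fitted mixtures; `gm_logpdf` is a hypothetical helper returning the log-density of the $l$th fitted GM evaluated at the samples of its subsequence.

```python
import numpy as np

def mdl_cost(subseqs, gm_logpdf, n_components, n_merged, L0):
    """MDL cost of the current model: negative log-likelihood + parameter penalty
    + 2 bits per mixture for its size + coding length of the initial-to-current mapping."""
    Lt = len(subseqs)
    nll = -sum(np.sum(gm_logpdf(l, s)) for l, s in enumerate(subseqs))
    param_pen = sum(3 * n_components[l] / 2 * np.log(len(subseqs[l])) for l in range(Lt))
    size_bits = 2 * Lt
    mapping = L0 * sum((n_merged[l] / L0) * np.log(L0 / n_merged[l]) for l in range(Lt))
    return nll + param_pen + size_bits + mapping
```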

We then compute the relative entropy between the pdf of $s_t^{(l)}$ and that of $s_t^{(h)}$, for $l, h \in [L_t]$:
\[
D\Big( f_t^{(l)} \,\Big\|\, f_t^{(h)} \Big) = \int f_t^{(l)}(a) \log\left( \frac{f_t^{(l)}(a)}{f_t^{(h)}(a)} \right) da.
\]
A symmetric $L_t \times L_t$ distance matrix $D_t$ is obtained by letting its $l$th-row, $h$th-column entry be
\[
D\Big( f_t^{(l)} \,\Big\|\, f_t^{(h)} \Big) + D\Big( f_t^{(h)} \,\Big\|\, f_t^{(l)} \Big).
\]

Suppose the smallest entry in the upper triangular part of $D_t$ (not including the diagonal) is located in the $l^*$th row and $h^*$th column; then $s_t^{(l^*)}$ and $s_t^{(h^*)}$ are temporarily merged to form a new subsequence, and a new GM pdf is learned for the merged subsequence. We now have a new model with $L_{t+1} = L_t - 1$ GM pdfs, and the MDL criterion $c_{t+1}^{MDL}$ is calculated for the new model. If $c_{t+1}^{MDL}$ is smaller than $c_t^{MDL}$, then we accept the new model and calculate a new $L_{t+1} \times L_{t+1}$ distance matrix $D_{t+1}$; otherwise we keep the current model and look for the next smallest entry in the upper triangular part of the current $L_t \times L_t$ distance matrix. The number

of subsequences is decreased by at most one after each iteration, and the merging process ends

when there is only one subsequence left, or the smallest relative entropy between two GM pdfs

is greater than some threshold, which is determined numerically. We call UD with Approach 2

“UD2.”

We will see in Section 3.1.3 that UD2 is more reliable than UD1 in terms of MSE perfor-

mance, whereas UD1 is faster than UD2. This is because UD2 applies a more complicated (and

thus slower) subsequencing procedure, which allows more accurate GM models to be fitted to

subsequences.

3.1.3 Numerical Results

We run AMP-UD1 (AMP with UD1) and AMP-UD2 (AMP with UD2) in Matlab R2013a

on a Dell OPTIPLEX 9010 running an Intel(R) CoreTM i7-3770 with 16GB RAM, and test

them utilizing different types of 1D signals, including synthetic signals, a chirp sound clip, and a speech signal, at various measurement rates and SNR levels, where the SNR is defined as $10\log_{10}\big( (N\sigma_x^2)/(n\sigma_w^2) \big)$, with $\sigma_x^2$ being the (empirical) variance of the signal $x$ and $\sigma_w^2$ the variance of the measurement noise $w$. The input signal length $N$ is 10000 for synthetic signals

and roughly 10000 for the chirp sound clip and the speech signal. Moreover, we also test our

algorithm for compressive image reconstruction. The context size for the 1D case is chosen to

be 12 (k = 6), and the contexts are weighted according to (3.7) and (3.8), whereas for the 2D

(image) case, the context of a pixel is defined as the eight-nearest neighbors without weighting,

namely θ defined in (3.7) is an all-one vector. The context quantization is implemented via the

$k$-means algorithm [87]. To avoid possible divergence of AMP-UD, for example due to a poor GM fit, we employ a damping technique [98] to slow down the evolution. Specifically,

damping is an extra step in the AMP iteration (2.2); instead of updating the value of $x^{t+1}$ by the output of the denoiser $\eta_t(A^* z^t + x^t)$, a weighted sum of $\eta_t(A^* z^t + x^t)$ and $x^t$ is taken as follows:
\[
x^{t+1} = \lambda\, \eta_t(A^* z^t + x^t) + (1 - \lambda)\, x^t, \quad (3.9)
\]
for some $\lambda \in (0, 1]$.
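As a one-line illustration of (3.9), the damped update can be dropped into the AMP loop sketched earlier; `lam` plays the role of $\lambda$.

```python
# damped AMP update, Eq. (3.9): convex combination of the denoised estimate and
# the previous iterate (lam = 1 recovers the undamped update)
def damped_update(x_prev, x_denoised, lam=0.1):
    return lam * x_denoised + (1.0 - lam) * x_prev
```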

Parameters for AMP-UD1: The number of clusters L is initialized as 10, and may

become smaller if empty clusters occur. The lower bound T on the number of symbols required

to learn the GM parameters is 256. The damping parameter λ is 0.1, and we run 100 AMP

iterations.

Parameters for AMP-UD2: The initial number of clusters is set to be 30, and these

clusters will be merged according to the scheme described in Section 3.1.2.2. Because each time merging occurs we need to apply the GM fitting algorithm once more to learn a new mixture model for the merged cluster, which is computationally demanding, we apply adaptive damping [123] to reduce the number of iterations required; the number of AMP iterations is set

to be 30. The damping parameter is initialized to 0.5, and will increase (decrease) within the range $[0.01, 0.5]$ if the value of the scalar channel noise estimator $\sigma_t^2 := \|z^t\|^2/n$ decreases (increases).

The recovery performance is evaluated by the signal-to-distortion ratio (SDR), defined as
\[
\mathrm{SDR} = 10\log_{10}\big( \mathbb{E}\big[ \|x\|^2 / \|x - \widehat{x}\|^2 \big] \big), \quad (3.10)
\]
where $\widehat{x}$ is the estimate and the $\mathrm{MSE} := \frac{1}{N}\|x - \widehat{x}\|^2$ is averaged over 50 realizations of $(x, A, w)$.

For 1D signals, we compare the performance of the two AMP-UD implementations to (i) the

universal linear inverse algorithm SLA-MCMC [132]; and (ii) the empirical Bayesian message

passing approaches EM-GM-AMP-MOS [121] for i.i.d. inputs and turboGAMP [134] for non-i.i.d. inputs. Note that EM-GM-AMP-MOS assumes during recovery that the input is i.i.d., whereas turboGAMP is designed for non-i.i.d. inputs with a known statistical model. We do not include results for other well-known linear inverse algorithms such as compressive sampling matching pursuit (CoSaMP) [90], gradient projection for sparse reconstruction (GPSR) [42], or $\ell_1$ minimization [22, 35], because their SDR performance is consistently weaker than that of the three algorithms being compared. For 2D signals (images), we compare AMP-UD2 to AMP-BM3D, namely AMP with the state-of-the-art image denoiser BM3D [31]. Note that BM3D does not have an analytic formula; hence the derivative of the denoiser, which is required at every iteration of AMP, is estimated empirically by the Monte Carlo approach proposed by Metzler et al. [88].

Figure 3.2 Comparison of the reconstruction results obtained by the two AMP-UD implementations to those by SLA-MCMC and EM-GM-AMP-MOS for simulated i.i.d. sparse Laplace signals. Note that the SDR curves for the two AMP-UD implementations and EM-GM-AMP-MOS overlap the MMSE. [Plot: signal-to-distortion ratio (dB) versus measurement rate at 5 dB and 10 dB SNR.]

Sparse Laplace signal (i.i.d.): We tested i.i.d. sparse Laplace signals that have pdf $f_x(x) = 0.03\,\mathcal{L}(0, 1) + 0.97\,\delta_0(x)$, where $\mathcal{L}(0, 1)$ denotes a Laplacian distribution with mean zero and

variance one, and δ0(·) is the Dirac delta function. It is shown in Figure 3.2 that the two

AMP-UD implementations and EM-GM-AMP-MOS achieve the MMSE [48, 97], whereas SLA-

MCMC has weaker performance, because the MCMC approach is expected to sample from the

posterior and its MSE is twice the MMSE [37, 132].

Dense Markov-Rademacher signal: Consider the two-state Markov state machine with

transition probabilities $P(S_{i+1} = 1 | S_i = 0) = \frac{3}{70}$ and $P(S_{i+1} = 0 | S_i = 1) = \frac{1}{10}$. A dense Markov-Rademacher signal (MRad for short) takes values from $\{-1, +1\}$ with equal probability at the nonzero state, namely when $S_i = 1$. These parameters lead to 30% nonzero entries in an MRad signal on average. Because the MRad signal is dense (non-sparse), we must measure it with somewhat larger measurement rates and SNRs than before. It is shown in Figure 3.3a that the two AMP-UD implementations and SLA-MCMC have better overall performance than turboGAMP. AMP-UD1 outperforms SLA-MCMC except for the lowest tested measurement rate at low SNR, whereas AMP-UD2 outperforms SLA-MCMC consistently.

Figure 3.3 Comparison of the reconstruction results obtained by the two AMP-UD implementations to those by SLA-MCMC and turboGAMP for simulated stationary ergodic signals. (a) Dense Markov-Rademacher; (b) Markov-Uniform. [Plots: signal-to-distortion ratio (dB) versus measurement rate at 5 dB and 10 dB SNR.]
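For reproducibility, the two-state Markov signals used here can be generated along the following lines; the transition probabilities `p01` and `p10` correspond to the MRad values above (3/70 and 1/10), and swapping in `p01 = 3/970` with a uniform amplitude gives the Markov-uniform signal described next.

```python
import numpy as np

def markov_signal(N, p01, p10, draw_amplitude, rng=np.random.default_rng(0)):
    """Two-state Markov source: state 0 emits 0, state 1 emits draw_amplitude(rng)."""
    x = np.zeros(N)
    state = rng.random() < p01 / (p01 + p10)      # start from the stationary distribution
    for i in range(N):
        if state:
            x[i] = draw_amplitude(rng)
        state = rng.random() < (p01 if not state else 1 - p10)
    return x

# dense Markov-Rademacher signal: +/-1 with equal probability in the nonzero state
mrad = markov_signal(10_000, p01=3/70, p10=1/10,
                     draw_amplitude=lambda rng: rng.choice([-1.0, 1.0]))
```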

Markov-uniform signal: Consider the two-state Markov state machine with transition

probabilities $P(S_{i+1} = 1 | S_i = 0) = \frac{3}{970}$ and $P(S_{i+1} = 0 | S_i = 1) = \frac{1}{10}$. A Markov-uniform signal (MUnif for short) follows a uniform distribution $U[0, 1]$ at the nonzero state. These parameters

lead to 3% nonzero entries in an MUnif signal on average. It is shown in Figure 3.3b that

at low SNR, the two AMP-UD implementations achieve higher SDR than SLA-MCMC and

turboGAMP. At high SNR, the two AMP-UD implementations and turboGAMP have similar

SDR performance, and are slightly better than SLA-MCMC. We highlight that turboGAMP

needs side information about the Markovian structure of the signal, whereas the two AMP-

UD implementations and SLA-MCMC do not.

Chirp sound clip and speech signal: Our experiments up to this point use synthetic signals. We now evaluate the reconstruction quality of AMP-UD for two real-world signals. A “Chirp” sound clip and a speech signal are used. We cut a

segment with length 9600 out of the “Chirp” and a segment with length 10560 out of the speech

signal (denoted by $x$), and performed a short-time discrete cosine transform (DCT) with window size, number of DCT points, and hop size all equal to 32. The resulting short-time DCT coefficient matrix is then vectorized to form a coefficient vector $\theta$. Denoting the short-time DCT matrix by $W^{-1}$, we have $\theta = W^{-1}x$. Therefore, we can rewrite (1.2) as $y = \Phi\theta + w$, where $\Phi = AW$. Our goal is to reconstruct $\theta$ from the measurements $y$ and the matrix $\Phi$. After we obtain the estimated coefficient vector $\widehat{\theta}$, the estimated signal is calculated as $\widehat{x} = W\widehat{\theta}$. Although the coefficient vector $\theta$ may exhibit some type of memory, it is not readily modeled in closed form, and so we cannot provide a valid model for turboGAMP [134]. Therefore, we use EM-GM-AMP-MOS [121] instead of turboGAMP [134]. The SDRs of the two AMP-UD implementations, SLA-MCMC, and EM-GM-AMP-MOS [121] are plotted in Figure 3.4a for the “Chirp” and in Figure 3.4b for the speech signal. We can see that both AMP-UD implementations outperform EM-GM-AMP-MOS consistently, which implies that the simple i.i.d. model is suboptimal for these two real-world signals. Moreover, AMP-UD2 provides comparable and in most cases higher SDR than SLA-MCMC, which indicates that AMP-UD2 is more reliable in learning various statistical structures than SLA-MCMC. AMP-UD1 is the fastest among the four algorithms, but it may have lower reconstruction quality than AMP-UD2 and SLA-MCMC, owing to poor selection of the subsequences.

Figure 3.4 Comparison of the reconstruction results obtained by the two AMP-UD implementations to those by SLA-MCMC and EM-GM-AMP-MOS for real-world signals. (a) Chirp sound clip (N = 9600); (b) Speech signal (N = 10560). [Plots: signal-to-distortion ratio (dB) versus measurement rate at 5 dB and 10 dB SNR.]

Images: Finally, we compare the performance of AMP-UD2 to AMP-BM3D for compressive

image reconstruction. The size of the images is $128 \times 128$ and the measurements are noiseless ($w = 0$). Four natural images, as shown in the first four columns of Figure 3.5, are used as testing images, and the measurement rate is $\delta = 0.3$. For a synthetic image, which is a realization of the MRF defined in Section 2.2.3.1 and shown in the last column of Figure 3.5, both AMP-UD2 and AMP-BM3D fail with $\delta = 0.3$. Hence, we increase the measurement rate to $\delta = 0.5$.

Figure 3.5 Comparison of the reconstruction results obtained by AMP-UD2 to those by AMP-BM3D for images of size 128 × 128 from noiseless measurements. From top to bottom: ground-truth images, images reconstructed by AMP-UD2, and images reconstructed by AMP-BM3D. From left to right: the first four columns are natural images (δ = 0.3), and the last column is a realization of the MRF defined in Section 2.2.3.1 (δ = 0.5).

We can see from Figure 3.5 that for natural images (first four columns), AMP-BM3D and

AMP-UD2 are comparable. For the image that is a realization of an MRF (last column), AMP-

UD2 provides reliable reconstruction with an increased number of measurements, whereas AMP-

BM3D still fails. One advantage of AMP-UD2 over AMP-BM3D is that AMP-UD2 has an

analytic formula for the derivative of the denoiser, whereas AMP-BM3D relies on empirical

evaluation of the derivative using the Monte Carlo method [88].

Runtime: The runtime of AMP-UD1 and AMP-UD2 for MUnif, MRad, and the speech

signal is typically under 5 minutes and 10 minutes, respectively, but somewhat more for signals

such as sparse Laplace and the chirp sound clip that require a large number of Gaussian com-

ponents to be fit. For comparison, the runtime of SLA-MCMC is typically an hour, whereas

typical runtimes of EM-GM-AMP-MOS and turboGAMP are 30 minutes. To further accelerate AMP, we could consider parallel computing; that is, after clustering, the Gaussian mixture learning algorithm can be run simultaneously on different processors.

3.2 Approximate Message Passing with Adaptive Wiener Filter

In this section, we apply AMP with a non-separable adaptive Wiener filter denoiser to com-

pressive hyperspectral imaging, where 3D spatial-spectral information of a scene is sensed by a

coded aperture snapshot spectral imager (CASSI). The CASSI imaging process can be modeled as superimposing 3D coded and spectrally shifted voxels and projecting them onto a 2D plane, so that the

number of acquired measurements is greatly reduced. This compressive sensing method on the

one hand significantly reduces the sensing time and memory needed to store the measurements,

which overcomes the limitations in conventional hyperspectral imagers. On the other hand, the

image reconstruction problem becomes quite ill-posed. Therefore, advanced inverse algorithms

need to be developed to realize the full potential of CASSI. While AMP is originally designed

for i.i.d. Gaussian sensing matrices and the matrix representation for CASSI is far from i.i.d.

Gaussian, numerical results demonstrate that our AMP-3D-Wiener algorithm provides higher

reconstruction quality than the standard algorithms that are commonly used in CASSI systems

given the same amount of runtime.

3.2.1 Problem Formulation

The sensing process of CASSI is described as follows. Let f0(x, y, λ) denote the voxel intensity

of a scene at spatial coordinate (x, y) and at wavelength λ, and let T (x, y) denote the coded

aperture. The coded density T (x, y)f0(x, y, λ) is spectrally shifted by the dispersive element

along one spatial dimension. The energy received by the focal plane array (FPA) at coordinate

$(x, y)$ is then
\[
g(x, y) = \int_{\Lambda} T\big(x,\, y - S(\lambda)\big)\, f_0\big(x,\, y - S(\lambda),\, \lambda\big)\, d\lambda, \quad (3.11)
\]

where S(λ) is the dispersion function induced by the prism at wavelength λ. Suppose we take

a scene of spatial dimension I by J and spectral dimension L, i.e., the dimension of the image

cube is I×J×L, and the dispersion is along the second spatial dimension y, then the number of

measurements captured by the FPA will be n = I(J+L+1) for one shot and n = KI(J+L+1)

for $K$ shots. The corresponding discrete system is then
\[
y = H f_0 + w, \quad (3.12)
\]
where $f_0 \in \mathbb{R}^N$ is the vectorized 3D image cube of dimension $N = IJL$, the vectors $y \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ are the measurements and the additive noise, respectively, and the matrix $H$ is a linear operator that models the integral in (3.11). An example of the matrix $H$ is presented in Figure 3.6, and details of this modeling can be found in [4].

Figure 3.6 The matrix H is presented for K = 2, I = J = 8, and L = 4. The circled diagonal patterns that repeat horizontally correspond to the coded aperture pattern used in the first FPA shot. The second coded aperture pattern determines the next set of diagonals.
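As an illustration of this measurement model, a discrete CASSI-like forward operator can be written directly from (3.11)-(3.12): each spectral band is multiplied by the coded aperture and shifted by its band index along the second spatial dimension before the bands are summed onto the detector. This is a simplified sketch (single shot, unit-step dispersion, detector width $J+L-1$ rather than the padded $J+L+1$ count quoted above), not the exact operator used in the experiments.

```python
import numpy as np

def cassi_forward(cube, code):
    """Single-shot CASSI measurement of an I x J x L cube with an I x J coded aperture:
    code each band, shift band l by l pixels along y, and sum onto an I x (J+L-1) detector."""
    I, J, L = cube.shape
    g = np.zeros((I, J + L - 1))
    for l in range(L):
        g[:, l:l + J] += code * cube[:, :, l]   # coded band, dispersed by l pixels
    return g

# example: random binary coded aperture applied to a small random cube
rng = np.random.default_rng(0)
cube = rng.random((8, 8, 4))
code = rng.integers(0, 2, size=(8, 8)).astype(float)
y = cassi_forward(cube, code)      # shape (8, 11)
```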

Let x = Ψf0, where Ψ is an invertible sparsifying transform. Then (3.12) can be written as

y = HΨ−1x + w. (3.13)

While many choices of Ψ are possible, we choose a convenient orthonormal transform that

consists of a wavelet transform to each of the 2D images in a 3D cube, and then applies a

discrete cosine transform (DCT) along the spectral dimension. This transform has been shown

to yield good reconstruction by Arguello and Arce [3].

3.2.2 Proposed Method

We propose to use AMP with a Wiener filter denoiser to estimate x from (3.13), and call

the proposed algorithm AMP-3D-Wiener. Define $A := H\Psi^{-1}$; then the AMP algorithm follows (2.1) and (2.2). Similar to before, at the $t$th iteration we obtain the auxiliary observation vector $s^t = A^* z^t + x^t$ and generate the estimate as $x^t = \eta_t(s^t)$.

We define the denoiser $\eta_t$ as the following adaptive Wiener filter:
\[
x_i^t = [\eta_t(s^t)]_i = \frac{\max\{0,\, \nu_{i,t}^2 - \sigma_t^2\}}{(\nu_{i,t}^2 - \sigma_t^2) + \sigma_t^2}\big( s_i^t - \mu_{i,t} \big) + \mu_{i,t} = \frac{\max\{0,\, \nu_{i,t}^2 - \sigma_t^2\}}{\nu_{i,t}^2}\big( s_i^t - \mu_{i,t} \big) + \mu_{i,t}, \quad (3.14)
\]
where $\sigma_t^2 = \frac{1}{n}\|z^t\|^2$, and $\mu_{i,t}$ and $\nu_{i,t}^2$ are the empirical mean and variance of $s_i^t$ within an appropriate wavelet subband, respectively, and hence are functions of all the entries of $s^t$ in that subband. Note that the Wiener filter (3.14) can be thought of as the conditional expectation of $x_i$ given $s_i^t$ under the assumption that $x_i$ has a Gaussian prior with mean $\mu_{i,t}$ and variance $(\nu_{i,t}^2 - \sigma_t^2)$ and that the noise is Gaussian with mean zero and variance $\sigma_t^2$.
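A compact rendering of (3.14) is given below; `subband_index` is a hypothetical array assigning each transform coefficient to its wavelet subband, from which the empirical statistics are taken.

```python
import numpy as np

def adaptive_wiener(s, subband_index, sigma2_t):
    """Adaptive Wiener denoiser of (3.14): per-subband empirical mean/variance + Wiener shrinkage."""
    x_hat = np.empty_like(s)
    for b in np.unique(subband_index):
        idx = subband_index == b
        mu, nu2 = s[idx].mean(), s[idx].var()
        gain = max(nu2 - sigma2_t, 0.0) / nu2 if nu2 > 0 else 0.0
        x_hat[idx] = gain * (s[idx] - mu) + mu
    return x_hat
```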

3.2.3 Numerical Results

The simulation is performed for the scene shown in Figure 3.7. This data cube was acquired

using a wide-band Xenon lamp as the illumination source, modulated by a visible monochro-

mator spanning the spectral range between 448 nm and 664 nm, and each spectral band has

9 nm width. The image intensity was captured using a grayscale CCD camera, with pixel size

9.9 µm, and 8 bits of intensity levels. The resulting test data cube has I × J = 256× 256 pixels

of spatial resolution and L = 24 spectral bands. The measurements y are captured with K = 2

complementary shots, hence the measurement rate is n/N = KI(J +L+ 1)/(IJL) ≈ 0.09. The

signal to noise ratio (SNR) is defined as 10 log10(µg/σw) [3], where µg is the mean value of the

measurements Hf0 and σw is the standard deviation of the additive noise w. In our simulation,

we add measurement noise such that the SNR is 20 dB.

Figure 3.8a compares the reconstruction quality of AMP-3D-Wiener, gradient projection

for sparse reconstruction (GPSR) [42], and two-step iterative shrinkage/thresholding (TwIST)

[15, 124] within a certain amount of runtime. Runtime is measured on a Dell OPTIPLEX 9010

running an Intel(R) CoreTM i7-860 with 16GB RAM, and the environment is Matlab R2013a.

Figure 3.8b complements Figure 3.8a by illustrating the peak signal to noise ratio (PSNR) of

each 2D slice in the reconstructed cube separately. The PSNR is defined as the ratio between

the maximum squared value of the ground truth image cube f0 and the mean squared error

(MSE) of the estimation f . Formally,

PSNR = 10 · log10

maxx,y,λ

(f2

0,(x,y,λ)

)∑

x,y,λ

(f(x,y,λ) − f0,(x,y,λ)

)2

,

66

Page 77: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

452 nm 461 nm 470 nm 479 nm 488 nm 497 nm

506 nm 515 nm 524 nm 533 nm 542 nm 551 nm

560 nm 569 nm 578 nm 587 nm 596 nm 605 nm

614 nm 623 nm 632 nm 641 nm 650 nm 659 nm

Figure 3.7 The Lego scene. (The target object presented in the experimental results was not endorsedby the trademark owners and it is used here as fair use to illustrate the quality of reconstruction ofcompressive spectral image measurements. LEGO is a trademark of the LEGO Group, which doesnot sponsor, authorize or endorse the images in this work. The LEGO Group. All Rights Reserved.http://aboutus.lego.com/en-us/legal-notice/fair-play/.)

where f(x,y,λ) denotes the element in the cube f at spatial coordinate (x, y) and spectral coordi-

nate λ. It is shown that the cube reconstructed by AMP-3D-Wiener has 2− 4 dB higher PSNR

than the cubes reconstructed by GPSR and 0.4 − 3 dB higher than those of TwIST for all 24

slices.

3.3 Conclusion

In this chapter, we studied the empirical performance of AMP with non-separable denoisers

in the context of universal compressed sensing (imaging) and hyperspectral imaging. While

existing performance analyses of AMP, including the ones that we provided in Chapter 2, do not

cover the experimental settings considered in this chapter, the promising empirical performance

of AMP presented here makes us optimistic about the potential of the AMP framework and motivates us to further study AMP from both algorithmic and theoretical perspectives.

Figure 3.8 Comparison of AMP-3D-Wiener, GPSR, and TwIST for the Lego image cube. Cube size is I = J = 256 and L = 24. The measurements are captured with K = 2 shots using complementary random coded apertures, and the number of measurements is n = 143,872. Random Gaussian noise is added to the measurements such that the SNR is 20 dB. (a) Runtime versus average PSNR; (b) spectral band versus PSNR.

For universal compressed sensing (imaging), we designed a universal denoiser, which com-

bines concepts from context quantization and unsupervised Gaussian mixture learning, and ap-

plied our universal denoiser within AMP to reconstruct 1D and 2D signals from measurements

acquired by an i.i.d. Gaussian measurement matrix. Two implementations of our AMP-UD algo-

rithm were proposed with one being faster and the other being more accurate. For both 1D and

2D settings, we compared our AMP-UD algorithm to several state-of-the-art algorithms with

simulated and real-world signals. For 1D signal reconstruction, numerical results showed that

AMP-UD outperformed the comparison algorithms in both reconstruction quality and runtime.

For 2D signal (image) reconstruction, AMP-UD achieved comparable results as AMP-BM3D

for natural images, while AMP-UD was more stable for simulated images that are realizations

of a Markov random field.

For hyperspectral imaging, we used an adaptive Wiener filter as the denoiser in AMP to

reconstruct 3D hyperspectral image cubes from measurements acquired by the compressive hy-

perspectral imager CASSI. Our adaptive Wiener filter performs denoising in an orthonormal

sparsifying transform domain and estimates the empirical mean and variance of a coefficient

from an appropriate neighborhood, which is then followed by Wiener filtering. Numerical re-

sults showed that given the same amount of runtime, the proposed AMP-3D-Wiener algorithm


provided better reconstruction quality than GPSR and TwIST, which are standard methods

commonly used for hyperspectral imaging.


Chapter 4

Fast Computation in Linear Inverse

Problems

1 In this chapter, we consider the problem of fast computation for large-scale linear inverse

problems (1.2). In Section 4.1, we implement approximate message passing (AMP) with column-

wise partitioning of the matrix A, and provide a state evolution analysis of AMP under this

setting. In Section 4.2, we consider the case when the input signal has a large number of zero-

valued entries, and propose a two-part framework, where in Part 1, a simple and fast algorithm

is used to detect zero-valued entries in the unknown signal x, and in Part 2, the rest of the

entries are reconstructed by a high-fidelity linear inverse algorithm such as AMP. Such a two-

part scheme naturally leads to a trade-off analysis between runtime and reconstruction quality.

4.1 Multiprocessor Approximate Message Passing with Column-

Wise Partitioning

Solving a large-scale linear inverse problem using multiple processors is important in various

real-world applications due to the limitations of individual processors and constraints on data

sharing policies. In some scenarios, it might be desirable to partition the matrix A either

column-wise or row-wise and store the sub-matrices at different processors. The partitioning

style depends on data availability, computational considerations, and privacy concerns. For

1The work in this chapter was joint with Dror Baron [78–80], Yue Lu [80], and Deanna Needell [78, 79];it was funded in part by the National Science Foundation under grants CCF-1217749 and CCF-1319140, theU.S. Army Research Office under grants W911NF-04-D-0003, W911NF-14-1-0314, and W911NF-16-1-0265, theNational Science Foundation Career grant # 1348721, Simons Foundation Collaboration grant # 274305, andthe Alfred P. Sloan Research Fellowship.


example, in high-dimensional settings where N ≫ n, or in situations where the columns of A,

which represent features in feature selection problems [54], cannot be shared among processors

for privacy preservation, column-wise partitioning might be preferable. This section focuses on

the setting where the matrix is partitioned column-wise. We extend the algorithmic framework

and the theoretical analysis of approximate message passing (AMP). In particular, we show

that column-wise multiprocessor AMP (C-MP-AMP) obeys a state evolution under the same assumptions under which the state evolution for AMP holds. The state evolution results imply that

(i) the state evolution of C-MP-AMP converges to a state that is no worse than that of AMP

and (ii) the asymptotic dynamics of C-MP-AMP and AMP can be identical.

4.1.1 Definition of the Algorithm

Consider multiprocessor computing for the (non-overlapping) column-wise partitioned linear

inverse problem:

y = Σ_{l=1}^{L} A_l x_l + w,    (4.1)

where L is the total number of processors, A_l ∈ R^{n×N_l} is the sub-matrix that is stored in Processor l, and Σ_{l=1}^{L} N_l = N.

Many studies on solving the column-wise partitioned linear inverse problem (4.1) have been

in the context of distributed feature selection. Zhou et al. [129] modeled feature selection as

a parallel group testing problem. Wang et al. [125] proposed to de-correlate the data matrix

before partitioning, and each processor then works independently using the de-correlated matrix

without communication with other processors. Peng et al. [93] studied problem (4.1) in the

context of optimization, where they proposed a greedy coordinate-block descent algorithm and

a parallel implementation of the fast iterative shrinkage/thresholding algorithm (FISTA) [8].

It is worth mentioning that row-wise multiprocessor AMP [51–53] obeys the same state evo-

lution as AMP, because it distributes the computation of matrix-vector multiplication among

multiple processors and aggregates the results before any other operations. Some existing work

on row-wise multiprocessor AMP [53, 131, 133] introduces lossy compression to the communica-

tion between processors and the fusion center, whereas we assume perfect communication and

focus on the theoretical justifications and implications of the new state evolution of C-MP-AMP.

In our C-MP-AMP algorithm, the fusion center collects vectors that represent the esti-

mations of the portion of the measurement vector y contributed by the data from individual

processors according to a pre-defined communication schedule. The sum of these vectors is com-

puted in the fusion center and transmitted to all processors. Each processor performs standard


AMP iterations with a new equivalent measurement vector, which is computed using the vector

received from the fusion center. The pseudocode for C-MP-AMP is presented in Algorithm 1.

Algorithm 1 C-MP-AMP

Inputs to Processor l: y, A_l, {k_s}_{s=0,...,s̄} (number of inner iterations at each outer iteration).
Initialization: x_l^{0,k_0} = 0, z_l^{0,k_0−1} = 0, r_l^{0,k_0} = 0, ∀l ∈ [L].
for s = 1 : s̄ do (loop over outer iterations)
    At the fusion center: g^s = Σ_{u=1}^{L} r_u^{s−1,k_{s−1}}
    At Processor l:
        x_l^{s,0} = x_l^{s−1,k_{s−1}}, r_l^{s,0} = r_l^{s−1,k_{s−1}}
        for k = 0 : k_s − 1 do (loop over inner iterations)
            z_l^{s,k} = y − g^s − (r_l^{s,k} − r_l^{s,0})
            x_l^{s,k+1} = η_{s,k}(x_l^{s,k} + A_l^* z_l^{s,k})
            r_l^{s,k+1} = A_l x_l^{s,k+1} − (z_l^{s,k}/n) Σ_{i=1}^{N_l} η'_{s,k}([x_l^{s,k} + A_l^* z_l^{s,k}]_i)
Output from Processor l: x_l^{s̄,k_{s̄}}.
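To illustrate how Algorithm 1 maps to code, the following is a minimal single-process sketch of the C-MP-AMP iterations, assuming a soft-thresholding denoiser (the algorithm allows any Lipschitz η_{s,k}); the fusion-center aggregation is simulated by a plain sum, and all names are illustrative rather than taken from our implementation:

```python
import numpy as np

def soft(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def c_mp_amp(y, A_list, n_outer=10, n_inner=2, thresh=0.1):
    n = y.size
    L = len(A_list)
    x = [np.zeros(A.shape[1]) for A in A_list]     # per-processor estimates x_l
    r = [np.zeros(n) for _ in A_list]              # per-processor residual estimates r_l
    for s in range(n_outer):
        g = sum(r)                                 # fusion center aggregates r_u^{s-1,k_{s-1}}
        r0 = [ru.copy() for ru in r]               # r_l^{s,0}
        for k in range(n_inner):
            for l in range(L):
                z = y - g - (r[l] - r0[l])         # equivalent measurement for processor l
                u = x[l] + A_list[l].T @ z
                x_new = soft(u, thresh)
                onsager = (z / n) * np.count_nonzero(x_new)   # sum of eta' for soft thresholding
                r[l] = A_list[l] @ x_new - onsager
                x[l] = x_new
    return np.concatenate(x)
```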

4.1.2 Performance Analysis

4.1.2.1 Assumptions

The assumptions under which our result for C-MP-AMP holds are the same as the assumptions

made for the analysis of AMP by Rush and Venkataramanan [104]. These assumptions are

restated below for convenience.

Signal: The entries of the unknown signal x are i.i.d. according to a sub-Gaussian distri-

bution px.

Matrix: The entries of the measurement matrix A are i.i.d. according to N (0, 1/n).

Denoiser: The denoisers {η_{s,k}}_{s,k≥0} : R → R are Lipschitz, and hence weakly differentiable.

The weak derivative, which is denoted by η′s,k, is assumed to be differentiable except possibly

at a finite number of points, with bounded derivative wherever it exists.

Noise: The entries of the measurement noise w are i.i.d. according to a sub-Gaussian dis-

tribution pw with zero-valued mean and variance σ2w.


4.1.2.2 Main Result

Similar to AMP, the dynamics of the C-MP-AMP algorithm can be characterized by a state

evolution formula. Let (σ_l^{0,k_0})^2 = δ_l^{-1} E[X^2], where X ∼ p_x and δ_l = n/N_l, ∀l ∈ [L]. For outer iterations 1 ≤ s ≤ s̄ and inner iterations 0 ≤ k ≤ k_s as defined in Algorithm 1, we define the deterministic state evolution sequences (σ_l^{s,k})^2 and (τ_l^{s,k})^2 as

(σ_l^{s,0})^2 = (σ_l^{s−1,k_{s−1}})^2,    (4.2)

(τ_l^{s,k})^2 = σ_w^2 + Σ_{u≠l} (σ_u^{s,0})^2 + (σ_l^{s,k})^2,    (4.3)

(σ_l^{s,k+1})^2 = δ_l^{-1} E[(η_{s,k}(X + τ_l^{s,k} Z) − X)^2],    (4.4)

where Z ∼ N(0, 1) is independent of X ∼ p_x.
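The recursion (4.2)-(4.4) is straightforward to evaluate numerically; the following sketch approximates the expectations by Monte Carlo sampling, assuming a Bernoulli-Gaussian prior and a soft-thresholding denoiser purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_c_mp_amp(delta, sigma_w2, n_outer, n_inner, sparsity=0.1,
                thresh=0.1, n_mc=100_000):
    """Monte Carlo evaluation of (4.2)-(4.4); delta is the list (n/N_l)."""
    L = len(delta)
    X = rng.standard_normal(n_mc) * (rng.random(n_mc) < sparsity)   # samples of X ~ p_x
    Z = rng.standard_normal(n_mc)
    sigma2 = [np.mean(X ** 2) / d for d in delta]                   # (sigma_l^{0,k_0})^2
    for s in range(n_outer):
        sigma2_start = list(sigma2)                                 # (sigma_l^{s,0})^2, eq. (4.2)
        for k in range(n_inner):
            for l in range(L):
                tau2 = sigma_w2 + sum(sigma2_start[u] for u in range(L) if u != l) + sigma2[l]
                u = X + np.sqrt(tau2) * Z
                est = np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
                sigma2[l] = np.mean((est - X) ** 2) / delta[l]      # eq. (4.4)
    return sigma2
```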

Theorem 4.1.1 provides the performance guarantee for C-MP-AMP.

Theorem 4.1.1. Under the assumptions listed in Section 4.1.2.1, let L be a fixed integer. For l = 1, ..., L, let n/N_l = δ_l ∈ (0, ∞) be a constant. Define N = Σ_{l=1}^{L} N_l. Then for any (order 2) pseudo-Lipschitz function φ : R^2 → R, we have ∀ε ∈ (0, 1), there exist constants K_{s,k}, κ_{s,k} > 0 independent of n, ε, such that for any l ∈ [L],

P( | (1/N_l) Σ_{i=1}^{N_l} φ([x_l^{s,k+1}]_i, [x_l]_i) − E[φ(η_{s,k}(X + τ_l^{s,k} Z), X)] | ≥ ε ) ≤ K_{s,k} e^{−κ_{s,k} n ε^2},

where x_l^{s,k+1} is generated by Algorithm 1, τ_l^{s,k} is defined in (4.3), and X ∼ p_x is independent of Z ∼ N(0, 1).

Remark 4.1.1. (C-MP-AMP converges to a fixed point that is no worse than that of AMP.) Let {τ_l^2}_{l∈[L]}, {σ_l^2}_{l∈[L]} denote the set of fixed points of (4.2)-(4.4). Then (4.3) becomes τ_l^2 = σ_w^2 + Σ_{u=1}^{L} σ_u^2, which leaves the RHS of (4.3) independent of l; hence, the τ_l^2 are equal for all l ∈ [L]. Let τ^2 denote τ_l^2, and plug (4.4) into (4.3); we have

τ^2 = σ_w^2 + Σ_{l=1}^{L} δ_l^{-1} E[(η(X + τZ) − X)^2] (a)= σ_w^2 + δ^{-1} E[(η(X + τZ) − X)^2],

which is identical to the fixed point equation obtained from (2.3). In the above, step (a) holds because Σ_{l=1}^{L} δ_l^{-1} = Σ_{l=1}^{L} N_l/n = N/n. Because AMP always converges to the worst fixed point of the fixed point equation (2.3) [65], the average asymptotic performance of C-MP-AMP is


identical to AMP when there is only one solution to the fixed point equation, and at least as

good as AMP in case of multiple fixed points.

Remark 4.1.2. (The asymptotic dynamics of C-MP-AMP can be identical to AMP with a specific communication schedule.) Let k_s = 1, ∀s. In this case, the quantity (τ_l^{s,k})^2 is involved only for k = 0. Then (4.3) becomes (τ_l^{s,0})^2 = σ_w^2 + Σ_{u=1}^{L} (σ_u^{s,0})^2; hence, the (τ_l^{s,0})^2 are again identical for all l ∈ [L]. Let (τ^s)^2 denote (τ_l^{s,0})^2; then (4.2)-(4.4) can be written as

(τ^s)^2 = σ_w^2 + Σ_{l=1}^{L} δ_l^{-1} E[(η_{s−1}(X + τ^{s−1} Z) − X)^2] = σ_w^2 + δ^{-1} E[(η_{s−1}(X + τ^{s−1} Z) − X)^2],

where the iteration evolves over s, which is identical to (2.3) evolving over t.

4.1.3 Proof of Theorem 4.1.1

4.1.3.1 Proof Notations

Without loss of generality, we assume the sequence {k_s}_{s≥0} in Algorithm 1 to be a constant value k. Let t = sk + k and θ(t) = ⌊t/k⌋k. Given w and x_l in (4.1), for l ∈ [L], define R^{N_l}-valued random vectors h_l^{t+1}, q_l^t and R^n-valued random vectors b_l^t, m_l^t for t ≥ 0 recursively as follows. Starting with q_l^0 = −x_l, iteratively define

h_l^{t+1} = A_l^* m_l^t − q_l^t,   [q_l^t]_i = f_t([h_l^t]_i, [x_l]_i), ∀i ∈ [N_l],
b_l^t = A_l q_l^t − λ_l^t m_l^{t−1},   m_l^t = b_l^t + Σ_{u≠l} b_u^{θ(t)} − w,    (4.5)

where

f_t([h_l^t]_i, [x_l]_i) := η_{t−1}([x_l]_i − [h_l^t]_i) − [x_l]_i,   and   λ_l^t := (1/(δ_l N_l)) Σ_{i=1}^{N_l} f_t'([h_l^t]_i, [x_l]_i).    (4.6)

In (4.6), the derivative of ft : R2 → R is w.r.t. the first argument. We assume that ηt is Lipschitz

for all t ≥ 0, then it follows that ft is Lipschitz for all t ≥ 0. Consequently, the weak derivative

f ′t is well-defined. Further, f ′t is assumed to be differentiable, except possibly at a finite number

of points, with bounded derivative wherever it exists. In (4.5), quantities with negative indices

or with index θ(t) = 0 (i.e., t < k) are set to be zero.

To see the equivalence between Algorithm 1 and the recursion defined in (4.5) and (4.6), we


let x_l^0 = 0, r_l^0 = 0, z_l^0 = 0, and

h_l^{t+1} = x_l − (A_l^* z_l^t + x_l^t),   q_l^t = x_l^t − x_l,   b_l^t = r_l^t − A_l x_l,   m_l^t = −z_l^t.

Let (σ_l^0)^2 = δ_l^{-1} E[X^2]. We assume that (σ_l^0)^2 is strictly positive for all l ∈ [L]. Note that the sub-Gaussian assumption on x, where we recall a property of sub-Gaussian distributions in (2.13), and the definition q_l^0 = −x_l imply that for all ε ∈ (0, 1), there exist K, κ > 0 such that

P( | ‖q_l^0‖^2/n − (σ_l^0)^2 | ≥ ε ) ≤ K e^{−κ n ε^2},   ∀l ∈ [L].

Define the state evolution sequences {τ_l^t}_{t≥0} and {σ_l^t}_{t≥1} for recursion (4.5) as follows:

(τ_l^t)^2 = (σ_l^t)^2 + Σ_{u≠l} (σ_u^{θ(t)})^2 + σ_w^2,   (σ_l^t)^2 = (1/δ_l) E[(f_t(τ_l^{t−1} Z, X))^2],    (4.7)

where Z ∼ N(0, 1) and X ∼ p_x are independent. Notice that the state evolution sequences defined in (4.7) match (4.2)-(4.4).

Writing the updating equations for b_l^t, h_l^{t+1} defined in (4.5) in matrix form, we have

X_l^t = A_l^* M_l^t,   Y_l^t = A_l Q_l^t,

where

X_l^t = [h_l^1 + q_l^0 | h_l^2 + q_l^1 | ··· | h_l^t + q_l^{t−1}],   Y_l^t = [b_l^0 | b_l^1 + λ_l^1 m_l^0 | ··· | b_l^{t−1} + λ_l^{t−1} m_l^{t−2}],
M_l^t = [m_l^0 | m_l^1 | ··· | m_l^{t−1}],   Q_l^t = [q_l^0 | q_l^1 | ··· | q_l^{t−1}].

Let (m_l^t)_∥ and (q_l^t)_∥ denote the projections of m_l^t and q_l^t onto the column spaces of M_l^t and Q_l^t, respectively. That is,

(m_l^t)_∥ = M_l^t ((M_l^t)^* M_l^t)^{-1} (M_l^t)^* m_l^t,   (q_l^t)_∥ = Q_l^t ((Q_l^t)^* Q_l^t)^{-1} (Q_l^t)^* q_l^t.

Let α_l^t and γ_l^t be the coefficient vectors of these projections:

α_l^t = ((M_l^t)^* M_l^t)^{-1} (M_l^t)^* m_l^t,   γ_l^t = ((Q_l^t)^* Q_l^t)^{-1} (Q_l^t)^* q_l^t.

Moreover, define

(m_l^t)_⊥ = m_l^t − (m_l^t)_∥,   (q_l^t)_⊥ = q_l^t − (q_l^t)_∥.


4.1.3.2 Concentration Constants

Let {Z̃_l^t}_{t≥0} and {Z_l^t}_{t≥0} each be a sequence of zero-mean jointly Gaussian random variables whose covariance is defined recursively as follows. For t, r ≥ 0,

E[Z̃_l^r Z̃_l^t] = Ẽ_l^{r,t} / (σ_l^r σ_l^t),   E[Z_l^r Z_l^t] = E_l^{r,t} / (τ_l^r τ_l^t),    (4.8)

where

E_l^{r,t} = Ẽ_l^{r,t} + Σ_{u≠l} Ẽ_u^{θ(r),θ(t)} + σ_w^2,   Ẽ_l^{r,t} = δ_l^{-1} E[f_r(τ_l^{r−1} Z_l^{r−1}, X) f_t(τ_l^{t−1} Z_l^{t−1}, X)].    (4.9)

Moreover, Z̃_l^r is independent of Z̃_u^t and Z_l^r is independent of Z_u^t when l ≠ u. Note that according to (4.7), we have E_l^{t,t} = (τ_l^t)^2, Ẽ_l^{t,t} = (σ_l^t)^2, and E[(Z_l^t)^2] = E[(Z̃_l^t)^2] = 1. In (4.9), quantities with negative indices or with either θ(t) = 0 or θ(r) = 0 are set to be zero.

Define matrices C_l^t, C̃_l^t ∈ R^{t×t} and vectors E_l^t, Ẽ_l^t ∈ R^t, for l ∈ [L], whose entries are E_l^{r,t} and Ẽ_l^{r,t}, as follows:

[C_l^t]_{r+1,s+1} = E_l^{r,s},   [C̃_l^t]_{r+1,s+1} = Ẽ_l^{r,s},   ∀r, s = 0, ..., t−1,
E_l^t = (E_l^{0,t}, E_l^{1,t}, ..., E_l^{t−1,t}),   Ẽ_l^t = (Ẽ_l^{0,t}, Ẽ_l^{1,t}, ..., Ẽ_l^{t−1,t}).

Define the concentrating values for α_l^t and γ_l^t as

γ̂_l^t = (C̃_l^t)^{-1} Ẽ_l^t,   α̂_l^t = (C_l^t)^{-1} E_l^t.

Let (σ_l^0)_⊥^2 = (σ_l^0)^2 and (τ_l^0)_⊥^2 = (τ_l^0)^2, and for t > 0, define

(σ_l^t)_⊥^2 = (σ_l^t)^2 − (γ̂_l^t)^* Ẽ_l^t = (σ_l^t)^2 − (Ẽ_l^t)^* (C̃_l^t)^{-1} Ẽ_l^t,
(τ_l^t)_⊥^2 = (τ_l^t)^2 − (α̂_l^t)^* E_l^t = (τ_l^t)^2 − (E_l^t)^* (C_l^t)^{-1} E_l^t.

Lemma 4.1.1. The matrices C_l^t and C̃_l^t, ∀t ≥ 0, defined above are invertible, and the scalars (σ_l^t)_⊥^2 and (τ_l^t)_⊥^2, ∀t ≥ 0, defined above are strictly positive.

Proof. The proof that C̃_l^t is invertible and (σ_l^t)_⊥^2 is strictly positive is the same as in [104]. Now consider C_l^{t+1}. Notice that C_l^{t+1} is the sum of a positive definite matrix (C̃_l^{t+1}) and L positive semi-definite matrices; hence, C_l^{t+1} is positive definite. Consequently,

det(C_l^{t+1}) = det(C_l^t) det((τ_l^t)^2 − (E_l^t)^* (C_l^t)^{-1} E_l^t) > 0,


which implies (τ_l^t)^2 − (E_l^t)^* (C_l^t)^{-1} E_l^t = (τ_l^t)_⊥^2 > 0.

4.1.3.3 Conditional Distribution Lemmas

Let the sigma algebra S_{t_1,t} be generated by x, w, b_l^0, ..., b_l^{t_1−1}, m_l^0, ..., m_l^{t_1−1}, h_l^1, ..., h_l^t, q_l^0, ..., q_l^t, ∀l ∈ [L]. We now compute the conditional distribution of (A_1, ..., A_L) given S_{t_1,t}, where t_1 is either t or t+1.

Notice that conditioning on S_{t_1,t} is equivalent to conditioning on the linear constraints:

A_l Q_l^{t_1} = Y_l^{t_1},   A_l^* M_l^t = X_l^t,   for l ∈ [L],

where only the A_l, for l ∈ [L], are treated as random.

Let P_{Q_l^{t_1}}^∥ = Q_l^{t_1}((Q_l^{t_1})^* Q_l^{t_1})^{-1}(Q_l^{t_1})^* and P_{M_l^t}^∥ = M_l^t((M_l^t)^* M_l^t)^{-1}(M_l^t)^*, which are the projectors onto the column spaces of Q_l^{t_1} and M_l^t, respectively. The following lemma provides the conditional distribution of the matrices (A_1, ..., A_L) given S_{t_1,t}.

Lemma 4.1.2. For t_1 = t or t+1, the conditional distribution of the random matrices (A_1, ..., A_L) given S_{t_1,t} satisfies

(A_1, ..., A_L) | S_{t_1,t} =^d (E_1^{t_1,t} + P_{M_1^t}^⊥ Ã_1 P_{Q_1^{t_1}}^⊥, ..., E_L^{t_1,t} + P_{M_L^t}^⊥ Ã_L P_{Q_L^{t_1}}^⊥),

where P_{Q_l^{t_1}}^⊥ = I − P_{Q_l^{t_1}}^∥ and P_{M_l^t}^⊥ = I − P_{M_l^t}^∥. Here Ã_l =^d A_l and Ã_l is independent of S_{t_1,t}. Moreover, Ã_l is independent of Ã_r for l ≠ r. E_l^{t_1,t} is defined as

E_l^{t_1,t} = Y_l^{t_1}((Q_l^{t_1})^* Q_l^{t_1})^{-1}(Q_l^{t_1})^* + M_l^t((M_l^t)^* M_l^t)^{-1}(X_l^t)^* − M_l^t((M_l^t)^* M_l^t)^{-1}(M_l^t)^* Y_l^{t_1}((Q_l^{t_1})^* Q_l^{t_1})^{-1}(Q_l^{t_1})^*.

Proof. See Appendix C.1.

Combining the results in Lemma 4.1.2 and [104, Lemma 4], we have the following conditional

distribution lemma.

Lemma 4.1.3. For the vectors h_l^{t+1} and b_l^t defined in (4.5), the following holds for t ≥ 1,


l ∈ [L]:

b_l^0 | S_{0,0} =^d (σ_l^0)_⊥ Z'_l^0 + Δ_l^{0,0},   h_l^1 | S_{1,0} =^d (τ_l^0)_⊥ Z_l^0 + Δ_l^{1,0},

b_l^t | S_{t,t} =^d Σ_{i=0}^{t−1} γ̂_{l,i}^t b_l^i + (σ_l^t)_⊥ Z'_l^t + Δ_l^{t,t},   h_l^{t+1} | S_{t+1,t} =^d Σ_{i=0}^{t−1} α̂_{l,i}^t h_l^{i+1} + (τ_l^t)_⊥ Z_l^t + Δ_l^{t+1,t},

where

Δ_l^{0,0} = ( ‖(q_l^0)_⊥‖/√n − (σ_l^0)_⊥ ) Z'_l^0,

Δ_l^{1,0} = [ ( ‖(m_l^0)_⊥‖/√n − (τ_l^0)_⊥ ) I − (‖(m_l^0)_⊥‖/√n) P_{q_l^0}^∥ ] Z_l^0 + q_l^0 ( ‖q_l^0‖^2/n )^{-1} ( (b_l^0)^*(m_l^0)_⊥/n − ‖q_l^0‖^2/n ),

Δ_l^{t,t} = Σ_{i=0}^{t−1} ( γ_{l,i}^t − γ̂_{l,i}^t ) b_l^i + [ ( ‖(q_l^t)_⊥‖/√n − (σ_l^t)_⊥ ) I − (‖(q_l^t)_⊥‖/√n) P_{M_l^t}^∥ ] Z'_l^t
   + M_l^t ( (M_l^t)^* M_l^t / n )^{-1} ( (H_l^t)^*(q_l^t)_⊥/n − ((M_l^t)^*/n) [ λ_l^t m_l^{t−1} − Σ_{i=1}^{t−1} λ_l^i γ̂_{l,i}^t m_l^{i−1} ] ),

Δ_l^{t+1,t} = Σ_{i=0}^{t−1} ( α_{l,i}^t − α̂_{l,i}^t ) h_l^{i+1} + [ ( ‖(m_l^t)_⊥‖/√n − (τ_l^t)_⊥ ) I − (‖(m_l^t)_⊥‖/√n) P_{Q_l^{t+1}}^∥ ] Z_l^t
   + Q_l^{t+1} ( (Q_l^{t+1})^* Q_l^{t+1} / n )^{-1} ( (B_l^{t+1})^*(m_l^t)_⊥/n − ((Q_l^{t+1})^*/n) [ q_l^t − Σ_{i=0}^{t−1} α̂_{l,i}^t q_l^i ] ),

where Z'_l^t and Z_l^t are R^n-valued and R^{N_l}-valued random vectors, respectively, with i.i.d. N(0, 1) entries, and are independent of the corresponding sigma algebras. Moreover, Z'_l^t is independent of Z'_r^t and Z_l^t is independent of Z_r^t when l ≠ r.

Proof. The proof of the expressions of the conditional distributions above for each individual processor is the same as in [104]. The independence of Z'_l^t and Z'_r^t, as well as of Z_l^t and Z_r^t, for l ≠ r, is obtained by incorporating into the proof the result that Ã_r is independent of Ã_l, as established in Lemma 4.1.2.

Note that the conditional distribution of {h_l^{t+1}}_{t≥0} and {b_l^t}_{t≥0} established in Lemma 4.1.3 for C-MP-AMP is similar in form to that of AMP provided by Rush and Venkataramanan [104, Lemma 4.3]. Because of this similarity, the extension of [104, Lemma 4.4], which directly leads to the performance guarantee of AMP in [104, Theorem 3.1], to proving the performance guarantee of C-MP-AMP as stated in Theorem 4.1.1 is straightforward. Specifically, we replace Part (b)(iii) in [104, Lemma 4.4] with the following:


P( | (1/n) Σ_{i=1}^{n} φ_b([b_1^0]_i, ..., [b_L^0]_i, ..., [b_1^t]_i, ..., [b_L^t]_i, w_i) − E[φ_b(σ_1^0 Z̃_1^0, ..., σ_L^0 Z̃_L^0, ..., σ_1^t Z̃_1^t, ..., σ_L^t Z̃_L^t, W)] | ≥ ε ) ≤ K e^{−κ n ε^2},

where φ_b : R^{L(t+1)+1} → R is pseudo-Lipschitz, {Z̃_l^t}_{l∈[L],t≥0} is defined in (4.8), and W ∼ p_w

is independent of {Z̃_l^t}_{l∈[L],t≥0}. Using Lemma 4.1.3 and the concentration constants defined in Section 4.1.3.2, we can carry out an induction proof similar to that in [104], as well as in Section 2.4. We omit the detailed proof here to avoid repetition.

4.1.4 Numerical Examples

In this section, we provide numerical results for C-MP-AMP for both the Gaussian matrix

setting and non-Gaussian matrix setting. In the Gaussian matrix setting, where state evolution

is justified rigorously in Theorem 4.1.1, we numerically verify state evolution and the properties

implied by state evolution. In the non-Gaussian matrix setting, where state evolution is not

justified for AMP or C-MP-AMP, we show numerical evidence that C-MP-AMP converges

when damping [98], which is commonly used in AMP for non-Gaussian matrices to improve

the convergence performance, is applied. In all simulations, entries of the unknown vector x are

independent realizations of a Bernoulli-Gaussian random variable X, which has density function

f_x(x) = 0.9 δ_0(x) + 0.1 (1/√(2π)) e^{−x^2/2}. The measurement noise vector w has i.i.d. Gaussian N(0, σ_w^2) entries, where σ_w^2 depends on the SNR as SNR := 10 log_{10}((N E[X^2])/(n σ_w^2)). The denoiser η_{s,t} is defined as η_{s,t}(u) = E[X | X + τ_l^{s,t} Z = u], where Z ∼ N(0, 1) is independent of X. The noise level τ_l^{s,t} is estimated by ‖z_l^{s,t}‖/√n, which is implied by the proof of state evolution. All numerical results are averaged over 50 realizations of (x, A, w).
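For reference, the conditional-expectation denoiser for this Bernoulli-Gaussian prior has a standard closed form; the sketch below is written for illustration (with illustrative names) rather than copied from the simulation code:

```python
import numpy as np

def bg_denoiser(u, tau, sparsity=0.1):
    """E[X | X + tau*Z = u] for X ~ (1-sparsity)*delta_0 + sparsity*N(0,1)."""
    v = 1.0 + tau ** 2                                       # Var(X + tau*Z | X nonzero)
    num = sparsity * np.exp(-0.5 * u ** 2 / v) / np.sqrt(v)  # nonzero component likelihood
    den = num + (1 - sparsity) * np.exp(-0.5 * u ** 2 / tau ** 2) / tau
    pi = num / den                                           # posterior P(X != 0 | u)
    return pi * u / v                                        # posterior mean of X
```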

Gaussian matrix setting: We first show that the MSE of C-MP-AMP is accurately pre-

dicted by state evolution when the matrix A has i.i.d. Gaussian entries with Ai,j ∼ N (0, 1/n). It

can be seen from Figure 4.1a that the MSE achieved by C-MP-AMP from simulations matches

that predicted by state evolution at every outer iteration s and inner iteration t for various

choices of numbers of inner iterations.

As we have discussed in Remark 4.1.1, the estimation error of C-MP-AMP is no worse than

that of AMP, which implies that C-MP-AMP can achieve the minimum mean squared error

(MMSE) of large random linear systems with i.i.d. Gaussian matrix [49, 99] whenever AMP


[Figure 4.1 contains two panels: (a) verification of state evolution, plotting MSE versus iteration for the state evolution predictions with 1, 2, and 4 inner iterations against simulation; (b) verification of MMSE achieving, plotting MSE versus measurement rate δ against the MMSE at 10 dB and 15 dB.]

Figure 4.1 C-MP-AMP for Gaussian matrices.

achieves it.2 This point is verified in Figure 4.1b.

Non-Gaussian matrix setting: The non-Gaussian matrices used in our simulation model

the third order Taylor expansion of a function g : RJ → R. The first J columns contain i.i.d.

Gaussian entries, and the rest of the columns are obtained by taking element-wise products

of each pair of the J columns (second order terms) and each triple of the J columns (third

order terms). Hence, the unknown vector x contains the coefficients in the Taylor expansion.

The matrix is normalized to have column norm equal to 1. In our simulations, both AMP and C-MP-AMP diverged with this type of matrix. We use damping as defined in (3.9) to improve the empirical convergence performance. Damping for C-MP-AMP is performed at every processor. That is, the update for x_l^{s,t+1} changes to

x_l^{s,t+1} = λ η_{s,t}(x_l^{s,t} + A_l^* z_l^{s,t}) + (1 − λ) x_l^{s,t},

∀l ∈ [L], with some fixed λ ∈ (0, 1]. Figure 4.2 shows that with the same damping parameter λ,

C-MP-AMP with one inner iteration per outer iteration has the same average dynamics as AMP,

and that increasing the number of inner iterations can reduce the number of outer iterations,

which reduces the communication frequency between the fusion center and the processors while

achieving the same error.
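In code, the per-processor damped update is simply a convex combination of the new and previous estimates, for example (a sketch with illustrative names; λ = 1 recovers the undamped update and η is any separable denoiser):

```python
def damped_update(x_l, A_l, z_l, eta, lam):
    """One damped denoising step for processor l."""
    return lam * eta(x_l + A_l.T @ z_l) + (1.0 - lam) * x_l
```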

2AMP can achieve the MMSE in the limit of large linear systems when the model parameters (n/N , signalto noise ratio, sparsity of the unknown x) are within a region [65].


[Figure 4.2 plots MSE versus iteration for AMP with damping parameters λ = 0.1, 0.2, 0.3 and for C-MP-AMP with λ = 0.1 using one and two inner iterations per outer iteration.]

Figure 4.2 C-MP-AMP for non-Gaussian matrices.

4.2 Two-Part Reconstruction Framework

In this section, we develop a two-part reconstruction framework for sparse signal recovery from

linear systems (1.2), where a fast algorithm is applied to perform partial support recovery in

Part 1, and a linear inverse algorithm is applied to complete the remaining problem in Part 2. To

exploit the advantages of the two-part framework, we propose the Noisy-Sudocodes algorithm

that performs two-part reconstruction of sparse signals in the presence of measurement noise.

Specifically, we design a fast algorithm for Part 1 of Noisy-Sudocodes that identifies the zero-

valued entries in the input signal from its noisy linear measurements. While many linear inverse

algorithms could be applied to Part 2, we use approximate message passing (AMP) as an

example to analyze the trade-off between runtime and reconstruction quality, and use a 1-

bit compressed sensing reconstruction algorithm to demonstrate the application of the two-

part framework to 1-bit compressed sensing, where each entry of the measurement vector y is

quantized using only one bit, hence binary-valued.

4.2.1 The Noisy-Sudocodes Algorithm

We design a fast algorithm to identify the zero-valued coordinates of the input signal, which is

suitable for Part 1. The Noisy-Sudocodes algorithm is then defined as a two-part algorithm that

applies the zero-identification algorithm to Part 1, and a linear inverse algorithm to Part 2.

Consider the linear inverse problem (1.2), where the input signal x is real-valued, and as-

sume that any subset of the nonzero coordinates of x does not sum up to zero. Similar to


Sudocodes [105], the measurements of Noisy-Sudocodes are acquired via a sparse and binary

measurement matrix A1 ∈ Rn1×N in Part 1 and a dense matrix A2 ∈ Rn2×N in Part 2, so that

in total n = n1 + n2 measurements are used. Denote the measurement noise in Parts 1 and 2

by w1 and w2, respectively. The noisy measurement systems in the two parts are given by:

Part 1: y1 = A1x + w1, (4.10)

Part 2: y2 = A2x + w2. (4.11)

Let x_1 be the reconstructed signal in Part 1. Define Ω_i^row and Ω_j^col as the support sets (sets of indices of nonzeros) of the ith row and the jth column of A_1, respectively, where i ∈ [n_1] and j ∈ [N]. Let ε ≥ 0 be a constant that depends on the noise level.3 Define an index set that contains the indices of small-magnitude measurements as

Ω_y = {i : |[y_1]_i| ≤ ε, i ∈ [n_1]}.    (4.12)

The Noisy-Sudocodes algorithm proceeds as follows:

Part 1: The measurement vector y1 is acquired via (4.10), and thus each [y1]i is the

summation of a subset of coordinates of x that depends on Ω_i^row. If there were no measurement

noise, as in the Sudocodes algorithm [105], then under our assumptions on the input x, a zero-

valued measurement can only be the summation of coordinates with zero-value. In other words,

if [y_1]_i = 0, then x_{Ω_i^row} = 0. However, in the presence of measurement noise, a measurement

is (very) unlikely to be precisely zero. Moreover, a small-magnitude measurement could have

measured a combination of multiple large-magnitude coordinates, though with small probability

p. Nevertheless, it is unlikely that a large-magnitude coefficient could appear in multiple small-

magnitude measurements (if p is small, then p^n decreases quickly as n increases). The Noisy-

Sudocodes algorithm identifies a coefficient to be zero when it is involved in c or more small-

magnitude measurements, where c is a tuning parameter that governs the zero-identification

criterion.4 For those coordinates of x that do not satisfy the zero-identification criterion, we

record their indices in a set T. That is, T = {j : |Ω_j^col ∩ Ω_y| < c, j ∈ [N]}, where | · | denotes

cardinality. Unlike Sudocodes [105], in which some of the nonzero-valued coordinates can be

perfectly recovered in Part 1 because the measurements are noiseless, Noisy-Sudocodes leaves

the reconstruction of nonzero coordinates for Part 2, where a more reliable algorithm is applied.

Part 2: Solve the remaining reconstruction problem with a linear inverse algorithm R. The

3We will see how to optimize ε in Section 4.2.3.
4We will see how to optimize c in Section 4.2.3.


Algorithm 2 Noisy-Sudocodes

Inputs: y_1, ε, c, Ω^col (support sets of columns of A_1), y_2, A_2
Initialization: x̂ = 0, T = ∅, Ω_y = ∅
Part 1: Apply zero-identification criterion
for i = 1 : n_1 do
    if |[y_1]_i| ≤ ε, then Ω_y = Ω_y ∪ {i}
for j = 1 : N do
    if |Ω_j^col ∩ Ω_y| < c, then T = T ∪ {j}
Part 2: Apply linear inverse algorithm R
x̂_T = R(y_2, [A_2]_T)
Outputs: x̂

distribution of the measurement matrix A2 depends on the algorithm R applied to Part 2.

Let xT denote the coordinates of x at the indices T, and [A2]T denote the submatrix formed

by selecting columns of A2 at column indices T. The measurement vector y2 is acquired via

(4.11). After receiving T from Part 1, Part 2 first generates [A2]T from A2. The linear inverse

algorithm R then takes [A_2]_T and y_2, and computes x_2, the reconstructed signal of x_T.

We complete the reconstruction by assigning x2 to the final reconstructed signal x at indices

T. The Noisy-Sudocodes algorithm is summarized in Algorithm 2.
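Part 1 of Algorithm 2 reduces to counting, for each column, how many small-magnitude measurements it participates in; a minimal sketch (using a dense matrix for simplicity, with illustrative names) is:

```python
import numpy as np

def sudocodes_part1(y1, A1, eps, c):
    """Return the index set T of coordinates passed on to Part 2."""
    small = np.abs(y1) <= eps                                   # Omega_y: small-magnitude measurements
    counts = (A1 != 0).astype(int).T @ small.astype(int)        # |Omega_j^col intersect Omega_y| per column j
    return np.flatnonzero(counts < c)                           # coordinates not identified as zero
```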

4.2.2 Performance Analysis

4.2.2.1 Assumptions

We analyze the Noisy-Sudocodes algorithm in a specific setting. The input signal x ∈ RN has

i.i.d. Bernoulli-Gaussian coordinates with density function fx(x) = (1 − s)δ0(x) + sN (0, 1),

where s ∈ (0, 1) is the sparsity rate.

Part 1: The sparse measurement matrix A_1 ∈ R^{n_1×N} has i.i.d. Bernoulli entries, P([A_1]_{i,j} ≠ 0) = d/(sN), where d is a tuning parameter.5 Entries of the measurement noise w_1 are i.i.d. Gaussian

N (0, σ2w).

Part 2: The approximate message passing algorithm (AMP) is applied to Part 2. The

measurement matrix A2 ∈ Rn2×N has i.i.d. Gaussian coordinates, [A2]i,j ∼ N (0, 1/N). The

measurement noise w2 follows the same distribution as w1.

5We will see how to optimize d in Section 4.2.3.


In order to have identical signal to noise ratio (SNR) in Parts 1 and 2, namely, ‖A1x‖2/‖w1‖2 =

‖A_2x‖^2/‖w_2‖^2, the nonzero-valued entries of the Bernoulli matrix A_1 are scaled by √(s/d).

4.2.2.2 Analysis of Part 1

Because only Part 1 will be discussed in this subsection, we drop the subscripts that distinguish

Parts 1 and 2.

Asymptotic independence: The goal of Part 1 is to identify the zero-valued coordinates

of x. Two types of errors could occur in Part 1. The first is missed detection, which is defined

as MD = {j : x_j = 0, x̂_j ≠ 0, j ∈ [N]}. The second is false alarm, which is defined as FA = {j : x_j ≠ 0, x̂_j = 0, j ∈ [N]}, where x̂_j denotes the Part 1 estimate of x_j (x̂_j = 0 if coordinate j is identified as zero). Let {I_{i,j} : i ∈ [n], j ∈ [N]} be a set of binary random

variables, where Ii,j = 1 if the following two conditions are satisfied: (i) |yi| < ε given that the

value of the jth coordinate is xj and that the jth coordinate is involved in yi; and (ii) Ai,j 6= 0,

which means that the jth coordinate is indeed involved in yi. Denoting P (Ii,j = 1) by Pε,d(xj),

we have

P_{ε,d}(x_j) = P( |y_i| < ε, A_{i,j} = √(s/d) | x_j )
= Σ_{n=0}^{N−1} (1/2) [ erf( (ε − √(s/d) x_j) / √(2(ns/d + σ_w^2)) ) − erf( (−ε − √(s/d) x_j) / √(2(ns/d + σ_w^2)) ) ] \binom{N−1}{n} (d/N)^n (1 − d/N)^{N−1−n} · d/(sN),    (4.13)

where erf(x) = (2/√π) ∫_0^x e^{−t^2} dt is the error function. Further, define the sum of {I_{i,j}} along i as

S_j = Σ_{i=1}^{n} I_{i,j},   j ∈ [N].    (4.14)

We can now rewrite the zero-identification criterion as Sj ≥ c, and define the probability of

missed detection (PMD) and the probability of false alarm (PFA) as:

P_MD = P(x̂_j ≠ 0 | x_j = 0) = P(S_j < c | x_j = 0),
P_FA = P(x̂_j = 0 | x_j ≠ 0) = P(S_j ≥ c | x_j ≠ 0).

Note that there are subtle dependencies in y1. If a subset of nonzero coordinates of x is

involved in multiple coordinates of y1, then the magnitudes of those entries of y1 are dependent.

Therefore, for each j, {I_{i,j}}_{i=1}^{n} is not independent along i, and thus S_j is a sum of dependent

Bernoulli random variables. Nevertheless, the following lemma shows that dependencies in y1


vanish under certain conditions.

Lemma 4.2.1. Let the input signal and the measurement matrix of Part 1 be defined in Sec-

tion 4.2.2.1, and let Pε,d(xj) and Sj be defined in (4.13) and (4.14), respectively. In the limit

of large systems as the signal dimension N goes to infinity, for each j ∈ [N ], Sj converges to

SB in distribution, where SB ∼ Binomial(n, Pε,d(xj)).

Proof. See Appendix C.2.

The main point is that the joint characteristic function of y1 can be factorized as the product

of its marginal characteristic functions, which implies that entries of y1 are asymptotically

independent, and thus for each j we have that {I_{i,j}}_{i=1}^{n} is asymptotically independent along i.

Therefore, Sj converges to a sum of i.i.d. Bernoulli random variables in distribution.

Using Lemma 4.2.1, PMD and PFA can be calculated as follows:

P_MD = P(x̂_j ≠ 0 | x_j = 0) = Σ_{m=0}^{c−1} \binom{n}{m} P_{ε,d}(0)^m (1 − P_{ε,d}(0))^{n−m},    (4.15)

P_FA(x) = P(x̂_j = 0 | x_j = x) = 1 − Σ_{m=0}^{c−1} \binom{n}{m} P_{ε,d}(x)^m (1 − P_{ε,d}(x))^{n−m},

P_FA = ∫_{−∞}^{∞} P_FA(x) (1/√(2π)) e^{−x^2/2} dx.    (4.16)
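Equations (4.13), (4.15), and (4.16) can be evaluated numerically under the asymptotic independence of Lemma 4.2.1; the following sketch (using SciPy for the error function, binomial probabilities, and the integral; parameter names are illustrative and the direct sum over all N terms is slow for large N) shows one way to do so:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import binom
from scipy.integrate import quad

def p_eps_d(x, eps, d, s, sigma_w, N):
    """Probability in (4.13) that I_{i,j} = 1 given the coordinate value x."""
    m = np.arange(N)                                   # number of other nonzeros hitting y_i
    scale = np.sqrt(2.0 * (m * s / d + sigma_w ** 2))
    window = 0.5 * (erf((eps - np.sqrt(s / d) * x) / scale)
                    - erf((-eps - np.sqrt(s / d) * x) / scale))
    weights = binom.pmf(m, N - 1, d / N)               # other coordinates involved and nonzero
    return float(np.sum(window * weights) * d / (s * N))

def p_md(eps, d, s, sigma_w, N, n, c):
    return float(binom.cdf(c - 1, n, p_eps_d(0.0, eps, d, s, sigma_w, N)))   # (4.15)

def p_fa(eps, d, s, sigma_w, N, n, c):
    def integrand(x):
        tail = binom.sf(c - 1, n, p_eps_d(x, eps, d, s, sigma_w, N))          # P_FA(x)
        return tail * np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
    return quad(integrand, -8, 8)[0]                                          # (4.16)
```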

We can now compute the quantities that might affect the performance of Part 2. The

expected length N̄ and the expected sparsity rate s̄ of x_T can be calculated as:

N̄ = N P(x̂_j ≠ 0) = N [(1 − s) P_MD + s (1 − P_FA)],

and

s̄ = s N (1 − P_FA) / N̄ = s (1 − P_FA) / [(1 − s) P_MD + s (1 − P_FA)];

the distribution of x_T, which is denoted by P_{x_j}(x | j ∈ T), can be calculated as:

P_{x_j}(x | j ∈ T) = P(x_j = x | x̂_j ≠ 0) = [(1 − s) P_MD δ(x)] / [(1 − s) P_MD + s (1 − P_FA)] + [s (1 − P_FA(x)) (1/√(2π)) e^{−x^2/2}] / [(1 − s) P_MD + s (1 − P_FA)];    (4.17)

the distribution of x_FA, which is denoted by P_{x_j}(x | j ∈ FA), can be calculated as:

P_{x_j}(x | j ∈ FA) = P(x_j = x | x_j ≠ 0, x̂_j = 0) = P_FA(x) (1/√(2π)) e^{−x^2/2} / P_FA;


[Figure 4.3 plots, on log-log axes versus N (10^3 to 10^5), the relative errors Err(MD) (top) and Err(FA) (bottom).]

Figure 4.3 Top: Relative error between the empirical and theoretical probability of missed detection. Bottom: Relative error between the empirical and theoretical probability of false alarm. (The theoretical probabilities rely on the asymptotic independence result of Lemma 4.2.1.)

and the expected value of the norm of xFA can be calculated as

E[‖x_FA‖^2] = sN ∫_{−∞}^{∞} x^2 P_FA(x) (1/√(2π)) e^{−x^2/2} dx.    (4.18)

simulate Part 1 of Noisy-Sudocodes with different input lengths N , and record the empirical

probability of missed detection (P emMD) and the empirical probability of false alarm (P em

FA ), where

we remind the reader that the corresponding theoretical predictions PMD and PFA are given by

(4.15) and (4.16), and these predictions rely on the asymptotic independence result of Lemma

4.2.1. Define the relative error between PMD and P emMD as

Err(MD) =|PMD − P em

MD|PMD

;

the definition of Err(FA) is similar to that of Err(MD). We plot Err(MD) and Err(FA) as

functions of N in Figure 4.3. It is shown in Figure 4.3 that the error due to the independence

assumption in the measurements vanishes at a rate polynomial in N .

4.2.2.3 Noisy-Sudocodes with AMP in Part 2

Gaussianity of noise: Recall that Part 2 only considers the residual problem left over from

86

Page 97: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Part 1. That is, Part 2 only solves for x at the indices T. The missed detection errors in Part 1

result in the zero entries of xT, whereas the false alarm errors in Part 1 result in an extra noise

term for Part 2. The extra noise term is generated by vFA = [A2]FAxFA, where [A2]FA represents

the submatrix formed by selecting columns of A2 at the indices FA. The problem for Part 2 is

modeled as

y2 = [A2]FAxT + vFA + w2. (4.19)

Because vFA is a linear mixing of xFA, entries of vFA are not independent. Nevertheless, the

following lemma shows that vFA converges to an i.i.d. Gaussian random vector in distribution.

Lemma 4.2.2. Let vFA be defined in (4.19), and E[‖[x]FA‖2] be calculated in (4.18). The extra

noise term vFA converges to z in distribution, where z ∼ N (0, σ2FAI) and σ2

FA = E[‖[x]FA‖2]/N .

Proof. See Appendix C.3.

The main point is that vFA is a sum of i.i.d. random vectors, which converges to a mul-

tivariate Gaussian random vector in distribution. It can be shown that vFA has uncorrelated

coordinates. Therefore, vFA converges to an uncorrelated Gaussian random vector.

To numerically verify the Gaussianity of vFA, we plot the sample quantiles of vFA versus

theoretical quantiles from a normal distribution (uantile-quantile plot, QQ plot for short). It is

shown in the top panel of Figure 4.4a that the entries of vFA lie on a straight line in the QQ

plot, which implies that vFA is marginally Gaussian. Next, we test the empirical correlation

among the entries of vFA, and the resulting empirical correlation is 0.025, which is close to the

empirical correlation of an i.i.d. Gaussian random vector of the same length. Therefore, it is

verified that vFA converges to an i.i.d. Gaussian random vector.

Performance analysis with AMP in Part 2: To simplify notations, define y = y2,

A = [A2]T, x = xT, and w = vFA + w2. Problem (4.19) can now be rewritten as

y = Ax + w, (4.20)

where A ∈ Rn2×N has i.i.d. Gaussian entries, Ai,j ∼ N (0, 1/N), x ∈ RN is i.i.d. with xj ∼Pxj (x | j ∈ T) (4.17), and w is asymptotically i.i.d. Gaussian with zero mean and its variance

satisfies σ2w = E[‖[x]]FA‖2]/N + σ2

w, with σ2w being the variance of w2.

87

Page 98: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

We now apply AMP to Part 2 as follows:

zt = y − Axt +zt−1

n2

N∑i=1

η′t−1

(N

n2A∗zt−1 + xt−1

), (4.21)

xt+1 = ηt

(N

n2A∗zt + xt

). (4.22)

Due to different measurement matrix normalization schemes, a scaling factor of N/n2 is applied

to the AMP updating equations (4.21) and (4.22). The state evolution sequence τ2t t≥0 is

iteratively defined as

τ2t+1 =

N

n2σ2z +

1

δE[(ηt(X + τtZ)−X)2

],

where X ∼ Pxj (x | j ∈ T) and Z ∼ N (0, 1) independent of X.

In order to approximate the MMSE estimation, let ηt be the Bayes-optimal denoiser. The

prior of xj is Pxj (x|j ∈ T). Note that when the true distribution of xT (4.17) is applied, AMP

with i.i.d. Gaussian measurement matrix yields the Bayes-optimal reconstruction for (4.20) in

the limit of large systems (i.e., n2, N → ∞ for constant δ) for a large region of parameters

(signal sparsity, measurement rate, and measurement noise) [65, 130].

We notice that x no longer follows a Bernoulli-Gaussian distribution due to the false alarm

errors in Part 1. A comparison between the distribution of the nonzero coordinates of x and

a standard normal distribution is shown in the bottom panel of Figure 4.4a. Significant dis-

crepancies appear in bins centered around x = 0, because most false alarm errors occur when

the coordinates have small magnitudes. Notice that the entire x is a sparse signal, which has

a probability mass at x = 0. We might think of x as a Bernoulli-Gaussian signal whose small-

magnitude coordinates are approximated as 0, which results in a loss of density around x = 0

and an increase in the probability mass at x = 0. It would be interesting to see how large the

performance gap would be if we approximate the prior of x by a Bernoulli-Gaussian distribu-

tion when calculating the conditional expectation, because a Bernoulli-Gaussian distribution

can simplify both the computation and the analysis.

Figure 4.4b compares the SDR, which is defined in (3.10), achieved by the theoretical pre-

diction and the numerical results for Noisy-Sudocodes with AMP in Part 2. The prediction

for Part 1 follows the analysis in Section 4.2.2.2, and the MMSE for Part 2 (4.20) applies the

replica method for a Bernoulli-Gaussian input [48, 97]. The empirical results contain: (i) zero-

identification in Part 1 followed by AMP with the Bernoulli-Gaussian prior for x in Part 2; (ii)

zero-identification in Part 1 followed by AMP with the true distribution of x in Part 2. Fig-

ure 4.4b verifies that it is reasonable to approximate Pxj (x|j ∈ T) (4.17) by a Bernoulli-Gaussian

88

Page 99: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

−5 0 5−0.04

−0.02

0

0.02

0.04

Standard Normal Quantiles

Quantile

s o

f zF

A

−4 −3 −2 −1 0 1 2 3 40

0.01

0.02

0.03

0.04

0.05

x

Pro

babili

ty d

ensity d

istr

ibution

Histogram

True pdf

Gaussian pdf

(a) Top: QQ plot of vFA. Bottom: pdf of thenonzero-valued entries in x and N (0, 1).

0.2 0.25 0.3 0.35 0.4 0.45 0.5

Measurement rate

8

10

12

14

16

18

20

Sig

na

l to

Dis

tort

ion

Ra

tio

(d

B)

Theory 10dB

Theory 5dB

Bernoulli-Gaussian

True distribution

(b) Numerical verification of Bernoulli-Gaussianapproximation to the prior of xT (4.17).

Figure 4.4 Numerical verification of approximations made in the analysis of Part 2.

distribution; any deterioration in reconstruction quality seems minor.

4.2.3 Trade-Off between Runtime and Reconstruction Quality

The analysis of the Noisy-Sudocodes algorithm allows us to exploit the advantages provided

by its two-part nature. We notice that 4 parameters in the algorithm can be tuned to provide

different performances in runtime and reconstruction quality: (i) the parameter d that governs

the sparsity of A1; (ii) the threshold ε for defining small-magnitude measurements; (iii) the

parameter c that governs the zero-identification criterion; and (iv) the ratio r of the number of

measurements assigned to Part 1 and Part 2.

It is worth mentioning that the number of AMP iterations could also be tuned. Because

AMP is merely one possible example for the reconstruction algorithm R that can be applied

to Part 2, we leave out this tuning parameter in our analysis and fix the number of iterations

to be 20, within which AMP generally converges for the numerical settings considered in this

work.

Our goal is to find the parameters (d, ε, c, r) that optimize the trade-off between runtime and

reconstruction quality for a given measurement rate. Both runtime and reconstruction quality

are functions of (d, ε, c, r). We have seen how to evaluate the reconstruction quality in terms

of SDR (3.10) in Sections 4.2.2.2 and 4.2.2.3, and let us now model the runtime. Based on the

89

Page 100: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

0 1 2 3 4 5 6 7

0.2

0.6

0.8

15

10

15

20

25

30

Runtime (sec)

0.4

Measurement rate

Sig

na

l to

Dis

tort

ion

Ra

tio

(d

B)

Figure 4.5 Trade-offs between reconstruction quality, measurement rate δ, and runtime of Noisy-Sudocodes with AMP in Part 2.

operations performed in the Noisy-Sudocodes algorithm, we model the runtime of Part 1 by

t1 = α1N + α2n1 + α3Nn1,

for some α = (α1, α2, α3). The runtime for Part 2 is modeled as

t2 = β1N + β2n2 + β3Nn2,

for some β = (β1, β2, β3).

We simulate Part 1 with several different values for N and n1, and α is acquired via data

fitting with a least square criterion. We obtain β in a similar way.

The SDR (3.10) of Noisy-Sudocodes is evaluated with different parameter values of (d, ε, c, r)

at measurement rates δ = n/N ∈ [0.2, 0.9]. Each set of parameters results in a different

(n1, n2, N), and thus different (t1, t2). The total runtime of Noisy-Sudocodes, t = t1 + t2, is

quantized to 30 quantization bins for each R, the optimal SDR corresponding to each quanti-

zation bin is the highest SDR achieved within that bin, and the parameters that lead to the

highest SDR are the optimal parameters.

90

Page 101: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

n1/n

5

10

15

20

25

Sig

na

l to

Dis

tort

ion

Ra

tio

(d

B)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

n1/n

0

2

4

6R

un

tim

e (

se

c)

Prediction =0.5

Prediction =0.4

Prediction =0.3

Simulation

Figure 4.6 Numerical verification of the prediction for SDR (3.10) (top) and runtime (bottom) ofNoisy-Sudocodes with AMP in Part 2.

A plot of SDR as a function of runtime and measurement rate is shown in Figure 4.5. To

achieve low runtime, Part 1 needs to be aggressive in identifying zeros, which results in poor

reconstruction quality. In the low runtime region, we see a significant improvement in SDR with

a small increase in runtime. If we further increase the available runtime, then the high quality

algorithm AMP in Part 2 eventually dominates, and thus high SDR is achieved.

To numerically verify the correctness of our predictions of SDR and runtime, we sample some

points from Figure 4.5 and set up simulations that utilize the corresponding sets of parameters

(d, ε, c, r). Figure 4.6 shows that our predictions match the simulation results in both SDR and

runtime.

4.2.4 Application to 1-Bit Compressed Sensing

In the previous sections, we discussed Noisy-Sudocodes in settings where the measurements

are allowed to have infinite quantization resolution. We notice that the fast zero-identification

algorithm in Part 1 of Noisy-Sudocodes does not benefit from the high resolution measurements,

because we only need to know if the entries of y1 are greater or less than ε. In other words,

the measurements are implicitly quantized to a lower resolution when running Part 1. On the

one hand, we see that the fast Part 1 leads to some compromises in reconstruction quality in

settings where the measurements are unquantized. On the other hand, Part 1 is not penalized

by the loss of quantization resolution in the measurements. This observation naturally leads us


to apply Noisy-Sudocodes to the 1-bit compressed sensing (CS) framework [18].

In 1-bit CS [18, 55, 67, 94, 126, 127], the measurements are quantized to 1 bit per measure-

ment. The problem model for noiseless and noisy 1-bit CS can be formulated as

noiseless 1-bit CS: y = sign(Ax), (4.23)

noisy 1-bit CS: y = sign(Ax + w), (4.24)

where w is the measurement noise before quantization (pre-quantization noise), and

sign(x) =

−1, if x ≤ 0

+1, if x > 0.

It is interesting to notice that Part 1 of Noisy-Sudocodes motivates a new 1-bit quantizer

that performs magnitude quantization. In particular, we define our proposed 1-bit quantizer as:

yi =

−1, if |[Ax]i + wi| ≤ ε

+1, if |[Ax]i + wi| > ε. (4.25)

Note that the threshold ε = 0 when w = 0. If we redefine the index set Ωy (4.12) as

Ω_y = {i : [y_1]_i = −1, i ∈ [n_1]},    (4.26)

then Algorithm 2 can be used to solve 1-bit CS reconstruction problems with Ω_y defined in (4.26)

and a 1-bit CS algorithm in Part 2.
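The magnitude quantizer (4.25) and the index set (4.26) are simple to implement; a sketch with illustrative names (the noiseless case is recovered with ε = 0):

```python
import numpy as np

def magnitude_quantize(Ax, w, eps):
    """1-bit magnitude quantizer (4.25): -1 for small magnitude, +1 otherwise."""
    return np.where(np.abs(Ax + w) <= eps, -1, 1)

def omega_y(y1):
    """Index set (4.26) of measurements flagged as small magnitude."""
    return np.flatnonzero(y1 == -1)
```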

A possible 1-bit CS algorithm that can be utilized is binary iterative hard thresholding

(BIHT) [55]. BIHT often achieves better reconstruction performance than the previous 1-bit

CS algorithms in the noiseless 1-bit CS setting. We show by numerical results in the following

that Noisy-Sudocodes with BIHT in Part 2 (Sudo-BIHT) achieves better reconstruction quality

than directly applying BIHT. Moreover, Sudo-BIHT is substantially faster than BIHT.

We present simulation results that compare Sudo-BIHT and BIHT in terms of SDR (3.10)

and runtime in both noiseless and noisy 1-bit CS settings. Runtime is measured in seconds

on a Dell OPTIPLEX 9010 running an Intel(R) CoreTM i7-3770 with 16GB RAM, and the

environment is MATLAB R2012a.

The input signal x follows a Bernoulli-Gaussian distribution with sparsity rate s = 0.005.

Because the amplitude information of the measurements is lost due to 1-bit quantization, it

is usually assumed in the 1-bit CS framework that ‖x‖2 = 1. Let n1 and n2 be the number


of measurements for Parts 1 and 2 of Sudo-BIHT. Therefore, n = n1 + n2 is the number of

measurements for BIHT. The measurement rate δ = n/N is set to be within the range (0, 2),

which is the same range utilized in the paper where BIHT is proposed [55]. Note that in 1-bit CS,

we are interested in the number of quantization bits rather than the number of measurements.

Therefore, the measurement rate is allowed to be greater than 1. In our simulation, we choose

n1 such that more than 90 percent of the zero coefficients can be identified in Part 1. The

measurement matrix A_1 ∈ R^{n_1×N} is i.i.d. Bernoulli distributed with P([A_1]_{i,j} ≠ 0) = d/(sN), where the parameter d is determined numerically. Note that the nonzero entries of the Bernoulli matrix are scaled by √(sN/d) in order to have the same input SNR as in BIHT. The matrix A_2 ∈ R^{n_2×N}

has i.i.d. Gaussian entries, [A2]i,j ∼ N (0, 1).

For BIHT, the measurement matrix A ∈ Rn×N has i.i.d. Gaussian entries, Ai,j ∼ N (0, 1).

Finally, the pre-quantization noise w, which we use in the noisy setting, is i.i.d. Gaussian

distributed with zero mean and its variance is 10−2.5.

Noiseless setting: BIHT-`1 [55], in which the `1-norm is utilized in the objective function of

the optimization problem solved by BIHT, is applied to the noiseless setting. The measurement

vector y1 for Part 1 of Sudo-BIHT is acquired via (4.25) with w = 0 and ε = 0, and the

measurement vectors y2 for Part 2 of Sudo-BIHT and y for BIHT are acquired via (4.23).

In the noiseless setting, if any entry [y1]i only measures zero coefficients, then [y1]i will be

strictly zero. Therefore, we set c = 1 in the zero-identification criterion. Note that Part 1 does

not introduce any error in the noiseless setting. We iterate over BIHT until the consistency

property of BIHT6 [18] is satisfied or the number of iterations reaches 100.

In the top panel of Figure 4.7a, we plot SDR as a function of the measurement rate δ.

The plot shows that Sudo-BIHT achieves slightly higher SDR than BIHT. As δ increases, the

SDR for both algorithms increases similarly. Note that the measurements acquired in noiseless

1-bit CS include quantization noise. The quantization noise explains why the SDR achieved

in the noiseless 1-bit CS setting is finite, whereas unquantized noiseless measurements yield

perfect reconstruction [22, 35]. In the bottom panel of Figure 4.7a, we plot SDR as a function of runtime. Note that Sudo-BIHT can achieve the same SDR as BIHT despite running an order

of magnitude faster.

Noisy setting: BIHT-`2 [55], in which the `2-norm is utilized in the objective function,

is applied to the noisy setting. Note that BIHT-`2 is more robust to pre-quantization noise

than BIHT-`1. The measurement vector y1 for Part 1 of Sudo-BIHT is acquired via (4.25)

6We say that the consistency property of BIHT [18] is satisfied if applying the measurement and quantization system (4.23) and (4.24) to the reconstructed signal x yields the same measurements y as the original measurements.


[Figure 4.7 has two panels: (a) the noiseless setting (c = 1, ε = 0) and (b) the noisy setting (c = 3, ε = 0.08); each panel plots SDR versus measurement rate (top) and SDR versus runtime (bottom) for Sudo-BIHT and BIHT, with 30 and 130 BIHT iterations compared in the noisy case.]

Figure 4.7 Numerical results of Noisy-Sudocodes with BIHT in Part 2 in a noisy 1-bit CS setting. In both figures, Top: SDR as a function of measurement rate δ. Bottom: SDR as a function of runtime. (n1/N = 0.1, n2 = n − n1, N = 10,000, s = 0.005, d = 0.8)

with ε > 0, and the measurement vectors y2 for Part 2 of Sudo-BIHT and y for BIHT are

acquired via (4.24). We set c = 3, d = 0.8, and ε = 0.08 in our simulations because they lead

to sufficiently good performance in the sense that Sudo-BIHT improves over BIHT in both

runtime and reconstruction quality.

The resulting SDR versus measurement rate δ is shown in the top panel of Figure 4.7b.

When the number of iterations for BIHT is 30 in both Part 2 of Sudo-BIHT and BIHT, Sudo-

BIHT yields better consistency and thus provides better reconstruction quality. With more

iterations, the SDR for both Sudo-BIHT and BIHT improves. The SDR curve of BIHT tends

to get closer to Sudo-BIHT as the number of iterations increases, because for Sudo-BIHT, the

error introduced in Part 1 cannot be corrected by Part 2. We notice that Sudo-BIHT with 130

BIHT iterations (red solid line with circles) improves over BIHT with 30 iterations (blue dotted

line with crosses) by roughly 5 dB for the same measurement rate, and the bottom panel of

Figure 4.7b shows that the red solid line with circles can be 5 dB above the blue dotted line

with crosses despite requiring approximately half of the runtime. In other words, problem size

reduction due to zero-identification in Part 1 allows BIHT in Part 2 to run more iterations to

improve reconstruction quality with reasonable runtime.


4.3 Conclusion

This chapter proposed and analyzed two methods for fast computation in large-scale linear

inverse problems. The first method is a multiprocessor implementation of the AMP algorithm

with column-wise partitioning of the measurement matrix; we called our algorithm C-MP-AMP.

Because in C-MP-AMP, processors only exchange “coded” information, C-MP-AMP may be

useful for applications where different processors cannot share their estimation results due to

privacy concerns. A state evolution analysis of C-MP-AMP was provided under the same model

assumptions as those in the analysis for AMP. More specifically, our result showed that the state

evolution sequences of C-MP-AMP converge to a state where the mean squared error is at least

as good as that for AMP. Moreover, with a specific communication schedule between the fusion

center and the processors, C-MP-AMP enjoys a speedup linear in the number of processors.

The second method is a two-part reconstruction framework that partitions the reconstruc-

tion process into two complementary parts. The partitioning leads to a trade-off between run-

time and reconstruction quality. Applications such as real-time audio or video processing, where

delay in time might be more undesirable than deterioration in quality, may benefit from our

method. A Noisy-Sudocodes algorithm was proposed within the two-part framework. Part 1 of

Noisy-Sudocodes detects zero-valued entries in the input signal, whereas various CS reconstruc-

tion algorithms can serve as candidates for Part 2. We analyzed the speed-quality trade-off of

Noisy-Sudocodes with AMP in Part 2 based on the theoretical characterization that we derived

for Part 1 and the well-established performance analysis of AMP. Moreover, numerical results

for Noisy-Sudocodes with our 1-bit magnitude-quantizer in Part 1 and BIHT in Part 2 im-

plied that Noisy-Sudocodes could be promising for algorithm design in 1-bit CS reconstruction

problems.


Chapter 5

Nonlinear Diffractive Imaging via

Optimization

In this chapter,1 we propose a new image reconstruction method called convergent inverse

scattering via optimization and regularization (CISOR) for fast and memory-efficient nonlinear

diffractive imaging, where the goal is to reconstruct the electric permittivity of an object from

scattered wave measurements. The key contributions of this work are (i) a novel nonconvex for-

mulation that precisely captures the fundamental nonlinear relationship between the scattered

wave and the permittivity contrast, while enabling fast and memory-efficient implementation,

as well as rigorous convergence analysis and (ii) extension of the proposed formulation and al-

gorithm, as well as the convergence analysis, to the 3D vectorial case, which makes our method

applicable in a broad range of engineering and scientific areas.

5.1 Related Work

Conventional methods for inverse scattering usually rely on linearizing the relationship between

the permittivity and the wave measurements. For example, the first Born approximation [16]

and the Rytov approximation [34] are linearization techniques commonly adopted in diffraction

tomography [21, 63, 68, 74, 112, 113]. Other imaging systems that are based on a linear forward

model include optical projection tomography (OPT), optical coherence tomography (OCT),

digital holography, and subsurface radar [20, 26, 27, 33, 59, 69, 75, 96, 106, 119]. One attractive

1The work in this chapter was joint with Ulugbek Kamilov, Hassan Mansour, Petros Boufounos, and Dehong Liu [85, 86]; it was completed while the author was an intern with Mitsubishi Electric Research Laboratories (MERL).


aspect of linear methods is that the inverse problem can be formulated as a convex optimization

problem and solved by various efficient convex solvers [8, 14, 19, 91]. However, linear models

are highly inaccurate when the physical size of the object is large compared to the wavelength

of the incident wave, or when the permittivity contrast of the object relative to the background is

high [25].

Many methods that attempt to integrate the nonlinear relationship between the permittivity

contrast and the scattered wave have been proposed in the literature. The iterative lineariza-

tion (IL) method [11, 24] iteratively computes the forward model using the current estimated

permittivity contrast, and estimates the contrast using the field from the previously computed

forward model. Hence, each sub-problem at each iteration is a linear problem. Contrast source

inversion (CSI) [1, 13, 120] defines an auxiliary variable called the contrast source, which is the

product of the contrast and the field. CSI alternates between minimizing the contrast source and

the contrast itself. Hybrid methods (HM) [12, 89, 128] combine IL and CSI, aiming to benefit

from the advantages of both methods. A comprehensive comparison of these three methods can

be found in the review paper [89]. Recently, the idea of neural network unfolding has inspired a

class of methods that updates the estimates using error backpropagation [60–62, 76, 77]. While

such methods can in principle model the precise nonlinearity, in practice, the accuracy may be

limited by the availability of memory to store the iterates needed to perform unfolding.

A recent and independent work by Soubies et al. [111] has considered the same nonconvex

formulation as in this work for the 2D scalar field setting. Our work differs from [111] in

the following two aspects. First, FISTA [8], which does not have convergence guarantees for

nonconvex problems, is applied in [111] as the nonconvex solver, whereas our method applies

our new variant of FISTA with a rigorous convergence analysis established here. Second, only

the 2D scalar field case has been considered in [111], whereas our work extends to the 3D

vectorial field case. (Please refer to Sections 5.2.1 and 5.2.2 for illustrations of the scalar field

setting and the vectorial field setting, respectively.)

5.2 Problem Formulation

5.2.1 Scalar Field Setting

The problem of inverse scattering based on the scalar theory of diffraction [16, 47] is described

as follows and illustrated in Figure 5.1. Let d = 2 or 3, and suppose that an object is placed within

a bounded domain, Ω ⊂ Rd. The object is illuminated by some incident wave uin, and the

scattered wave usc is measured by the sensors placed in a sensing region Γ ⊂ Rd. Let u denote



Figure 5.1 Visual representation of the measurement scenario considered in this work. An object with a real permittivity contrast χ(r) is illuminated with an input wave uin(r), which interacts with the object and results in the scattered wave usc at the sensor domain Γ ⊂ R2. The complex scattered wave is captured at the sensor and the algorithm proposed here is used for estimating the contrast χ.

the total field, which satisfies $u(\mathbf{r}) = u^{\mathrm{in}}(\mathbf{r}) + u^{\mathrm{sc}}(\mathbf{r})$ for all $\mathbf{r} \in \mathbb{R}^d$. The scalar Lippmann-Schwinger

equation [16] establishes the relationship between wave and permittivity contrast:

\[
u(\mathbf{r}) = u^{\mathrm{in}}(\mathbf{r}) + k^2 \int_{\Omega} g(\mathbf{r}-\mathbf{r}')\, u(\mathbf{r}')\, \chi(\mathbf{r}')\, \mathrm{d}\mathbf{r}', \qquad \forall\, \mathbf{r} \in \mathbb{R}^d.
\]

In the above, χ(r) = ε(r)− εb is the permittivity contrast, where ε(r) is the permittivity of the

object, εb is the permittivity of the background, and k = 2π/λ is the wavenumber in vacuum.

We assume that χ is real, or in other words, the object is lossless. The free-space Green’s

function is defined as follows:

\[
g(\mathbf{r}) =
\begin{cases}
-\dfrac{j}{4}\, H_0^{(1)}(k_b \|\mathbf{r}\|), & \text{if } d = 2,\\[2mm]
\dfrac{e^{j k_b \|\mathbf{r}\|}}{4\pi \|\mathbf{r}\|}, & \text{if } d = 3,
\end{cases}
\tag{5.1}
\]

where $H_0^{(1)}$ is the zero-order Hankel function of the first kind, and $k_b = k\sqrt{\varepsilon_b}$ is the wavenumber

of the background medium. The corresponding discrete system is then

y = Hdiag(u)x + w, (5.2)

u = uin + Gdiag(x)u, (5.3)

where x ∈ RN , u ∈ CN , uin ∈ CN are N uniformly spaced samples of χ(r), u(r), and uin(r) on

Ω, respectively, and y ∈ Cn is the measured scattered wave at the sensors with measurement

error w ∈ Cn. The matrix H ∈ Cn×N is the discretization of Green’s function g(r − r′) with



Figure 5.2 The measurement scenario for the 3D case considered in this work. The object is placed within a bounded image domain Ω. The transmitter antennas (Tx) are placed on a sphere and are linearly polarized. The arrows in the figure define the polarization direction. The receiver antennas (Rx) are placed in the sensor domain Γ within the x-y (azimuth) plane, and are linearly polarized along the z direction.

r ∈ Γ and r′ ∈ Ω, whereas G ∈ CN×N is the discretization of Green’s function with r, r′ ∈ Ω.

The inverse scattering problem is then to estimate x given y, H, G, and uin. Define

L := I−Gdiag(x). (5.4)

Note that (5.2) constrained by (5.3) defines a nonlinear inverse problem, because u depends on

x through u = L−1uin, where L is defined in (5.4). In this work, the linear method refers to the

formulation where u = uin, and the iterative linearization method replaces u in (5.2) with the

estimate u computed from (5.3) using the current estimate x, and assumes that u is a constant

w.r.t. x.
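To make the structure of (5.2)-(5.3) concrete, the following is a minimal NumPy sketch of this forward model. It assumes dense matrices H and G and a direct linear solve, which is only practical for small N; the implementation described in Section 5.3 instead applies these operators with FFTs and iterative solvers. The function name is illustrative and not part of the original implementation.

```python
import numpy as np

def forward_scattered_wave(x, H, G, u_in):
    """Minimal sketch of the scalar forward model (5.2)-(5.3).

    x    : (N,) real permittivity contrast samples on Omega
    H    : (n, N) complex discretized Green's function, sensors vs. Omega
    G    : (N, N) complex discretized Green's function on Omega
    u_in : (N,) complex incident field samples on Omega

    Returns the noise-free scattered wave y = H diag(u) x, where the
    total field u solves (I - G diag(x)) u = u_in.
    """
    N = x.size
    L = np.eye(N) - G * x[np.newaxis, :]   # L = I - G diag(x)
    u = np.linalg.solve(L, u_in)           # total field u = L^{-1} u_in
    return H @ (u * x)                     # H diag(u) x
```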

5.2.2 Vectorial Field Setting

The measurement scenario for the 3D vectorial field case follows that in [44] and is illustrated

in Figure 5.2. The fundamental object-wave relationship for electromagnetic waves is governed by

Maxwell's equations. For a time-harmonic electromagnetic field and under the Silver-Muller radiation

condition, it can be shown that the solution to Maxwell's equations is equivalent to that


of the following integral equation [28]:

\[
\vec{E}(\mathbf{r}) = \vec{E}^{\mathrm{in}}(\mathbf{r}) + \left(k^2 \mathrm{I} + \nabla\nabla\cdot\right) \int_{\Omega} g(\mathbf{r}-\mathbf{r}')\, \chi(\mathbf{r}')\, \vec{E}(\mathbf{r}')\, \mathrm{d}\mathbf{r}', \tag{5.5}
\]

which holds for all r ∈ R3. In (5.5), ~E(r) ∈ C3 is the electric field at spatial location r, which

is the sum of the incident field ~Ein(r) ∈ C3 and the scattered field ~Esc(r) ∈ C3. Same as in

Section 5.2.1, χ(r) is the permittivity contrast and k is the wavenumber in vacuum. The 3D

free-space scalar Green’s function g(r) is defined in (5.1).

As illustrated in Figure 5.2, the measurement in our problem is the scattered field measured

in the sensor region Γ,

\[
\vec{E}^{\mathrm{sc}}(\mathbf{r}) = \left(k^2 \mathrm{I} + \nabla\nabla\cdot\right) \int_{\Omega} g(\mathbf{r}-\mathbf{r}')\, \chi(\mathbf{r}')\, \vec{E}(\mathbf{r}')\, \mathrm{d}\mathbf{r}', \qquad \forall\, \mathbf{r} \in \Gamma, \tag{5.6}
\]

where the total field ~E in (5.6) is obtained by evaluating (5.5) at r ∈ Ω. In the experimental

setup considered in this work, Γ ∩ Ω = ∅, hence Green’s function is non-singular within the

integral region in (5.6). Therefore, we can conveniently move the gradient-divergence operator

∇∇· inside the integral. Then (5.6) becomes

\[
\vec{E}^{\mathrm{sc}}(\mathbf{r}) = \int_{\Omega} \mathbf{G}(\mathbf{r}-\mathbf{r}')\, \chi(\mathbf{r}')\, \vec{E}(\mathbf{r}')\, \mathrm{d}\mathbf{r}', \qquad \forall\, \mathbf{r} \in \Gamma, \tag{5.7}
\]

where G(r− r′) = (k2I +∇∇)g(r− r′) is the dyadic Green’s function in free-space, which has

an explicit form:

\[
\mathbf{G}(\mathbf{r}-\mathbf{r}') = k^2 \left( \left( \frac{3}{k^2 d^2} - \frac{3j}{k d} - 1 \right) \left( \frac{\mathbf{r}-\mathbf{r}'}{d} \otimes \frac{\mathbf{r}-\mathbf{r}'}{d} \right) + \left( 1 + \frac{j}{k d} - \frac{1}{k^2 d^2} \right) \mathrm{I} \right) g(\mathbf{r}-\mathbf{r}'), \tag{5.8}
\]

where ⊗ denotes the Kronecker product, d = ‖r− r′‖, and I is the unit dyadic.
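As an illustration, the following is a minimal NumPy sketch that evaluates the dyadic Green's function (5.8) at a single pair of points. It is not taken from the original implementation, and it assumes $k_b = k$ (so the same wavenumber appears both in the coefficients and in the scalar Green's function).

```python
import numpy as np

def dyadic_green(r, r_prime, k):
    """Minimal sketch of the free-space dyadic Green's function (5.8).

    r, r_prime : length-3 arrays (observation and source points, r != r_prime)
    k          : wavenumber (assuming k_b = k)

    Returns a 3x3 complex matrix G(r - r_prime).
    """
    diff = np.asarray(r, dtype=float) - np.asarray(r_prime, dtype=float)
    d = np.linalg.norm(diff)
    rhat = diff / d
    g = np.exp(1j * k * d) / (4.0 * np.pi * d)        # scalar Green's function, d = 3 case of (5.1)
    c1 = 3.0 / (k**2 * d**2) - 3.0j / (k * d) - 1.0   # coefficient of the rank-one (rhat x rhat) term
    c2 = 1.0 + 1.0j / (k * d) - 1.0 / (k**2 * d**2)   # coefficient of the identity term
    return k**2 * (c1 * np.outer(rhat, rhat) + c2 * np.eye(3)) * g
```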

To obtain a discrete system for (5.5) and (5.7), we define the image domain Ω as a cube and

uniformly sample Ω on a rectangular grid with sampling step ∆ in all three dimensions. Let the

center of Ω be the origin and let rl,s,t := ((l−M/2−0.5)∆, (s−M/2−0.5)∆, (t−M/2−0.5)∆)

for l, s, t = 1, . . . ,M . To simplify the notation, we use a one-to-one mapping (l, s, t) 7→ a and

denote the samples by ra for a = 1, . . . , N , where N = M3. Assume that n detectors are placed

at rb for b = 1, . . . , n. Note that the detectors do not have to be placed on a regular grid,

because the dyadic Green’s function (5.8) can be evaluated at any spatial points, whereas the


rectangular grid on Ω is important in order for discrete differentiation to be easily defined.

Using these definitions, the discrete system that corresponds to (5.7) and (5.5) with r ∈ Ω can

then be written as

\[
\vec{E}^{\mathrm{sc}}(\mathbf{r}_b) = \Delta^3 \sum_{a=1}^{N} \mathbf{G}(\mathbf{r}_b - \mathbf{r}_a)\, \chi(\mathbf{r}_a)\, \vec{E}(\mathbf{r}_a), \tag{5.9}
\]
\[
\vec{E}(\mathbf{r}_a) = \vec{E}^{\mathrm{in}}(\mathbf{r}_a) + \left(k^2 + \nabla\nabla\cdot\right) \vec{B}(\mathbf{r}_a), \tag{5.10}
\]
where $a = 1, \ldots, N$, $b = 1, \ldots, n$, and
\[
\vec{B}(\mathbf{r}_a) = \Delta^3 \sum_{c=1}^{N} g(\mathbf{r}_a - \mathbf{r}_c)\, \chi(\mathbf{r}_c)\, \vec{E}(\mathbf{r}_c), \qquad a = 1, \ldots, N.
\]

Let us organize $\vec{E}_a$ and $\vec{E}^{\mathrm{in}}_a$ for $a = 1, \ldots, N$ into column vectors as follows:
\[
\mathbf{u} = \begin{bmatrix} \mathbf{E}^{(1)} \\ \mathbf{E}^{(2)} \\ \mathbf{E}^{(3)} \end{bmatrix}, \qquad
\mathbf{u}^{\mathrm{in}} = \begin{bmatrix} \mathbf{E}^{\mathrm{in},(1)} \\ \mathbf{E}^{\mathrm{in},(2)} \\ \mathbf{E}^{\mathrm{in},(3)} \end{bmatrix}, \tag{5.11}
\]

where $\mathbf{E}^{(i)} \in \mathbb{C}^N$ is a column vector whose $a$th coordinate is $E^{(i)}_a$; similar notation applies to

$\mathbf{E}^{\mathrm{in},(i)} \in \mathbb{C}^N$. Then the discretized inverse scattering problem is defined as follows:

\[
\mathbf{y} = \mathbf{H} \left( \mathrm{I}_3 \otimes \mathrm{diag}(\mathbf{x}) \right) \mathbf{u} + \mathbf{w}, \tag{5.12}
\]
\[
\mathbf{u} = \mathbf{u}^{\mathrm{in}} + \left(k^2 \mathrm{I} + \mathbf{D}\right) \left( \mathrm{I}_3 \otimes \left( \mathbf{G}\, \mathrm{diag}(\mathbf{x}) \right) \right) \mathbf{u}, \tag{5.13}
\]

where Ip is a p × p identity matrix and we drop the subscript p if the dimension is clear from

the context, x ∈ RN is the vectorized permittivity contrast distribution, which is assumed to be

real in this work, D ∈ R3N×3N is the matrix representation of the gradient-divergence operator

∇∇·, y ∈ Cn is the scattered wave measurement vector with measurement noise w ∈ Cn, and

G ∈ CN×N is the matrix representation of the convolution operator induced by the 3D free

space Green’s function (5.1). Note that according to the polarization of the receiver antennas in

the experimental setup we consider in this work, the ideal noise-free measurements $y_b - w_b$ are

$E^{\mathrm{sc},(3)}_b$, for $b = 1, \ldots, n$, which is the scattered wave $\vec{E}^{\mathrm{sc}}$ along the z-dimension measured at n

different locations in the sensor region Γ. Therefore, H ∈ Cn×3N is the matrix representation of


the convolution operator induced by the third row of the dyadic Green’s function (5.8). Define

\[
\mathbf{L} := \mathrm{I} - \left(k^2 \mathrm{I} + \mathbf{D}\right) \left( \mathrm{I}_3 \otimes \left( \mathbf{G}\, \mathrm{diag}(\mathbf{x}) \right) \right). \tag{5.14}
\]

Similar to the scalar field case, we can see that the measurement vector y is nonlinear in the

unknown x, because u depends on x according to u = L−1uin, which follows from (5.13).

Next, we define the discrete gradient-divergence operator ∇∇·, for which the matrix repre-

sentation is D in (5.13). For a scalar function $f$ of the spatial location $\mathbf{r}_{l,s,t}$, denote $f(\mathbf{r}_{l,s,t})$ by

$f_{l,s,t}$. Following the finite difference rule,
\[
\frac{\partial^2}{\partial x^2} f_{l,s,t} := \frac{f_{l-1,s,t} - 2 f_{l,s,t} + f_{l+1,s,t}}{\Delta^2}, \qquad
\frac{\partial^2}{\partial x \partial y} f_{l,s,t} := \frac{f_{l-1,s-1,t} - f_{l-1,s+1,t}}{4\Delta^2} + \frac{f_{l+1,s+1,t} - f_{l+1,s-1,t}}{4\Delta^2}.
\]

The definitions of $\frac{\partial^2}{\partial x \partial z}$ and $\frac{\partial^2}{\partial y \partial z}$ are similar to that of $\frac{\partial^2}{\partial x \partial y}$. Moreover, for a vector $\vec{E}_{l,s,t} \in \mathbb{C}^3$,

we use $E^{(i)}_{l,s,t}$ to denote its $i$th coordinate. Then we have that the first coordinate of $\vec{E}_{l,s,t}$ is
\begin{align*}
E^{(1)}_{l,s,t} &= E^{\mathrm{in},(1)}_{l,s,t} + k^2 B^{(1)}_{l,s,t} + \frac{\partial}{\partial x_1} \left( \frac{\partial}{\partial x_1} B^{(1)}_{l,s,t} + \frac{\partial}{\partial x_2} B^{(2)}_{l,s,t} + \frac{\partial}{\partial x_3} B^{(3)}_{l,s,t} \right) \\
&= E^{\mathrm{in},(1)}_{l,s,t} + k^2 B^{(1)}_{l,s,t} + \frac{B^{(1)}_{l+1,s,t} - 2 B^{(1)}_{l,s,t} + B^{(1)}_{l-1,s,t}}{\Delta^2} \\
&\quad + \frac{B^{(2)}_{l-1,s-1,t} - B^{(2)}_{l-1,s+1,t} - B^{(2)}_{l+1,s-1,t} + B^{(2)}_{l+1,s+1,t}}{4\Delta^2} \\
&\quad + \frac{B^{(3)}_{l-1,s,t-1} - B^{(3)}_{l-1,s,t+1} - B^{(3)}_{l+1,s,t-1} + B^{(3)}_{l+1,s,t+1}}{4\Delta^2}.
\end{align*}

The second and third coordinates of $\vec{E}_{l,s,t}$ can be obtained in a similar way.
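To make the stencil concrete, the following is a minimal NumPy sketch of the second-derivative part of the first coordinate above. It is not part of the original implementation; the zero padding at the boundary of Ω is an assumption made here purely for illustration.

```python
import numpy as np

def grad_div_first_component(B, delta):
    """Sketch of the finite-difference (nabla nabla .) stencil for the first component.

    B     : complex array of shape (3, M, M, M); B[i] holds B^{(i)} on the grid
    delta : grid spacing

    Returns d/dx1 (d/dx1 B1 + d/dx2 B2 + d/dx3 B3) on the M x M x M grid,
    with zero (Dirichlet) padding outside the image domain.
    """
    Bp = np.pad(B, ((0, 0), (1, 1), (1, 1), (1, 1)))   # zero-pad each component
    B1, B2, B3 = Bp[0], Bp[1], Bp[2]
    c = slice(1, -1)                                    # interior indices

    # d^2/dx1^2 B1 : (B1[l+1] - 2 B1[l] + B1[l-1]) / delta^2
    d11 = (B1[2:, c, c] - 2.0 * B1[c, c, c] + B1[:-2, c, c]) / delta**2
    # d^2/(dx1 dx2) B2 : cross stencil with weights 1/(4 delta^2)
    d12 = (B2[:-2, :-2, c] - B2[:-2, 2:, c]
           - B2[2:, :-2, c] + B2[2:, 2:, c]) / (4.0 * delta**2)
    # d^2/(dx1 dx3) B3 : same cross stencil in the (x1, x3) plane
    d13 = (B3[:-2, c, :-2] - B3[:-2, c, 2:]
           - B3[2:, c, :-2] + B3[2:, c, 2:]) / (4.0 * delta**2)
    return d11 + d12 + d13
```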

5.2.3 Nonconvex Optimization Formulation

To estimate x from the nonlinear inverse problem (5.2) and (5.3), we consider the following

nonconvex optimization formulation. Define

A(x) := Hdiag(u)x, (5.15)

which is the (clean) scattered wave from the object with permittivity contrast x. Moreover, let

B ⊂ RN be a bounded convex set, which contains all possible values that x may take. Then x is


estimated by minimizing the following composite cost function with a nonconvex data-fidelity

term D(x) and a convex regularization term R(x):

\[
x^* = \arg\min_{x \in \mathbb{R}^N} \left\{ F(x) := D(x) + R(x) \right\}, \tag{5.16}
\]
where
\[
D(x) = \frac{1}{2} \left\| y - A(x) \right\|_2^2, \tag{5.17}
\]
\[
R(x) = \tau \sum_{a=1}^{N} \sqrt{ \sum_{i=1}^{2} \left| [Q_i x]_a \right|^2 } + C_B(x). \tag{5.18}
\]

In (5.18), Qi is the discrete gradient operator in the ith dimension, hence the first term in R(x)

is the total variation (TV) [102] cost, and the parameter τ > 0 controls the contribution of the

TV cost to the total cost. The second term CB(·) is defined as

\[
C_B(x) :=
\begin{cases}
0, & \text{if } x \in B,\\
\infty, & \text{if } x \notin B.
\end{cases}
\]

Note that D(·) is differentiable if L is non-singular, and R(·) is proper, convex, and closed (lower semi-continuous) if B is convex and closed. (A function is said to be proper if it never attains −∞.)

Similarly, in the vectorial field setting, to estimate x from (5.12) and (5.13), we only need

to change the nonlinear operator A in the data-fidelity term (5.17) to be

A(x) := H (I3 ⊗ diag(x)) u. (5.19)

5.3 Proposed Method

As mentioned before, for a composite cost function comprised of a smooth term and a convex

nonsmooth term such as (5.16), the class of proximal gradient methods, including ISTA [10,

32, 43] and FISTA [8], can be applied. Our image reconstruction method CISOR is based on

the class of proximal gradient methods, where we provide an explicit formula for the gradient

of the data-fidelity term (5.17) and a fast, memory-efficient way to evaluate the gradient.

We notice that ISTA is empirically slow and FISTA has only been proved to converge for

convex problems. A variant of FISTA has been proposed in [71] for nonconvex optimization


with convergence guarantees. This algorithm computes two estimates from ISTA and FISTA,

respectively, at each iteration, and selects the one with lower objective function value as the final

estimate at the current iteration. Therefore, both the gradient and the objective function need

to be evaluated at two different points at each iteration. While such extra computation may be

insignificant in some applications, it can be prohibitive in the inverse scattering problem, where

additional evaluations of the gradient and the objective function require the computation of

the entire forward model.

In the following, we propose a simple modification to FISTA, called relaxed FISTA. The

convergence analysis of relaxed FISTA as the solver in CISOR is presented in Appendix D.3.2.

Note that one may choose to use other proximal gradient methods with convergence guarantees

for nonsmooth and nonconvex problems in CISOR. We use relaxed FISTA in our implementation

because it is simple and yields good empirical performance, as evidenced in Section 5.4.

Before introducing our algorithm, we include a definition of the proximal operator.

Definition 5.3.1 (proximal operator). Let f : Rm → (−∞,∞] be convex and lower semi-

continuous, and let the scalar γ be strictly positive. The proximal operator of f with parameter

γ, written as Proxγf , is defined as

\[
\mathrm{Prox}_{\gamma f}(y) = \arg\min_{x \in \mathbb{R}^m} \left\{ f(x) + \frac{1}{2\gamma} \left\| y - x \right\|^2 \right\}, \qquad \forall\, y \in \mathbb{R}^m. \tag{5.20}
\]
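Two textbook instances of (5.20), shown here only for illustration (the proximal mapping actually used in CISOR, for the constrained TV regularizer (5.18), is computed by TV-FISTA, as discussed later in this section):

```python
import numpy as np

def prox_l1(y, gamma, tau):
    """Prox of f(x) = tau * ||x||_1 with parameter gamma: elementwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - gamma * tau, 0.0)

def prox_box(y, lo, hi):
    """Prox of the indicator of the box B = [lo, hi]^N: Euclidean projection onto the box."""
    return np.clip(y, lo, hi)
```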

We now introduce our new variant of FISTA. Starting with some initialization x0 ∈ RN and

setting s1 = x0, θ0 = 1, α ∈ [0, 1), for t ≥ 1, the proposed algorithm proceeds as follows:

\begin{align}
x_t &= \mathrm{Prox}_{\gamma R}\!\left( s_t - \gamma \nabla D(s_t) \right), \tag{5.21}\\
\theta_{t+1} &= \frac{\sqrt{4\theta_t^2 + 1} + 1}{2}, \tag{5.22}\\
s_{t+1} &= x_t + \alpha \left( \frac{\theta_t - 1}{\theta_{t+1}} \right) (x_t - x_{t-1}), \tag{5.23}
\end{align}

where the choice of the step-size γ to ensure convergence will be discussed in Appendix D.3.2.

Notice that the algorithm (5.21)-(5.23) is equivalent to ISTA when α = 0 and is equivalent to

FISTA when α = 1. For this reason, we call our proposed algorithm relaxed FISTA. Figure

5.3 shows that the empirical convergence speed of relaxed FISTA improves as α increases from

0 to 1. The plot was obtained using the experimentally measured scattered microwave data

collected by the Fresnel Institute [45]. Our theoretical analysis of relaxed FISTA in Appendix

D.3.2 establishes convergence for any α ∈ [0, 1) with appropriate choice of the step-size γ.
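For concreteness, a minimal sketch of the iteration (5.21)-(5.23) is given below. The function handles grad_D and prox_R are hypothetical placeholders standing in for the gradient of (5.17) and the proximal mapping of (5.18), and the loop omits the stopping criteria used in practice.

```python
import numpy as np

def relaxed_fista(x0, grad_D, prox_R, gamma, alpha=0.96, num_iter=500):
    """Minimal sketch of the relaxed FISTA iteration (5.21)-(5.23).

    x0      : initial estimate
    grad_D  : callable returning the gradient of the data-fidelity term D
    prox_R  : callable computing Prox_{gamma R} (e.g., constrained TV via TV-FISTA)
    gamma   : step size (chosen as discussed in Appendix D.3.2)
    alpha   : relaxation parameter in [0, 1); alpha = 0 gives ISTA, alpha = 1 gives FISTA
    """
    x_prev = x0
    s = x0.copy()
    theta = 1.0
    for _ in range(num_iter):
        x = prox_R(s - gamma * grad_D(s))                              # (5.21)
        theta_next = (np.sqrt(4.0 * theta**2 + 1.0) + 1.0) / 2.0       # (5.22)
        s = x + alpha * ((theta - 1.0) / theta_next) * (x - x_prev)    # (5.23)
        x_prev, theta = x, theta_next
    return x_prev
```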


[Figure 5.3 plot: cost function value (log scale) versus iteration, for α = 0 (ISTA), α = 0.88, α = 0.96, and α = 1 (FISTA).]

Figure 5.3 Empirical convergence speed for relaxed FISTA with various α values tested on experimentally measured data.

The two main elements of relaxed FISTA are the computation of the gradient ∇D of the

data-fidelity term D defined in (5.17) and the proximal mapping ProxγR (5.21). Given ∇D(st),

the proximal mapping for constrained TV (5.18) can be efficiently solved by TV-FISTA [9]. The

following two propositions provide explicit formulas for ∇D in the scalar field and vectorial field

cases, respectively, which enable fast and memory-efficient computation of ∇D. This method is

also known as the adjoint state method [95].

Proposition 5.3.1. Let the nonlinear operator A be defined in (5.15) and the matrix L be

defined in (5.4). Define $z := A(x) - y$. Then the gradient of $D$ defined in (5.17) can be written as
\[
\nabla D(x) = \mathrm{Re}\left\{ \mathrm{diag}(u)^{\mathsf{H}} \left( H^{\mathsf{H}} z + G^{\mathsf{H}} v \right) \right\}, \tag{5.24}
\]
where $u$ and $v$ are obtained from the linear systems
\[
L u = u^{\mathrm{in}}, \quad \text{and} \quad L^{\mathsf{H}} v = \mathrm{diag}(x)\, H^{\mathsf{H}} z. \tag{5.25}
\]

Proof. See Appendix D.1.

Proposition 5.3.2. Let the nonlinear operator A be defined in (5.19) and the matrix L be

defined in (5.14). Define z := A(x)− y, and

\[
g = \mathrm{diag}(\mathbf{u})^{\mathsf{H}} \left( \mathbf{H}^{\mathsf{H}} z + \left( \mathrm{I}_3 \otimes \mathbf{G}^{\mathsf{H}} \right) \left( k^2 \mathrm{I} + \mathbf{D}^{\mathsf{H}} \right) v \right), \tag{5.26}
\]

where u and v are obtained from the linear systems

\[
\mathbf{L} \mathbf{u} = \mathbf{u}^{\mathrm{in}}, \quad \text{and} \quad \mathbf{L}^{\mathsf{H}} v = \left( \mathrm{I}_3 \otimes \mathrm{diag}(\mathbf{x}) \right) \mathbf{H}^{\mathsf{H}} z. \tag{5.27}
\]


Then the gradient of $D$ defined in (5.17) can be written as
\[
\nabla D(\mathbf{x}) = \mathrm{Re}\left\{ \sum_{i=1}^{3} g^{(i)} \right\}, \tag{5.28}
\]
where $g^{(i)} = (g_{(i-1)N+1}, \ldots, g_{iN}) \in \mathbb{C}^N$ for $i = 1, 2, 3$.

Proof. See Appendix D.2.

Note that in the above, $u$ and $v$ in the scalar case, as well as $\mathbf{u}$ and $v$ in the vectorial case, can be

computed efficiently with the conjugate gradient method. In our implementation, $L$ and $\mathbf{L}$ are applied as operators

rather than formed as explicit matrices, and the convolution with Green's function inside these operators is computed using the fast

Fourier transform (FFT) algorithm.
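As an illustration of Proposition 5.3.1, the following dense NumPy sketch evaluates ∇D(x) for the scalar case via (5.24)-(5.25). It uses direct solves in place of the operator-based FFT and conjugate-gradient implementation described above, so it is only meant to expose the structure of the adjoint-state computation; the function name and dense-matrix assumption are illustrative and not part of the original implementation.

```python
import numpy as np

def grad_data_fidelity(x, y, H, G, u_in):
    """Dense sketch of the gradient (5.24)-(5.25) for D(x) = 0.5 * ||y - A(x)||^2,
    where A(x) = H diag(u) x and the total field u solves (I - G diag(x)) u = u_in.
    """
    N = x.size
    L = np.eye(N) - G * x[np.newaxis, :]          # L = I - G diag(x)
    u = np.linalg.solve(L, u_in)                  # forward field: L u = u_in
    z = H @ (u * x) - y                           # residual z = A(x) - y
    rhs = x * (H.conj().T @ z)                    # diag(x) H^H z
    v = np.linalg.solve(L.conj().T, rhs)          # adjoint field: L^H v = diag(x) H^H z
    return np.real(np.conj(u) * (H.conj().T @ z + G.conj().T @ v))   # (5.24)
```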

5.4 Experimental Results

We now compare our method CISOR to several state-of-the-art methods, including iterative

linearization (IL) [11, 24], contrast source inversion (CSI) [1, 13, 120], and SEAGLE [76], as

well as a conventional linear method, the first Born approximation (FB) [16]. All algorithms use

additive total variation regularization. In our implementation, CSI uses Polak-Ribiere conjugate

gradient. CISOR uses the relaxed FISTA defined in Section 5.3 with α = 0.96 and fixed step-size

γ, which is manually tuned. The other methods use the standard FISTA [8], also with manually

tuned and fixed step-sizes.

Comparison on simulated data. In this experiment, the wavelength of the incident

wave is 7.49 cm. Define the contrast of an object with permittivity contrast distribution x as

max(|x|). We consider the Shepp-Logan phantom and change its contrast to the desired value

to obtain the ground-truth xtrue. We then solve the Lippmann-Schwinger equation to generate

the scattered waves that are then used as measurements. The center of the image is the origin

and the physical size of the image is set to 120 cm × 120 cm. Two linear detectors are placed

on two opposite sides of the image at a distance of 95.9 cm from the origin. Each detector has

169 sensors with a spacing of 3.84 cm. The transmitters are placed on a line 48.0 cm to the left

of the left detector, and they are spaced uniformly in azimuth w.r.t. the origin within a range of

[−60°, 60°] at every 5°. The reconstructed SNR, which is defined as $20 \log_{10}\left(\|x_{\mathrm{true}}\| / \|\hat{x} - x_{\mathrm{true}}\|\right)$, is used as the comparison criterion. The size of the reconstructed images is 128 × 128 pixels. For

each contrast value and each algorithm, we run the algorithm with five different regularization

parameter values and select the result that yields the highest reconstructed SNR.


[Figure 5.4 plot: SNR (dB) versus contrast value, for CISOR, CSI, IL, and FB.]

Figure 5.4 Comparison of different reconstruction methods for various contrast levels tested on simulated data.

Figure 5.4 shows that as the contrast increases, the reconstructed SNR of FB and IL de-

creases, whereas that of CSI and CISOR is more stable. While it is possible to further improve

the reconstructed SNR for CSI by running more iterations, CSI is known to be slow, as shown by the

comparisons in [89]. Figure 5.5 provides a visual comparison between FB, IL, and CISOR

for this simulated experimental setup.

Comparison on experimental data. We test our method in both 2D and 3D settings

using the public dataset provided by the Fresnel Institute [44, 45]. Two objects for the 2D

setting, FoamDielExtTM and FoamDielintTM, and two objects for the 3D setting, TwoSpheres

and TwoCubes, are considered.

In the 2D setting, the objects are placed within a 150 mm × 150 mm square region centered

at the origin of the coordinate system. The number of transmitters is 8 and the number of

receivers is 360 for all objects. The transmitters and the receivers are placed on a circle centered

at the origin with radius 1.67 m and are spaced uniformly in azimuth. Only one transmitter is

turned on at a time, and only 241 receivers are active for each transmitter. That is, the 119

receivers that are closest to a transmitter are inactive for that transmitter. While the dataset

contains multiple frequency measurements, we only use the ones corresponding to 3 GHz, hence

the wavelength of the incident wave is 99.9 mm. The pixel size of the reconstructed images is

1.2 mm.

In the 3D setting, the transmitters are located on a sphere with radius 1.769 m. The az-

imuthal angle θ ranges from 20° to 340° with a step of 40°, and the polar angle φ ranges from

30° to 150° with a step of 15°. The receivers are only placed on a circle with radius 1.769 m

in the azimuthal plane, with azimuthal angle ranging from 0° to 350° with a step of 10°. Only

the receivers that are more than 50° away from a transmitter are active for that transmitter.

A visual representation of this setup is shown in Figure 5.2. We use the data corresponding to

4 GHz for the TwoSpheres object and 6 GHz for the TwoCubes object, hence the wavelengths


[Figure 5.5 image panels: rows FB, IL, and CISOR; columns correspond to contrast values 0.2%, 8.2%, 16.2%, 24%, and 32.2%.]

Figure 5.5 From top to bottom: Reconstructed images obtained by FB, IL, and CISOR. Each column represents one contrast value as indicated at the bottom of the images on the third row. CISOR is stable for all tested contrast values, whereas FB and IL fail for large contrast.

of the incident waves are 74.9 mm and 50.0 mm, respectively. The pixel size is 4.7 mm for the

TwoSpheres and 3.1 mm for TwoCubes.

Figure 5.6 provides a visual comparison of the reconstructed images obtained by different

algorithms for the 2D data. For each object and each algorithm, we run the algorithm with five

different regularization parameter values and select the result that has the best visual quality.

Figure 5.6 shows that all nonlinear methods CISOR, SEAGLE, IL, and CSI obtained reasonable

reconstruction results in terms of both the contrast value and the shape of the object, whereas

the linear method FB significantly underestimated the contrast value and failed to capture

the shape. These results demonstrate that the proposed method is competitive with several

state-of-the-art methods.

Figures 5.7 and 5.8 present the results for the TwoSpheres and the TwoCubes objects,

respectively. Again, the results show that CISOR is competitive with state-of-the-art methods

for the 3D vectorial setting as well.

5.5 Conclusion

In this chapter, we proposed an image reconstruction algorithm called CISOR for nonlinear

diffractive imaging in 2D and 3D scalar field settings, as well as the 3D vectorial field setting.


[Figure 5.6 image panels: ground truth and reconstructions by CISOR, SEAGLE, IL, CSI, and FB for the two 2D objects.]

Figure 5.6 Images reconstructed by different algorithms from experimentally measured data for 2D objects. The first and second rows use the FoamDielExtTM and the FoamDielIntTM objects, respectively. From left to right: ground truth, images reconstructed by CISOR, SEAGLE, IL, CSI, and FB. The color-map for FB is different from the rest, because FB significantly underestimated the contrast value. The size of the reconstructed images is 128 × 128 pixels.

[Figure 5.7 image panels for CISOR, IL, and CSI: slices at z = 0 mm, y = 0 mm, and x = −25 mm (axes x, y, z).]

Figure 5.7 Images reconstructed by CISOR, IL, and CSI from experimentally measured data for the TwoSpheres object. From left to right: ground truth, image slices reconstructed by CISOR, IL, and CSI, and the reconstructed contrast distribution along the dashed lines shown in the image slices in the first column. From top to bottom: image slices parallel to the x-y plane with z = 0 mm, parallel to the x-z plane with y = 0 mm, and parallel to the y-z plane with x = −25 mm. The reconstructed volumes are 32 × 32 × 32 pixels for a 150 × 150 × 150 mm cube centered at (0, 0, 0).


[Figure 5.8 image panels for CISOR, IL, and CSI: slices at z = 33 mm, y = −17 mm, and x = 17 mm (axes x, y, z).]

Figure 5.8 Images reconstructed by CISOR, IL, and CSI from experimentally measured data for the TwoCubes object. From left to right: ground truth, image slices reconstructed by CISOR, IL, and CSI, and the reconstructed contrast distribution along the dashed lines shown in the image slices in the first column. From top to bottom: image slices parallel to the x-y plane with z = 33 mm, parallel to the x-z plane with y = −17 mm, and parallel to the y-z plane with x = 17 mm. The reconstructed volumes are 32 × 32 × 32 pixels for a 100 × 100 × 100 mm cube centered at (0, 0, 50) mm.


The nonlinearity considered in this work is due to the fundamental relationship between the

scattered wave and the permittivity of an object, rather than that introduced by limitations

of the sensing system such as missing phase; our measurements have full information about

amplitude and phase, though they may be contaminated by measurement noise.

CISOR estimates the permittivity contrast by iteratively minimizing a composite cost func-

tion that consists of a smooth quadratic data-fidelity term and a nonsmooth total variation

regularization term using the class of proximal gradient methods. An explicit formula for the

gradient of the smooth term w.r.t. the permittivity contrast was provided, which enables fast

and memory-efficient computation of the gradient while precisely modeling the fundamental

nonlinearity.

The data-fidelity term in our cost function is nonconvex due to the nonlinearity. To reliably

solve this nonconvex problem, we proposed a relaxed variant of FISTA, which is a fast convex

solver whose convergence has been well-understood for convex problems, and provided conver-

gence guarantees for our relaxed FISTA for general composite cost functions comprised of a

smooth but nonconvex term with Lipschitz gradient and a convex but nonsmooth term whose

proximal mapping is easily computed.

Numerical results demonstrated that CISOR is competitive with several state-of-the-art

methods. Two key advantages of CISOR over other methods are in its memory efficiency and

convergence guarantees.


Chapter 6

Discussion

This dissertation studied several linear and nonlinear inverse problems in computational sensing.

In the following, we summarize our key results and discuss future work.

Extensions of approximate message passing (AMP): We explored the potential of the AMP

algorithmic framework for reconstructing signals that may have dependencies among entries. In

particular, we studied the state evolution analysis as well as the empirical performance of AMP

when non-separable denoisers are applied. Chapter 2 rigorously justified the state evolution

analysis when the measurement matrix has i.i.d. Gaussian entries, the measurement noise is

i.i.d. sub-Gaussian, the input signal is a Markov random field defined on a rectangular lattice,

and the denoisers are Lipschitz non-separable sliding-window denoisers (see Theorem 2.2.1).

For the special case of 1D signals with Markov chain priors, we provided an alternative state

evolution analysis (see Theorem 2.5.1) with the definition of the state evolution sequences being

similar in form to the ones in the original analysis of AMP, where the input signal is assumed

to have i.i.d. sub-Gaussian entries and the denoisers are separable.

State evolution analysis of AMP implies that the argument of the denoiser at each iteration

of the algorithm is close in distribution to the input signal plus i.i.d. Gaussian noise. Inspired by

this property, we designed AMP-based algorithms for universal compressed sensing (imaging)

and hyperspectral imaging in Chapter 3. Promising numerical results are strong motivations

for further study of AMP.

Current state evolution analyses of AMP rely heavily on the properties of i.i.d. Gaussian

measurement matrices. When the AMP algorithm is applied to linear inverse problems with

other matrix ensembles, AMP has been shown empirically to diverge in some cases. As for

future work, we would like to design new message passing and approximate message passing

algorithms and analyze their performance by combining ideas from the advanced developments


in mean-field theory and mathematical optimization, since both areas are closely related to

the current AMP algorithm and hence are likely to provide insights for new algorithm design and

analysis.

Fast computation for large-scale problems: A state evolution analysis was provided

for a multiprocessor implementation of AMP, where subsets of columns of the measurement

matrix are stored in different processors (see Theorem 4.1.1); we called this implementation

column-wise multiprocessor AMP (C-MP-AMP). We proved that the state evolution sequences

for C-MP-AMP converge to a state where the mean squared error is at least as good as that

of AMP. In addition to speeding up the computation, C-MP-AMP may also be useful when

different processors are not allowed to share their estimates due to possible privacy concerns,

because the vectors exchanged between processors are linearly mixed versions of the estimates

in C-MP-AMP.

For the scenario where the input signal has a large number of zero-valued entries, we pro-

posed a two-part reconstruction framework, which detects the zero-valued entries using a sparse

measurement matrix and a fast algorithm in Part 1, and then in Part 2, the remaining entries

are reconstructed using a high-fidelity algorithm and a dense measurement matrix. The Noisy-

Sudocodes algorithm was proposed as an example of the two-part framework, where we designed

a simple and fast partial support recovery algorithm for Part 1. A trade-off analysis of speed

and reconstruction quality was provided for Noisy-Sudocodes when AMP is applied in Part

2. Moreover, empirical results demonstrated that with the binary iterative hard thresholding

(BIHT) algorithm in Part 2, Noisy-Sudocodes achieved promising reconstruction results for

1-bit compressed sensing problems.

Reliable nonlinear diffractive imaging: To accurately characterize the relationship be-

tween the permittivity contrast of an object and the scattered wave measurements, nonlinear

forward models need to be used for strongly scattering objects. Image reconstruction using a

nonlinear forward model is challenging, because even a quadratic data-fidelity term is a noncon-

vex function of the permittivity contrast. Our proposed method, convergent inverse scattering

via optimization and regularization (CISOR), uses the class of proximal gradient methods to

minimize a composite cost function that consists of the nonconvex data-fidelity term and a total

variation regularization term. In order to achieve reliable reconstruction, we proposed a relaxed

variant of fast iterative shrinkage/thresholding algorithm (FISTA), and provided its convergence

guarantees for the class of nonconvex and nonsmooth optimization problems (see Proposition

D.3.1); we called our algorithm relaxed FISTA. Explicit formulas were provided for the gradi-

ent of the data-fidelity term w.r.t. the permittivity contrast in both scalar and vectorial field

settings (see Proposition 5.3.1 and Proposition 5.3.2), which enable fast and memory-efficient


computation of the gradient at each iteration of relaxed FISTA. Using the explicit formula for

the gradient, we showed that the gradient is Lipschitz (see Proposition D.3.2). Combining the

Lipschitz property and the convergence guarantee established for relaxed FISTA, we obtained

the convergence analysis of CISOR. Empirical results on both simulated and experimentally

measured data showed that CISOR is competitive with several state-of-the-art methods in terms

of reconstruction quality, while enjoying the advantages of memory-efficiency and convergence

guarantees.

As for future work, we would like to study the convergence rate of relaxed FISTA in both

convex and nonconvex settings, as well as the convergence of its iterates. Moreover, we would like

to improve our implementation of CISOR and find real-world imaging systems where CISOR

may be suitable.


BIBLIOGRAPHY

[1] A. Abubakar, P. M. van den Berg, and T. M. Habashy. “Application of the multiplicative regularized contrast source inversion method on TM- and TE-polarized experimental Fresnel data”. In: Inv. Probl. 21.6 (2005), S5–S14.

[2] D. Applebaum et al. Quantum Independent Increment Processes I: From Classical Prob-ability to Quantum Stochastic Calculus (Lecture Notes in Mathematics). New York, NY,USA: Springer, 2005.

[3] H. Arguello and G. Arce. “Colored coded aperture design by concentration of measure in compressive spectral imaging”. In: IEEE Trans. Image Process. 23.4 (2014), pp. 1896–1908.

[4] H. Arguello et al. “Higher-order computational model for coded aperture spectral imag-ing”. In: Appl. Optics 52.10 (2013), pp. D12–D21.

[5] D. Baron and M. F. Duarte. “Universal MAP estimation in compressed sensing”. In:Proc. Allerton Conf. Commun., Control, and Comput. 2011, pp. 768–775.

[6] A. Barron, J. Rissanen, and B. Yu. “The minimum description length principle in coding and modeling”. In: IEEE Trans. Inf. Theory 44.6 (1998), pp. 2743–2760.

[7] M. Bayati and A. Montanari. “The dynamics of message passing on dense graphs, with applications to compressed sensing”. In: IEEE Trans. Inf. Theory 57.2 (2011), pp. 764–785.

[8] A. Beck and M. Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems”. In: SIAM J. Imaging Sciences 2.1 (2009), pp. 183–202.

[9] A. Beck and M. Teboulle. “Fast Gradient-Based Algorithm for Constrained Total Varia-tion Image Denoising and Deblurring Problems”. In: IEEE Trans. Image Process. 18.11(2009), pp. 2419–2434.

[10] J. Bect et al. “A `1-unified variational framework for image restoration”. In: Proc. ECCV.Ed. by Springer. Vol. 3024. New York, 2004, pp. 1–13.

[11] K. Belkebir, P. C. Chaumet, and A. Sentenac. “Superresolution in total internal reflection tomography”. In: J. Opt. Soc. Am. A 22.9 (2005), pp. 1889–1897.

[12] K. Belkebir and A. Sentenac. “High-resolution optical diffraction microscopy”. In: J.Opt. Soc. Am. A 20.7 (2003), pp. 1223–1229.

[13] M. T. Bevacqua et al. “Non-linear Inverse Scattering via Sparsity Regularized Contrast Source Inversion”. In: IEEE Trans. Comp. Imag. (2017).


[14] J. M. Bioucas-Dias and M. A. T. Figueiredo. “A New TwIST: Two-Step Iterative Shrink-age/Thresholding Algorithms for Image Restoration”. In: IEEE Trans. Image Process.16.12 (2007), pp. 2992–3004.

[15] J. M. Bioucas-Dias and M. A. Figueiredo. “A new TwIST: Two-step iterative shrink-age/thresholding algorithms for image restoration”. In: IEEE Trans. Image Process.16.12 (2007), pp. 2992–3004.

[16] M. Born and E. Wolf. “Principles of Optics”. In: 7th ed. Cambridge Univ. Press, 2003.Chap. Scattering from inhomogeneous media, pp. 695–734.

[17] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotictheory of independence. OUP Oxford, 2013.

[18] P. Boufounos and R. Baraniuk. “1-bit compressive sensing”. In: Proc. 2008 Conf. Inf.Sciences Systems. 2008, pp. 16–21.

[19] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004.

[20] D. J. Brady et al. “Compressive Holography”. In: Opt. Express 17.15 (2009), pp. 13040–13049.

[21] M. M. Bronstein et al. “Reconstruction in Diffraction Ultrasound Tomography UsingNonuniform FFT”. In: IEEE Trans. Med. Imag. 21.11 (2002), pp. 1395–1401.

[22] E. Candes, J. Romberg, and T. Tao. “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information”. In: IEEE Trans. Inf. Theory 52.2 (2006), pp. 489–509.

[23] F. Champagnat, J. Idier, and Y. Goussard. “Stationary Markov random fields on a finiterectangular lattice”. In: IEEE Trans. Inf. Theory 44.7 (1998), pp. 2901–2916.

[24] P. C. Chaumet and K. Belkebir. “Three-dimensional reconstruction from real data usinga conjugate gradient-coupled dipole method”. In: Inv. Probl. 25.2 (2009), p. 024003.

[25] B. Chen and J. J. Stamnes. “Validity of diffraction tomography based on the first Bornand the first Rytov approximations”. In: Appl. Opt. 37.14 (1998), pp. 2996–3006.

[26] W. Chen et al. “Empirical concentration bounds for compressive holographic bubbleimaging based on a Mie scattering model”. In: Opt. E 23.4 (2015), February.

[27] W. Choi et al. “Tomographic Phase Microscopy: Supplementary Material”. In: Nat.Methods 4.9 (2007), pp. 1–18.


[28] D. Colton and R. Kress. Inverse Acoustic and Electromagnetic Scattering Theory. Vol. 93.Springer Science & Business Media, 1992.

[29] T. M. Cover and J. A. Thomas. Elements of Information Theory. New York, NY, USA:Wiley-Interscience, 2006.

[30] G. R. Cross and A. K. Jain. “Markov random field texture models”. In: IEEE Transac-tions on Pattern Analysis and Machine Intelligence 1 (1983), pp. 25–39.

[31] K. Dabov et al. “Image denoising by sparse 3-D transform-domain collaborative filter-ing”. In: IEEE Trans. Image Process. 16.8 (2007), pp. 2080–2095.

[32] I. Daubechies, M. Defrise, and C. D. Mol. “An iterative thresholding algorithm forlinear inverse problems with a sparsity constraint”. In: Commun. Pure Appl. Math.57.11 (2004), pp. 1413–1457.

[33] B. J. Davis et al. “Nonparaxial vector-field modeling of optical coherence tomographyand interferometric synthetic aperture microscopy”. In: J. Opt. Soc. Am. A 24.9 (2007),pp. 2527–2542.

[34] A. J. Devaney. “Inverse-scattering theory within the Rytov approximation”. In: Opt.Lett. 6.8 (1981), pp. 374–376.

[35] D. Donoho. “Compressed sensing”. In: IEEE Trans. Inf. Theory 52.4 (2006), pp. 1289–1306.

[36] D. Donoho, H. Kakavand, and J. Mammen. “The simplest solution to an underdeter-mined system of linear equations”. In: Proc. IEEE Int. Symp. Inf. Theory (ISIT). Seattle,WA, 2006, pp. 1924–1928.

[37] D. L. Donoho. The Kolmogorov sampler. Department of Statistics Technical Report2002-4. Stanford, CA: Stanford University, 2002.

[38] D. L. Donoho. “Compressed sensing”. In: IEEE Trans. Inf. Theory 52.4 (2006), pp. 1289–1306.

[39] D. L. Donoho, A. Maleki, and A. Montanari. “Message-Passing Algorithms for Com-pressed Sensing”. In: Proc. Nat. Acad. Sci. 106.45 (2009), pp. 18914–18919.

[40] R. C. Dubes and A. K. Jain. “Random field models in image analysis”. In: J. Appl.Statist. 16.2 (1989), pp. 131–164.

[41] M. A. T. Figueiredo and A. Jain. “Unsupervised learning of finite mixture models”. In:IEEE Trans. Pattern Anal. Mach. Intell. 24.3 (2002), pp. 381–396.


[42] M. A. T. Figueiredo, R. Nowak, and S. J. Wright. “Gradient projection for sparse re-construction: Application to compressed sensing and other inverse problems”. In: IEEEJ. Sel. Topics Signal Proces. 1.4 (2007), pp. 586–597.

[43] M. A. T. Figueiredo and R. D. Nowak. “An EM Algorithm for Wavelet-Based ImageRestoration”. In: IEEE Trans. Image Process. 12.8 (2003), pp. 906–916.

[44] J.-M. Geffrin and P. Sabouroux. “Continuing with the Fresnel database: experimental setup and improvements in 3D scattering measurements”. In: Inv. Probl. 25.2 (2009), p. 024001.

[45] J.-M. Geffrin, P. Sabouroux, and C. Eyraud. “Free space experimental scattering database continuation: experimental set-up and measurement precision”. In: Inv. Probl. 21.6 (2005), S117–S130.

[46] H. O. Georgii. Gibbs Measures and Phase Transitions. Berlin: de Gruyter, 1998.

[47] J. W. Goodman. Introduction to Fourier Optics. 2nd ed. McGraw-Hill, 1996.

[48] D. Guo, D. Baron, and S. Shamai. “A Single-letter Characterization of Optimal NoisyCompressed Sensing”. In: Proc. Allerton Conf. Commun., Control, and Comput. 2009,pp. 52–59.

[49] D. Guo and S. Verdu. “Randomly spread CDMA: Asymptotics via statistical physics”.In: IEEE Trans. Inf. Theory 51.6 (2005), pp. 1983–2010.

[50] J. Hammersley and P. Clifford. “Markov fields on finite graphs and lattices”. Unpub-lished.

[51] P. Han, R. Niu, and Y. C. Eldar. “Modified distributed iterative hard thresholding”. In:Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP). Brisbane, Australia,2015, pp. 3766–3770.

[52] P. Han et al. “Distributed approximate message passing for sparse signal recovery”. In:Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP). Atlanta, GA, 2014, pp. 497–501.

[53] P. Han et al. “Multi-processor approximate message passing using lossy compression”.In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP). Shanghai, China,2016, pp. 6240–6244.

[54] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning.Springer, 2001.


[55] L. Jacques et al. “Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors”. In: IEEE Trans. Inf. Theory 59 (4 2013).

[56] S. Jalali and A. Maleki. “Minimum complexity pursuit”. In: Proc. Allerton Conf. Com-mun., Control, Comput. 2011, pp. 1764–1770.

[57] S. Jalali, A. Maleki, and R. G. Baraniuk. “Minimum complexity pursuit for universalcompressed sensing”. In: IEEE Trans. Inf. Theory 60.4 (2014), pp. 2253–2268.

[58] A. Javanmard and A. Montanari. “State evolution for general approximate messagepassing algorithms, with applications to spatial coupling”. In: Inf. and Inference 2.2(2013), pp. 115 –144.

[59] H. M. Jol, ed. Ground Penetrating Radar: Theory and Applications. Amsterdam: Else-vier, 2009.

[60] U. S. Kamilov et al. “Learning Approach to Optical Tomography”. In: Optica 2.6 (2015),pp. 517–522.

[61] U. S. Kamilov et al. “A Recursive Born Approach to Nonlinear Inverse Scattering”. In:IEEE Signal Process. Lett. 23.8 (2016), pp. 1052–1056.

[62] U. S. Kamilov et al. “Optical Tomographic Image Reconstruction Based on Beam Prop-agation and Sparse Regularization”. In: IEEE Trans. Comp. Imag. 2.1 (2016), pp. 59–70,

[63] T Kim et al. “Supplementary Information: White-light diffraction tomography of unla-belled live cells”. In: Nat. Photonics 8 (2014), pp. 256–263.

[64] A. Klenke. Probability Theory: A Comprehensive Course. New York, NY, USA: Springer,2013.

[65] F. Krzakala et al. “Probabilistic reconstruction in compressed sensing: Algorithms, phasediagrams, and threshold achieving matrices”. In: J. Stat. Mech. – Theory E. 2012.08(2012), P08009.

[66] C. Kulske. “Concentration inequalities for functions of Gibbs fields with applicationto diffraction and random Gibbs measures”. In: Commun. Math. Phys. 239.1-2 (2003),pp. 29–51.

[67] J. N. Laska et al. “Trust, But Verify: Fast and Accurate Signal Recovery From 1-BitCompressive Measurements”. In: IEEE Trans. Signal Process. 59.11 (2011), pp. 5289–5301.


[68] V. Lauer. “New approach to optical diffraction tomography yielding a vector equationof diffraction tomography and a novel tomographic microscope”. In: J. Microsc. 205.2(2002), pp. 165–176.

[69] M. Leigsnering et al. “Multipath Exploitation in Through-the-wall Radar Imaging UsingSparse Reconstruction”. In: IEEE Trans. Aerosp. Electron. Syst. 50.2 (2014), pp. 920–939.

[70] P. Lezaud. “Chernoff-type bound for finite Markov chains”. In: The Annals of AppliedProbability (1998), pp. 849–867.

[71] H. Li and Z. Lin. “Accelerated Proximal Gradient Methods for Nonconvex Programming”. In: Proc. Advances in Neural Information Processing Systems 28. Montreal, Canada, 2015.

[72] M. Li and P. M. B. Vitanyi. An Introduction to Kolmogorov Complexity and Its Appli-cations. Springer-Verlag, New York, 2008.

[73] S. Z. Li. Markov random field modeling in image analysis. Springer Science & BusinessMedia, 2009.

[74] J. W. Lim et al. “Comparative study of iterative reconstruction algorithms for miss-ing cone problems in optical diffraction tomography”. In: Opt. Express 23.13 (2015),pp. 16933–16948.

[75] D. Liu, U. S. Kamilov, and P. T. Boufounos. “Compressive Tomographic Radar Imagingwith Total Variation Regularization”. In: Proc. IEEE 4th International Workshop onCompressed Sensing Theory and its Applications to Radar, Sonar, and Remote Sensing(CoSeRa 2016). Aachen, Germany, 2016, pp. 120–123.

[76] H.-Y. Liu et al. “Compressive Imaging with Iterative Forward Models”. In: Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP 2017). New Orleans, LA, USA, 2017, pp. 6025–6029.

[77] H.-Y. Liu et al. “SEAGLE: Sparsity-Driven Image Reconstruction under Multiple Scat-tering”. In: (2017). arXiv:1705.04281 [cs.CV].

[78] Y. Ma, D. Baron, and D. Needell. “Two-Part Reconstruction in Compressed Sensing”.In: Proc. IEEE Global Conf. Signal Inf. Process. Austin, TX, 2013.

[79] Y. Ma, D. Baron, and D. Needell. “Two-part reconstruction with noisy-sudocodes”. In:IEEE Trans. Signal Process. 62.23 (2014), pp. 6323–6334.


[80] Y. Ma, Y. M. Lu, and D. Baron. “Multiprocessor Approximate Message Passing withColumn-Wise Partitioning”. In: Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process.(ICASSP). New Orleans, LA, 2017.

[81] Y. Ma, C. Rush, and D. Baron. “Analysis of Approximate Message Passing with a Classof Non-Separable Denoisers”. In: Proc. IEEE Int. Symp. Inf. Theory (2017). Full version:https://arxiv.org/abs/1705.03126.

[82] Y. Ma, C. Rush, and D. Baron. “Analysis of Approximate Message Passing with Non-Separable Denoisers and Markov Random Field Priors”. In: (2017). Submitted to IEEETrans. Inf. Theory.

[83] Y. Ma, J. Zhu, and D. Baron. “Compressed sensing via universal denoising and ap-proximate message passing”. In: Proc. Allerton Conf. Commun., Control, and Comput.2014.

[84] Y. Ma, J. Zhu, and D. Baron. “Approximate message passing algorithm with universaldenoising and Gaussian mixture learning”. In: IEEE Trans. Signal Process. 65.21 (2016),pp. 5611–5622.

[85] Y. Ma et al. “Accelerated Image Reconstruction for Nonlinear Diffractive Imaging”. In:(2017). Submitted to IEEE ICASSP. Available online: https://arxiv.org/abs/1708.01663.

[86] Y. Ma et al. “Convergent and Accelerated Image Reconstruction for Nonlinear DiffractiveImaging”. In: (2017). Submitted to IEEE Trans. Image Process.

[87] J. MacQueen. “Some methods for classification and analysis of multivariate observa-tions”. In: Proc. 5th Berkeley Symp. Math. Stat. & Prob. Vol. 1. 14. 1967, pp. 281–297.

[88] C. Metzler, A. Maleki, and R. G. Baraniuk. “From Denoising to Compressed Sensing”.In: IEEE Trans. Inf. Theory 62.9 (2016), pp. 5117 –5114.

[89] E. Mudry et al. “Electromagnetic wave imaging of three-dimensional targets using a hybrid iterative inversion method”. In: Inv. Probl. 28.6 (2012), p. 065007.

[90] D. Needell and J. A. Tropp. “CoSaMP: Iterative signal recovery from incomplete andinaccurate samples”. In: Appl. Computational Harmonic Anal. 26.3 (2009), pp. 301–321.

[91] J. Nocedal and S. J. Wright. Numerical Optimization. 2nd ed. Springer, 2006.

[92] V. Ntziachristos. “Going deeper than microscopy: the optical imaging frontier in biol-ogy”. In: Nat. Methods 7.8 (2010), pp. 603–614.


[93] Z. Peng, M. Yan, and W. Yin. “Parallel and distributed sparse optimization”. In: Proc.IEEE 47th Asilomar Conf. Signals, Syst., and Comput. 2013, pp. 659–646.

[94] Y. Plan and R. Vershynin. “One-bit Compressed Sensing by Linear Programming”. In:Comm. Pure Appl. Math. 66 (8 2013), pp. 1275–1297.

[95] R.-E. Plessix. “A review of the adjoint-state method for computing the gradient of a functional with geophysical applications”. In: Geophysical Journal International 167.2 (2006), pp. 495–503.

[96] T. S. Ralston et al. “Inverse scattering for optical coherence tomography”. In: J. Opt.Soc. Am. A 23.5 (2006), pp. 1027–1037.

[97] S. Rangan, A. K. Fletcher, and V. K. Goyal. “Asymptotic analysis of MAP estimationvia the replica method and applications to compressed sensing”. In: IEEE Trans. Inf.Theory 58.3 (2012), pp. 1902–1923.

[98] S. Rangan, P. Schniter, and A. Fletcher. “On the convergence of approximate messagepassing with arbitrary matrices”. In: Proc. IEEE Int. Symp. Inf. Theory (ISIT). 2014,pp. 236–240.

[99] G. Reeves and H. D. Pfister. “The replica-symmetric prediction for compressed sensingwith Gaussian matrices is exact”. In: Proc. IEEE Int. Symp. Information Theory. 2016,pp. 665–669.

[100] G. O. Roberts and J. S. Rosenthal. “Geometric Ergodicity and Hybrid Markov Chains”.In: Electronic Communications in Probability (1997), pp. 13–25.

[101] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer Science & BusinessMedia, 2009.

[102] L. I. Rudin, S. Osher, and E. Fatemi. “Nonlinear total variation based noise removalalgorithms”. In: Physica D 60.1–4 (1992), pp. 259–268.

[103] C. Rush, A. Greig, and R. Venkataramanan. “Capacity-Achieving Sparse SuperpositionCodes via Approximate Message Passing Decoding”. In: IEEE Trans. Inf. Theory 63.3(2017), pp. 1476–1500.

[104] C. Rush and R. Venkataramanan. “Finite sample analysis of approximate message pass-ing”. In: Proc. IEEE Int. Symp. Inf. Theory (2015). Full version: https://arxiv.org/abs/1606.01800.

122

Page 133: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

[105] S. Sarvotham, D. Baron, and R. G. Baraniuk. “Sudocodes – Fast Measurement andReconstruction of Sparse Signals”. In: Proc. Int. Symp. Inf. Theory (ISIT). Seattle,WA, 2006.

[106] J. Sharpe et al. “Optical Projection Tomography as a Tool for 3D Microscopy and GeneExpression Studies”. In: Science 296.5567 (2002), pp. 541–545.

[107] K. Sivaramakrishnan and T. Weissman. “Universal Denoising of Continuous AmplitudeSignals with Applications to Images”. In: Proc. Int. Conf. Image Process. 2006, pp. 2609–2612.

[108] K. Sivaramakrishnan and T. Weissman. “Universal denoising of discrete-time continuous-amplitude signals”. In: IEEE Trans. Inf. Theory 54.12 (2008), pp. 5632–5660.

[109] K. Sivaramakrishnan and T. Weissman. “A context quantization approach to universaldenoising”. In: IEEE Trans. Signal Process. 57.6 (2009), pp. 2110–2129.

[110] S. Som and P. Schniter. “Compressive Imaging using Approximate Message Passing anda Markov-Tree Prior”. In: IEEE Trans. Signal Process. 60.7 (2012), pp. 3439–3448.

[111] E. Soubies, T.-A. Pham, and M. Unser. “Efficient inversion of multiple-scattering modelfor optical diffraction tomography”. In: Opt. Express 25.18 (2017), pp. 21786–21800.

[112] Y. Sung and R. R. Dasari. “Deterministic regularization of three-dimensional opticaldiffraction tomography”. In: J. Opt. Soc. Am. A 28.8 (2011), pp. 1554–1561.

[113] Y. Sung et al. “Optical Diffraction Tomography for high resolution live cell imaging”.In: Opt. Express 17.1 (2009), pp. 266–277.

[114] J. Tan, Y. Ma, and D. Baron. “Compressive imaging via approximate message passingwith wavelet-based image denoising”. In: Proc. IEEE Global Conf. Signal Inf. Process.Atlanta, GA, 2014.

[115] J. Tan, Y. Ma, and D. Baron. “Compressive imaging via approximate message passingwith image denoising”. In: IEEE Trans. Signal Process. 63.8 (2015), pp. 2085–2092.

[116] J. Tan et al. “Application of approximate message passing in coded aperture snapshotspectral imaging”. In: Proc. IEEE Global Conf. Signal Inf. Process. 2015.

[117] J. Tan et al. “Compressive hyperspectral imaging via approximate message passing”. In:IEEE J. Sel. Topics Signal Process. 10.2 (2016), pp. 389–401.

[118] J. Tan, Y. Ma, and D. Baron. “Compressive Imaging via Approximate Message Passingwith Image Denoising”. In: IEEE Trans. Signal Processing 63.8 (2015), pp. 2085–2092.

123

Page 134: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

[119] L. Tian et al. “Quantitative measurement of size and three-dimensional position of fast-moving bubbles in air–water mixture flows using digital holography”. In: Appl. Opt. 49.9(2010), pp. 1549–1554.

[120] P. M. van den Berg and R. E. Kleinman. “A contrast source inversion method”. In: Inv.Probl. 13.6 (1997), pp. 1607–1620.

[121] J. Vila and P. Schniter. “Expectation-maximization Gaussian-mixture approximate mes-sage passing”. In: IEEE Trans. Signal Process. 61.19 (2013), pp. 4658–4672.

[122] J. Vila, P. Schniter, and J. Meola. “Hyperspectral Unmixing via Turbo Bilinear Gen-eralized Approximate Message Passing”. In: IEEE Trans. Comput. Imag. 1.3 (2015),pp. 143–158.

[123] J. Vila et al. “Adaptive damping and mean removal for the generalized approximate mes-sage passing algorithm”. In: IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP).2015, pp. 2021–2025.

[124] A. Wagadarikar et al. “Spectral image estimation for coded aperture snapshot spectralimagers”. In: Proc. SPIE. 2008, p. 707602.

[125] X. Wang, D. B. Dunson, and C. Leng. “DECOrrelated feature space partitioning fordistributed sparse regression”. In: Neural Inf. Process. Syst. (NIPS). Ed. by D. D. Leeet al. 2016, pp. 802–810.

[126] M. Yan, Y. Yang, and S. Osher. “Robust 1-bit Compressive Sensing Using AdaptiveOutlier Pursuit”. In: IEEE Trans. Signal Process. 60.7 (2012), pp. 3868–3875.

[127] Z. Yang, L. Xie, and C. Zhang. “Variational Bayesian Algorithm for Quantized Com-pressed Sensing”. In: IEEE Trans. Signal Process. 61.11 (2013), pp. 2815–2824.

[128] T. Zhang et al. “Far-field diffraction microscopy at λ/10 resolution”. In: Optica 3.6(2016), pp. 609–612.

[129] Y. Zhou et al. “Parallel feature selection inspired by group testing”. In: Neural Inf.Process. Syst. (NIPS). 2014, pp. 3554–3562.

[130] J. Zhu and D. Baron. “Performance regions in compressed sensing from noisy measure-ments”. In: Proc. IEEE Conf. Inf. Sci. Syst. (CISS). Baltimore, MD, 2013.

[131] J. Zhu, D. Baron, and A. Beirami. “Optimal trade-offs in multi-processor approximatemessage passing”. In: Arxiv preprint arXiv:1601.03790 (2016).

124

Page 135: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

[132] J. Zhu, D. Baron, and M. F. Duarte. “Recovery from Linear Measurements with Complexity-Matching Universal Signal Estimation”. In: IEEE Trans. Signal Process. 63.6 (2015),pp. 1512–1527.

[133] J. Zhu, A. Beirami, and D. Baron. “Performance Trade-Offs in Multi-Processor Ap-proximate Message Passing”. In: Proc. IEEE Int. Symp. Inf. Theory (ISIT). Barcelona,Spain, 2016, pp. 680–684.

[134] J. Ziniel, S. Rangan, and P. Schniter. “A generalized framework for learning and recoveryof structured sparse signals”. In: Proc. IEEE Stat. Signal Process. Workshop (SSP). AnnArbor, MI, 2012, pp. 325–328.

125

Page 136: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

APPENDICES

126

Page 137: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Appendix A

Chapter 2 Appendix

A.1 Concentration Lemmas

In the following, ε > 0 is assumed to be a generic constant, with additional conditions specified

whenever needed. The proof of the Lemmas in this section can be found in [104].

Lemma A.1.1. (Concentration of Sums.) If random variables X1, . . . , XM satisfy P (|Xi| ≥ε) ≤ e−nκiε2 for 1 ≤ i ≤M , then

P

(∣∣∣∣∣M∑i=1

Xi

∣∣∣∣∣ ≥ ε)≤

M∑i=1

P(|Xi| ≥

ε

M

)≤Me−n(mini κi)ε

2/M2.

Lemma A.1.2. (Concentration of Square Roots.) Let c 6= 0. Then

If P(∣∣X2

n − c2∣∣ ≥ ε) ≤ e−κnε2 , then P (||Xn| − |c|| ≥ ε) ≤ e−κn|c|

2ε2 .

Lemma A.1.3. For a standard Gaussian random variable Z and ε > 0, P (|Z| ≥ ε) ≤ 2e−12ε2.

Lemma A.1.4. (χ2-concentration.) For Zi, i ∈ [n] that are i.i.d. ∼ N (0, 1), and 0 ≤ ε ≤ 1,

P

(∣∣∣∣∣ 1nn∑i=1

Z2i − 1

∣∣∣∣∣ ≥ ε)≤ 2e−nε

2/8.

Lemma A.1.5. [17] Let X be a centered sub-Gaussian random variable with variance factor

ν, i.e., lnE[etX ] ≤ t2ν2 , ∀t ∈ R. Then X satisfies:

127

Page 138: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

1. For all x > 0, P (X > x) ∨ P (X < −x) ≤ e−x2

2ν , for all x > 0.

2. For every integer k ≥ 1, E[X2k] ≤ 2(k!)(2ν)k ≤ (k!)(4ν)k.

A.2 Other Useful Lemmas

In this section, when the results are standard, they are presented without proof.

Lemma A.2.1. (Stein’s lemma.) For zero-mean jointly Gaussian random variables Z1, Z2, and

any function f : R → R for which E[Z1f(Z2)] and E[f ′(Z2)] both exist, we have E[Z1f(Z2)] =

E[Z1Z2]E[f ′(Z2)].

Lemma A.2.2. (Products of Lipschitz Functions are PL(2).) Let f : Rp → R and g : Rp → Rbe Lipschitz functions. Then the function h : Rp → R defined as h(x) := f(x)g(x) is PL(2).

Lemma A.2.3. Let Λ be defined in (2.5). For each r = 1, . . . , t, let τr > 0 be a constant and

let Zr = Zri i∈Λ have i.i.d. standard normal entries. Suppose f : R|Λ|(t+1) → R is PL(2) with

PL constant L, then the function f : RΛ → R defined as f(s) := EZ1,...,Zt[f(τ1Z

1, . . . , τtZt, s)

]is PL(2).

Proof. Take arbitrary x,y ∈ RΛ,

|f(x)− f(y)| = |E [f(τ1Z1, . . . , τtZt,x)− f(τ1Z1, . . . , τtZt,y)] |(a)

≤ E [|f(τ1Z1, . . . , τtZt,x)− f(τ1Z1, . . . , τtZt,y)|](b)

≤ E[L(1 + 2t∑

r=1

τr‖Zr‖+ ‖x‖+ ‖y‖)‖x− y‖]

(c)

≤ L(1 + 2|Λ|√

2

π

t∑r=1

τr + ‖x‖+ ‖y‖)‖x− y‖ ≤ L

(1 + 2|Λ|

√2

π

t∑r=1

τr

)(1 + ‖x‖+ ‖y‖)‖x− y‖.

In the above, step (a) follows from Jensen’s inequality, step (b) holds since f is PL(2) and using

the triangle inequality, and step (c) follows from E‖Zr‖ ≤∑

i∈Λ E |[Zr]i| = |Λ|√

2π .

Lemma A.2.4. Let Γ and Λ be defined in (2.4) and (2.5), respectively, and Λi be Λ translated

to be centered at i for each i ∈ Γ. Let f : RΛ → R be a PL(2) function with constant L.

Moreover, let fi : RΛi∩Γ → R be defined as fi(v) := f(Ti(v)), where Ti : RΛi∩Γ → RΛ is defined

in (2.10). Then fi is PL(2) for all i ∈ Γ.

128

Page 139: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Proof. Let i be an arbitrary but fixed index in Γ. Let d = |Λ|, ai := |Λi ∩ Γ|, and bi = d − ai.For any x,y ∈ RΛi∩Γ, we have that

|f(x)− f(y)| = |f(Ti(x))− f(Ti(y))|(a)

≤ L(1 + ‖Ti(x)‖+ ‖Ti(y))‖Ti(x)− Ti(y)‖

(b)= L

1 +

√√√√√bi

1

ai

∑j∈Λi∩Γ

xj

2

+ ‖x‖2 +

√√√√√bi

1

ai

∑j∈Λi∩Γ

yj

2

+ ‖y‖2

·

√√√√√bi

1

ai

∑j∈Λi∩Γ

xj − yj

2

+ ‖x− y‖2

(c)

≤ L

(1 +

√biai‖x‖2 + ‖x‖2 +

√biai‖y‖2 + ‖y‖2

)√biai‖x− y‖2 + ‖x− y‖2

=d

aiL(aid

+ ‖x‖+ ‖y‖)‖x− y‖ ≤ d

aiL(1 + ‖x‖+ ‖y‖)‖x− y‖,

where step (a) follows from the pseudo-Lipschitz property of f , step (b) from our definition of

Ti in (2.10), and step (c) from Lemma A.2.5.

Lemma A.2.5. For any scalars a1, ..., at and positive integer m, we have (|a1|+ . . .+ |at|)m ≤tm−1

∑ti=1 |ai|

m. Consequently, for any vectors u1, . . . ,ut ∈ RN ,∥∥∑t

k=1 uk∥∥2 ≤ t

∑tk=1 ‖uk‖

2.

A.3 Concentration with Dependencies for Theorem 2.2.1

We first state a concentration result exising in the literature about functions acting on random

fields that satisfy the Dobrushin uniqueness condition in Lemma A.3.1. Then we use Lemma

A.3.1 to obtain Lemma A.3.2, which is needed to prove Ht(b).

Lemma A.3.1. [66, Theorem 1] Suppose that the random field X = (Xi)i∈Γ taking values in

EΓ is distributed according to a Gibbs measure µ that obeys the Dobrushin uniqueness condition

with Dobrishin constant c, and also the transposed Dobrishin uniqueness condition with constant

c∗. Suppose that F is a real function on EΓ with E [exp(tF (X))] < ∞ for all real t. Then we

have

P (F (X)− E [F (X)] > r) ≤ exp

(−r

2

2

(1− c)(1− c∗)‖δ(F )‖2

`2

), ∀r ≥ 0. (A.1)

Here δ(F ) := (δi(F ))i∈Γ is the variation vector of F , where δi(F ) := supξ,ξ′;ξic=ξ′ic|F (ξ)− F (ξ′)|

denotes the variation of F at the site i. Its `2-norm is defined as ‖δ(F )‖2`2 :=∑

i∈Γ(δi(F ))2. If

129

Page 140: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

this norm is infinite, then the statement is empty (and thus correct).

Lemma A.3.2. Let Γ and Λ be defined in (2.4) and (2.5), respectively, and let X = (Xi)i∈Γ be

a stationary Markov random field with a unique distribution measure µ on EΓ ⊂ RΓ. Assume

that µ satisfies the Dobrushin uniqueness condition and the transposed Dobrushin uniqueness

condition with constants c and c∗, respectively. Suppose that the state space E is bounded,

meaning that there exists an M such that |x| ≤M , for all x ∈ E. Let fi : RΛi∩Γ → R, where Λi

is Λ being translated to be centered at location i ∈ Γ, be a PL(2) function with pseudo-Lipschitz

constant Li, for all i ∈ Γ. Then for all ε ∈ (0, 1) there exist K,κ > 0 such that

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(fi(XΛi∩Γ)− E [fi(XΛi∩Γ)])

∣∣∣∣∣ ≥ ε)≤ Ke−κ|Γ|ε2 . (A.2)

Proof. Let the function F in Lemma A.3.1 be defined as F (X) :=∑

i∈Γ fi(XΛi∩Γ). In order to

apply Lemma A.3.1, we need to calculate ‖δ(F )‖2`2 . Let d := |Λ| and L := maxi∈Γ Li, then we

have that

δi(F ) = supξ,ξ′∈EΓ

ξic=ξ′ic

∣∣F (ξ)− F (ξ′)∣∣ = sup

ξ,ξ′∈EΓ

ξic=ξ′ic

∣∣∣∣∣∣∑

j:i∈Λj∩Γ

fj(ξΛj∩Γ)− fj(ξ′Λj∩Γ)

∣∣∣∣∣∣(a)

≤ supξ,ξ′∈EΓ

ξic=ξ′ic

∑j:i∈Λj∩Γ

Lj(1 + ‖ξΛj∩Γ‖+ ‖ξ′Λj∩Γ‖)‖ξΛj∩Γ − ξ′Λj∩Γ‖

(b)

≤ dL(1 + 2√dM)2

√dM.

In the above, step (a) uses the triangle inequality and the pseudo-Lipschitz property of f . Step

(b) follows from the fact that |x| ≤M, for all x ∈ E and that Lj ≤ L for all j ∈ Γ.

Therefore,

‖δ(F )‖2`2 ≤ |Γ|(dL(1 + 2√dM)2

√dM)2.

Now applying Lemma A.3.1, we have

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(fi(XΛi∩Γ)− E [fi(XΛi∩Γ)])

∣∣∣∣∣ ≥ ε)≤ 2 exp

(− |Γ|ε2(1− c)(1− c∗)

(dL(1 + 2√dM)2

√dM)2

). (A.3)

130

Page 141: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Lemma A.3.3 provides a technical result about pseudo-Lipschitz functions with sub-Gaussian

inputs, which will be used to prove Lemma A.3.4.

Lemma A.3.3. Let X be a Rd-valued random vector whose entries have a sub-Gaussian

marginal distribution with variance factor ν as in Lemma A.1.5. Let X be an independent

copy of X. If f : Rd → R is a PL(2) function with pseudo-Lipschitz constant L, then the

expectation E [exp (rf(X))] satisfies the following for 0 < r < [5L(2dν + 24d2ν2)1/2]−1,

E[erf(X)] ≤ E[er(f(X)−f(X)] ≤ [1− 25r2L2(dν + 12d2ν2)]−1 ≤ e50r2L2(dν+12d2ν2). (A.4)

Proof. Without loss of generality, assume E[f(X)] = 0. By Jensen’s inequality, E[exp(−rf(X))] ≤exp(−rE[f(X)]) = 1. Therefore,

E [exp (rf(X))] ≤ E [exp (rf(X))]E[exp(rf(X))] = E[exp(r(f(X)− f(X)))],

which provides the first upper bound in (A.4). Next,

E[er(f(X)−f(X))](a)

≤ E[erL(1+‖X‖+‖X‖)‖X−X‖]

(b)=

∞∑q=0

(rL)q

q!E[((1 + ‖X‖+ ‖X‖)‖X− X‖)q]

(c)=∞∑k=0

(rL)2k

(2k)!E[((1 + ‖X‖+ ‖X‖)(‖X‖+ ‖X‖))2k], (A.5)

where step (a) follows pseudo-Lipschitz property, step (b) follows Fubini’s theorem, and step

(c) holds because the odd order terms are zero, along with the triangle inequality. Now consider

the expectation in the last term in the string given in (A.5).

E[((1 + ‖X‖+ ‖X‖)(‖X‖+ ‖X‖))2k] = E[(‖X‖+ ‖X‖+ ‖X‖2 + ‖X‖2 + 2‖X‖‖X‖)2k]

(d)

≤ 52k−1(2E‖X‖2k + 2E ‖X‖4k + 22kE[‖X‖2k‖X‖2k])(e)

≤ 52k−1(4(k!)(2dν)k + 4(2k)!(2dν)2k + 4(k!)2(4dν)2k). (A.6)

In the above, step (d) follows from Lemma A.2.5 and step (e) from another application of

131

Page 142: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Lemma A.2.5 and Lemma A.1.5. Now plugging (A.6) back into (A.5), we find

E[er(f(X)−f(X))] ≤∞∑k=0

(5rL)2k

5(2k)!(4(k!)(2dν)k + 4(2k)!(2dν)2k + 4(k!)2(4dν)2k)

(f)

≤ 1+4

5

∞∑k=1

(5rL)2k((dν)k+(2dν)2k+2k(2dν)2k)≤∞∑k=0

(25r2L2)k(dν + 12d2v2)k

(g)=

1

1− 25r2L2(dν + 12d2ν2)

(h)

≤ exp(50r2L2(dν + 12d2ν2)r2), for 0 < r <1

5L√

2dν + 24d2ν2,

where step (f) follows from the fact that 2k(k!)2 ≤ (2k)!, which can be seen by noting (2k)!k! =∏k

j=1(k + j) = k!∏kj=1

(kj + 1

)≥ (k!)2k, step (g) follows for 0 < r < [25L2(dν + 12d2ν2)]−1/2

providing the second bound in (A.4), and step (h) uses the inequality (1 − x)−1 ≤ e2x for

x ∈ [0, 1/2] for the final bound in (A.4).

Lemma A.3.4 provides a concentration inequality for sums of pseudo-Lipschitz functions

acting of overlapping subsets of jointly Gaussian random variables.

Lemma A.3.4. Let Γ and Λ be defined in (2.4) and (2.5), respectively. For each r = 1, . . . , t, let

Zr = Zri i∈Γ have i.i.d. N (0, 1) entries, and for all r, s = 1, . . . , t and i 6= j, Zri is independent

of Zsj . Moreover, for each i ∈ Γ, let (Z1i , . . . , Z

ti ) be jointly Gaussian with covariance matrix

K ∈ Rt×t.For each i ∈ Γ, define Yi := (Z1

Λi∩Γ, . . . ,ZtΛi∩Γ), where Λi is Λ being translated to be centered

at location i ∈ Γ. Let fi : R|Λi∩Γ|t → R be a PL(2) function for all i ∈ Γ. Then for all ε ∈ (0, 1),

there exist K,κ > 0 such that

P

(∣∣∣∣∣ 1

|Γ|∑i∈Γ

(fi(Yi)− E [fi(Yi)])

∣∣∣∣∣ ≥ ε)≤ Ke−κ|Γ|ε2 . (A.7)

Proof. In the following, we prove the case for p = 2 and the proof for p = 1, 3 follows sim-

ilarly. Without loss of generality, let i = (i1, i2) and Γ := (i1, i2)1≤i1,i2≤n, hence |Γ| = n2.

Further, assume without loss of generality that E [fi(Yi)] = 0, for all i ∈ Γ. In what follows we

demonstrate the upper-tail bound:

P

(1

|Γ|∑i∈Γ

fi(Yi) ≥ ε

)≤ Ke−κ|Γ|ε2 , (A.8)

132

Page 143: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

and the lower-tail bound follows similarly. Together they provide the desired result.

Using the Cramer-Chernoff method:

P

(1

|Γ|∑i∈Γ

fi(Yi) ≥ ε

)= P

(er∑i∈Γ fi(Yi) ≥ er|Γ|ε

)≤ e−r|Γ|εE

[er∑i∈Γ fi(Yi)

]∀r > 0.

(A.9)

Let d2 = |Λ| and Li be the pseudo-Lipschitz parameters associated with functions fi for i ∈ Γ

and define L := maxi∈Γ Li. In the following, we will show that

E[er∑i∈Γ fi(Yi)

]≤ exp

(κ′|Γ|r2

), for 0 < r <

(10Ld2

√2td2 + 24t2d4

)−1, (A.10)

where κ′ is any constant that satisfies κ′ ≥ 450L2d2(d2 + 12d4). Then plugging (A.10) into

(A.9), we can obtain the desired result in (A.8):

P

(1

|Γ|∑i∈Γ

fi(Yi) ≥ ε

)≤ exp

(−|Γ|(rε− κ′r2)

).

Set r = ε/(2κ′), the choice that maximizes the term (rε − κ′r2) over r in the exponent in the

above. We can ensure that ∀ε ∈ (0, 1), r falls within the region required in (A.10) by choosing

κ′ large enough.

We now show (A.10). Recall that i = (i1, i2) and we write Yi as Yi1,i2 in the following.

Define index sets

Ij1,j2 :=

(j1 + k1d, j2 + k2d)

∣∣∣∣ k1 = 0, ..., bn− j1dc, k2 = 0, ..., bn− j2

dc

(A.11)

for j1, j2 = 1, ..., d, let Cj1,j2 denote the cardinality of Ij1,j2 . We notice that for any fixed (j1, j2),

the Yi1,i2 ’s are i.i.d. for all (i1, i2) ∈ Ij1,j2 . Also, we have Γ = ∪dj1,j2=1Ij1,j2 , and Ij1,j2∩Is1,s2 = ∅,for (j1, j2) 6= (s1, s2), making the collection I1,1, I1,2, . . . , Id,d a partition of Γ. Therefore,

∑i∈Γ

fi(Yi) =

d∑j1,j2=1

∑(i1,i2)∈Ij1,j2

fi1,i2(Yi1,i2) =

d∑j1,j2=1

pj1,j2 ·1

pj1,j2

∑(i1,i2)∈Ij1,j2

fi1,i2(Yi1,i2),

133

Page 144: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

where 0 < pj1,j2 < 1 are probabilities satisfying∑d

j1,j2=1 pj1,j2 = 1. Using the above,

E

[exp

(r∑i∈Γ

fi(Yi)

)]= E

exp

d∑j1,j2=1

pj1,j2 ·r

pj1,j2

∑(i1,i2)∈Ij1,j2

fi1,i2(Yi1,i2)

(a)

≤d∑

j1,j2=1

pj1,j2E

exp

r

pj1,j2

∑(i1,i2)∈Ij1,j2

fi1,i2(Yi1,i2)

(b)=

d∑j1,j2=1

pj1,j2∏

(i1,i2)∈Ij1,j2

E[exp

(r

pj1,j2fi1,i2(Yi1,i2)

)](c)

≤d∑

j1,j2=1

pj1,j2 exp

(50Cj1,j2L

2r2(td2 + 12t2d4)

p2j1,j2

), (A.12)

where step (a) follows from Jensen’s inequality, step (b) from the fact that the Yi1,i2 ’s are

independent for (i1, i2) ∈ Ij1,j2 , and step (c) from Lemma A.3.3 with variance factor ν = 1 and

restriction

0 < r <(

5L√

2td2 + 24t2d4)−1

min(j1,j2)

pj1,j2 . (A.13)

Let pj1,j2 =√Cj1,j2/C, where C =

∑dj1,j2=1

√Cj1,j2 ensuring that

∑dj1,j2=1 pj1,j2 = 1. Then,

we have

d∑j1,j2=1

pj1,j2 exp

(50Cj1,j2L

2r2(td2 + 12t2d4)

p2j1,j2

)= e50C2L2r2(td2+12t2d4)

(a)

≤ e450d2n2L2(td2+12t2d4)r2 ≤ eκ′|Γ|r2,

whenever κ′ ≥ 450d2L2(td2 + 12t2d4). In the above, step (a) follows from:

C2 =

d∑j1,j2=1

√Cj1,j2

2

=d∑

j1,j2=1

Cj1,j2 +d∑

j1,j2=1

∑(k1,k2)6=(j1,j2)

√Cj1,j2Ck1,k2

(b)

≤ n2 + d2(d2 − 1)C1,1

(c)

≤ d2n2 + 4(d2 − 1)(d2 + nd) < 9d2n2,

where step (b) holds because C1,1 = max(j1,j2)Cj1,j2 and step (c) holds because

C1,1 =

(bn− 1

dc+ 1

)2

≤(nd

+ 2)2.

134

Page 145: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Finally, we consider the effective region for r as required in (A.13). Notice that

min(j1,j2)

pj1,j2 =

√Cd,d

C=

√Cd,d∑d

j1,j2=1

√Cj1,j2

≥√Cd,d

d2√C1,1

=1

d2

bn−dd c+ 1

bn−1d c+ 1

=1

d2

bnd cbn−1

d c+ 1≥ 1

d2

bn−1d c

bn−1d c+ 1

≥ 1

2d2.

Hence, if we require 0 < r <(

10Ld2√

2td2 + 24t2d4)−1

, then (A.13) is satisfied.

135

Page 146: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

A.4 Concentration with Dependencies for Theorem 2.5.1

We first list some notation that will be used frequently in the following. Let E ⊂ Rd for some

d ∈ N and π a probability measure on E. Let f : E → R be a measurable function. We use the

following notation:

The sup-norm: ‖f‖∞ := supx∈E |f(x)|;

The L2(π)-norm: for measurable function f , ‖f‖22,π :=∫E |f(x)|2π(dx); for signed measure

ν,

‖ν‖22,π :=

∫E

∣∣ dνdπ

∣∣2 dπ if ν π,

∞ otherwise,

where the L2(π)-norm for ν π 1 is induced from the inner-product: for µ, ν π,

〈µ, ν〉π :=

∫E

dπdπ. (A.14)

The expected value: Eπf :=∫E f(x)π(dx);

The set of all π-square-integrable measures:

L2(π) := ν π :

∫E

∣∣∣∣dνdπ∣∣∣∣2 dπ <∞;

The set of all zero-mean π-square-integrable functions:

L20(π) := f : R→ R : Eπf = 0, ‖f‖2,π <∞,

where the subscript 0 represents zero-mean.

The following lemma exists in the literature and is stated here, without proof, for complete-

ness. The proof can be found in the citation. Lemma A.4.1 tells us that if a Markov chain is

reversible and geometrically ergodic as defined in Definition 2.5.1, then its associated linear

operator has a spectral gap, the level of which controls the chain’s mixing time.

Lemma A.4.1. [100, Theorem 2.1] Consider a Markov chain with state space E, probability

transition measure r(x, dx′), stationary distribution measure γ, and linear operator R associated

1For two measures ν and γ, ν γ denotes the relationship that ν is absolutely continuous w.r.t. γ, and dνdγ

denotes the Radon-Nikodym derivative.

136

Page 147: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

with r(x, dx′) such that νR(dx) =∫E r(x

′, dx)ν(dx′) for measure ν. If the Markov chain is

reversible and geometrically ergodic (Definition 2.5.1), then R has an L2(γ) spectral gap. That

is, for each signed measure ν with ν(E) = 0 and ‖ν‖2,γ < ∞, there is a 0 < ρ < 1 such that

‖νR‖2,γ ≤ ρ‖ν‖2,γ .

Notice that the definition of spectral gap above is identical to the definition provided in

2.5.1. To see this, note that γ is an eigen-function of R with eigenvalue 1, R is self-adjoint

since the chain is reversible, and the eigen-functions of a self-adjoint operator are orthogonal,

hence the rest of the eigen-functions are in the space that is perpendicular to γ, which is

ν π |〈ν, γ〉γ = 0 , where 〈ν, γ〉γ =∫Edνdγ

dγdγdγ = ν(E) by the definition of inner-product in

(A.14).

Lemma A.4.2 says that if Xii∈N and Xii∈N are reversible, geometrically ergodic Markov

chains, which are independent of each other, then the process defined as (Xi, Xi)i∈N is also

reversible and geometrically ergodic.

Lemma A.4.2. Let Xii∈N be a time-homogeneous Markov chain on a state space E with

stationary distribution measure γ. Assume that Xii∈N is reversible, geometrically ergodic on

L2(γ) as defined in Definition 2.5.1. Let Xii∈N be an independent copy of Xii∈N. Then

the new sequence defined as (Xi, Xi)i∈N is a Markov chain on E × E that is reversible and

geometrically ergodic on L2(γ × γ).

Proof. Assume Xii∈N has transition probability measure r(x, dx′). Since Xii∈N is indepen-

dent of Xii∈N, we have that the transition probability measure and stationary distribution

measure of (Xi, Xi)i∈N are, respectively,

r((x, x), (dx′, dx′)) = r(x, dx′)r(x, dx′) and γ(dx, dx) = γ(dx)γ(dx).

In what follows, we demonstrate that r((x, x), (dx′, dx′)) and γ(dx, dx) satisfy the reversibility

and geometric ergodicity as defined in Definition 2.5.1.

The reversibility of the coupled chain follows from the reversibility of the individual chains:

r((x, x), (dx′, dx′))γ(dx, dx) = r(x, dx′)γ(dx)r(x, dx′)γ(dx) = r(x′, dx)γ(dx′)r(x′, dx)γ(dx′)

= r((x′, x′), (dx, dx))γ(dx′, dx′).

To prove geometric ergodicity, we want to show that there exists ρ < 1 such that for each

137

Page 148: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

probability measure ν = ν × ν ∈ L2(γ), there is Cν <∞ such that

sup(A,A)∈B(E×E)

∣∣∣∣∫E×E

rn(z, (A, A))ν(dz)− γ(A, A)

∣∣∣∣ ≤ Cνρ,where B(E × E) is the Borel sigma-algebra on E × E. Notice that∣∣∣∣∫

E×Ern(z, (A, A))ν(dz)− γ(A, A)

∣∣∣∣ =

∣∣∣∣∫Ern(x,A)ν(dx)

∫Srn(x, A)ν(dx)− γ(A)γ(A)

∣∣∣∣=

∣∣∣∣(∫Srn(x,A)ν(dx)− γ(A)

)(∫Srn(x, A)ν(dx)− γ(A)

)+γ(A)

(∫Srn(x,A)ν(dx)− γ(A)

)+ γ(A)

(∫Srn(x, A)ν(dx)− γ(A)

)∣∣∣∣(a)

≤∣∣∣∣∫Srn(x,A)ν(dx)− γ(A)

∣∣∣∣ ∣∣∣∣∫Srn(x, A)ν(dx)− γ(A)

∣∣∣∣+

∣∣∣∣∫Srn(x,A)ν(dx)− γ(A)

∣∣∣∣+

∣∣∣∣∫Srn(x, A)ν(dx)− γ(A)

∣∣∣∣ ,where step (a) used the triangle inequality and 0 ≤ γ(A) ≤ 1 for all A ∈ B(E). Taking the

supremum of both sides of the above inequality,

sup(A,A)∈B(E×E)

∣∣∣∣∫E×E

rn(z, (A, A))ν(dz)− γ(A, A)

∣∣∣∣≤ sup

A∈B(E)

∣∣∣∣∫Srn(x,A)ν(dx)− γ(A)

∣∣∣∣ supA∈B(E)

∣∣∣∣∫Srn(x, A)ν(dx)− γ(A)

∣∣∣∣+ supA∈B(E)

∣∣∣∣∫Srn(x,A)ν(dx)− γ(A)

∣∣∣∣+ supA∈B(E)

∣∣∣∣∫Srn(x, A)ν(dx)− γ(A)

∣∣∣∣(a)

≤ C2νρ

2n + 2Cνρn

(b)< (C2

ν + 2Cν)ρn,

where we have Cν := C2ν+2Cν <∞. Step (a) follows from the fact that Xii∈N is geometrically

ergodic and the definition of such in Definition 2.5.1 and step (b) since 0 < ρ < 1.

Lemma A.4.3 says that if the Markov chain Xii∈N is reversible and geometrically ergodic

then the process Yii∈N defined as Yi = (Xdi−d+1, ..., Xdi) has a spectral gap, the level of which

controls the process’s mixing time.

Lemma A.4.3. Let Xii∈N be a time-homogeneous Markov chain with stationary distribution

measure γ on E ⊂ R. Assume that Xii∈N is reversible, geometrically ergodic on L2(γ) as de-

138

Page 149: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

fined in Definition 2.5.1. Define Yii∈N as Yi = (Xdi−d+1, ..., Xdi) ∈ Ed, where d is an integer.

Then Yii∈N is a stationary, time-homogeneous Markov chain. Moreover, let the transition

probability measure and stationary distribution measure of Yii∈N be denoted by p(y, dy′) and

π, respectively, then the linear operator P defined as Ph(y) :=∫Ed h(y′)p(y, dy′) satisfies

βP := suph∈L2

0(π)

‖Ph‖2,π‖h‖2,π

< 1. (A.15)

Proof. The Markov property and time-homogeneous property follow directly by the construc-

tion of Yii∈N. We now verify that π is a stationary distribution for p(y, dy′). That is, we need

to show that∫Ed p(y, dy

′)π(dy) = π(dy′). Assume Xii∈N has transition probability measure

r(x, dx′). First we write p(y, dy′) and π in terms of r(x, dx′) and γ:

π(dy) = π(dy1, ..., dyd) =

d∏i=2

r(yi−1, dyi)γ(dy1)

p(y, dy′) = P(Y2 ∈ dy′|Y1 = y

)= P

(Xd+1 ∈ dy′1, ..., X2d ∈ dy′d|X1 = y1, ..., Xd = yd

)= P

(Xd+1 ∈ dy′1, ..., X2d ∈ dy′d|Xd = yd

)= r(yd, dy

′1)

d∏i=2

r(y′i−1, dy′i). (A.16)

Then we have∫y∈Ed

p(y, dy′)π(dy)(a)=

∫y∈Ed

r(yd, dy′1)

d∏i=2

r(y′i−1, dy′i)

d∏i=2

r(yi−1, dyi)γ(dy1)

=d∏i=2

r(y′i−1, dy′i)

∫y∈Ed

r(yd, dy′1)

d∏i=2

r(yi−1, dyi)γ(dy1)

(b)=

d∏i=2

r(y′i−1, dy′i)γ(dy′1) = π(dy′),

where step (a) follows from (A.16), and step (b) since γ is the stationary distribution measure

for r(x, dx′). Hence, we have verified that π is a stationary distribution measure for p(y, dy′).

We now prove (A.15). Note that βP is a property of the Markov chain Yii∈N. If Yii∈N was

reversible and geometrically ergodic, then we would be able show (A.15) using Lemma A.4.1

directly. However, Yii∈N is not reversible, hence, we instead relate βP to a similar property

for the original Xii∈N chain, which we assume is reversible and geometrically ergodic, then

use Lemma A.4.1.

139

Page 150: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Take arbitrary h ∈ L20(π), we have

‖Ph‖22,π‖h‖22,π

=

∫Ed

(∫Ed h(y′)p(y, dy′)

)2π(dy)∫

Ed h2(y)π(dy)

. (A.17)

First consider the numerator of (A.17). Plugging in the expressions for p(y, dy′) and π(dy)

defined in (A.16), we write the numerator as

∫Ed

(∫Edh(y′)r(yd, dy

′1)

d∏i=2

r(y′i−1, dy′i)

)2 d∏i=2

r(yi−1, dyi)γ(dy1)

(a)=

∫E

(∫Edh(y′)r(yd, dy

′1)

d∏i=2

r(y′i−1, dy′i)

)2

γ(dyd)

(b)=

∫E

(∫Eh(y′1)r(yd, dy

′1)

)2

γ(dyd)(c)= ‖Rh‖22,γ . (A.18)

Step (a) holds because γ is the stationary distribution measure for r(x, dx′) and the integrand

inside the square does not involve (y1, ..., yd−1). In step (b), the function h : R → R is defined

as

h(y′1) :=

∫Ed−1

h((y′1, ..., y′d))

d∏i=2

r(y′i−1, dy′i). (A.19)

In step (c), the operator R is defined as Rh(x) :=∫E h(x′)r(x, dx′).

We next show that h ∈ L20(γ) for h defined in (A.19). Notice that∫

Eh(y′1)γ(dy′1)

(a)=

∫Edh((y′1, ..., y

′d))π(dy′1, ..., dy

′d) =

∫Edh(y′)π(dy′)

(b)= 0.

Step (a) follows by plugging in the definition of h given in (A.19) and the expression for π from

(A.16). Step (b) holds because h ∈ L20(π). The fact that ‖h‖2,γ < ∞ follows by an application

of Jensen’s Inequality and the original assumption ‖h‖2,π <∞. Hence, h ∈ L20(γ).

140

Page 151: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Next we consider the denominator of (A.17).

∫Edh2(y)π(dy) =

∫Edh2((y1, ..., yd))

d∏i=2

r(yi−1, dyi)γ(dx1)

(a)

≥∫E

(∫Ed−1

h((y1, ..., yd))d∏i=2

r(yi−1, dyi)

)2

γ(dy1)

(b)=

∫Eh2(y1)γ(dy1) = ‖h‖22,γ , (A.20)

where step (a) follows from Jensen’s inequality and step (b) uses the definition of h given in

(A.19). Combining (A.18) and (A.20), we have ∀h ∈ L20(π),

‖Ph‖2,π‖h‖2,π ≤

‖Rh‖2,γ‖h‖2,γ

, where h is defined

in (A.19) and we have h ∈ L20(γ) as demonstrated above. Let H ⊂ L2

0(γ) be the collection of

functions defined in (A.19) for all h ∈ L20(π). Then we have

βP = suph∈L2

0(π)

‖Ph‖2,π‖h‖2,π

≤ suph∈H

‖Rh‖2,γ‖h‖2,γ

(a)

≤ suph∈L2

0(γ)

‖Rh‖2,γ‖h‖2,γ

= βR, (A.21)

where step (a) holds because H ⊂ L20(γ).

Finally, let us show βR < 1. By Lemma A.4.1, we have that for each signed measure ν ∈ L2(γ)

with ν(E) = 0, we have

∫E

∣∣∣∣d(νR)

∣∣∣∣2 dγ ≤ ρ ∫E

∣∣∣∣dνdγ∣∣∣∣2 dγ. (A.22)

Define h := dν/dγ, which is well-defined since ν γ. By the reversibility, we have∫E r(x

′, dx)ν(dx′)

γ(dx)=

∫E

r(x, dx′)ν(dx′)

γ(dx′)=

∫Eh(x′)r(x, dx′),

Therefore, (A.22) can be written as∫E

(∫E h(x′)r(x, dx′)

)2γ(dx) ≤ ρ

∫E(h(x))2γ(dx), for all ν

such that 0 = ν(E) =∫E(ν(dx)/γ(dx))γ(dx) =

∫E h(x)γ(dx). Therefore, βR = suph∈L2

0(γ)‖Rh‖2,γ‖h‖2,γ ≤

ρ < 1. We have shown the result of (A.15) by showing that that βP ≤ βR < 1.

The following lemma shows us that a normalized sum of pseudo-Lipschitz functions with

Markov chain input vectors concentrate at its expected value under certain conditions on the

Markov chain.

Lemma A.4.4. Let Uii∈N be a time-homogeneous, stationary Markov chain on a bounded

141

Page 152: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

state space E ⊂ R. Denote the transition probability measure of Uii∈N by r(x, dy) and station-

ary distribution measure by γ. Assume that the Markov chain is reversible and geometrically

ergodic on L2(γ) as defined in Definition 2.5.1.

Define Xii∈[n] as Xi = (Ui, ..., Ui+d−1) ∈ Ed. Let f : Rd → R be a measurable function that

satisfies the pseudo-Lipschitz condition. Then, for all ε ∈ (0, 1), there exists constants K,κ > 0

that are independent of n, ε, such that P(∣∣ 1n

∑ni=1 f(Xi)− Eπf

∣∣ ≥ ε) ≤ Ke−κnε2, where the

probability measure π is defined as π(dx) = π(dx1, ..., dxd) :=∏di=2 r(xi−1, dxi)γ(dx1).

Proof. First, we split Xii∈[n] into d subsequences, each containing every dth term of Xii∈[n],

beginning from 1, 2, . . . , d. Label these X(1)i i∈[n1], ..., X

(d)i i∈[nd] with X(s)

i i∈[ns] := Xs+kd :

k = 1, ..., ns, where ns = bn−d−s+1d c, for s = 1, ..., d.

Notice that∑n

i=1 f(Xi) =∑d

s=1

∑nsi=1 f(X

(s)i ). Using Lemma A.1.1, we have

P

(∣∣∣∣∣ 1nn∑i=1

f(Xi)− Eπf

∣∣∣∣∣ ≥ ε)≤

d∑s=1

P

(∣∣∣∣∣ 1

ns

ns∑i=1

f(X(s)i )− Eπf

∣∣∣∣∣ ≥ nε

dns

). (A.23)

In the following, without loss of generality, we assume Eπf = 0 and demonstrate the upper-

tail bound for X(1)i i∈[n1]:

P

(1

n1

n1∑i=1

f(X(1)i ) ≥ ε

)≤ Ke−κn1ε2 . (A.24)

The lower-tail bound follows similarly, so do the corresponding results for s = 2, 3, . . . , d. To-

gether using (A.23) these provide the desired result. Using the Cramer-Chernoff method: for

r > 0,

P

(1

n1

n1∑i=1

f(X(1)i ) ≥ ε

)= P

(expr

n1∑i=1

f(X(1))i ≥ exprn1ε

)

≤ exp−rn1εE

[expr

n1∑i=1

f(X(1)i )

].

(A.25)

In what follows we will upper bound the expectation E[er∑n1i=1 f(X

(1)i )]

to show (A.24).

Let X(1)i i∈[n1] be an independent copy of X(1)

i i∈[n1]. By Jensen’s inequality, we have

E

[exp

−r

n1∑i=1

f(X(1)i )

]≥ exp

−rE

[n1∑i=1

f(X(1)i )

]= exp

−r

n1∑i=1

E[f(X

(1)i )]

= 1.

142

Page 153: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Therefore,

E

[exp

r

n1∑i=1

f(X(1)i )

]≤ E

[exp

r

n1∑i=1

f(X(1)i )

]E

[exp

−r

n1∑i=1

f(X(1)i )

]

= E

[exp

r

n1∑i=1

(f(X

(1)i )− f(X

(1)i ))]

.

(A.26)

Let Z(1)i := (X

(1)i , X

(1)i ), and g(Z

(1)i ) := f(X

(1)i )− f(X

(1)i ) for i = 1, 2, . . . , n1. We have shown

E[expr∑n1

i=1 f(X(1)i )] ≤ E[expr

∑n1i=1 g(Z

(1)i )] and therefore, in what follows we provide an

upper bound for E[expr∑n1

i=1 g(Z(1)i )], which can be used in (A.25).

To bound E[expr∑n1

i=1 g(Z(1)i )], we begin by demonstrating some properties of the se-

quence Z(1)i i∈[n1], which will be used in the proof. By construction, Z(1)

i i∈[n1] is a time-

homogeneous Markov chain on state space D = Ed × Ed. Denote its marginal probability

measure by µ and transition probability measure by q(z, dz′). In order to obtain more useful

properties, it is helpful to relate Z(1)i i∈[n1] to the original Markov chain Uii∈N, which we

have assumed to be reversible and geometrically ergodic.

The construction of Z(1)i i∈[n1] can alternatively be thought of as follows. Let Uii∈N

be an independent copy of Uii∈N. Then by Lemma A.4.2, (Ui, Ui)i∈N is reversible and

geometrically ergodic. Also notice that the elements of Z(1)i i∈[n1] consist of successive non-

overlapping elements of (Ui, Ui)i∈N, same as the construction of Yii∈N in Lemma A.4.3.

Therefore, the results in Lemma A.4.3 imply that the marginal probability measure µ is a

stationary measure of the transition probability measure q(z, dz′). Moreover, the linear operator

Q defined as

Qh(z) :=

∫Dh(z′)q(z, dz′) (A.27)

satisfies:

βQ := suph∈L2

0(µ)

‖Qh‖2,µ‖h‖2,µ

< 1. (A.28)

With the result βQ < 1, we are now ready to bound E[expr∑n1

i=1 g(Z(1)i )], where we will use

a method similar to the one introduced in [70, Section 4].

Define m(z) := exp (rg(z)), for all z ∈ D, and so we can represent the expectation that we

hope to upper bound in the following way:

E[exprn1∑i=1

g(Z(1)i )] = E

[n1∏i=1

m(Z(1)i )

]. (A.29)

143

Page 154: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

To provide an upper bound for (A.29), we first define a sequence aii∈[n1] as a0 = 1 and

ai = E[expri∑

j=1

g(Z(1)j )] = E

i∏j=1

m(Z(1)j )

, for 1 ≤ i ≤ n1. (A.30)

Note then that an1 equals the expectation in (A.29) and we have

an1 = E

[n1∏i=1

m(Z(1)i )

](a)=

∫Dn1

µ(dz1)m(z1)

n1∏i=2

q(zi−1, dzi)m(zi)

=

∫Dn1−1

µ(dz1)m(z1)

n1−1∏i=2

q(zi−1, dzi)m(zi)

∫Dq(zn1−1, dzn1)m(zn1).

(A.31)

In step (a) we use the fact that Z(1)i i∈[n1] is a Markov Chain in its stationary distribution,

µ, with probability transition measure q(z, dz′). Now, let b1 := Eµm, which is a constant value,

and m1 := m− b1. Then m(zn1) = b1 +m1(zn1), and so it follows from (A.31),

an1 =

∫Dn1−1

µ(dz1)m(z1)

n1−1∏i=2

q(zi−1, dzi)m(zi)

∫Dq(zn1−1, dzn1) (b1 +m1(zn1))

= b1

∫Dn1−1

µ(dz1)m(z1)

n1−1∏i=2

q(zi−1, dzi)m(zi)

+

∫Dn1−1

µ(dz1)m(z1)

n1−1∏i=1

q(zi−1, dzi)m(zi)

∫Dq(zn1−1, dzn1)m1(zn1)

(b)= an1−1b1 +

∫Dn1−1

µ(dz1)m(z1)

n1−1∏i=2

q(zi−1, dzi)m(zi)Qm1(zn1−1). (A.32)

Step (b) uses the definition of an1−1 given in (A.30) and the linear operator defined in (A.27).

Now consider the integral in (A.32), which we split as in (A.31) in the following:

∫Dn1−1

µ(dz1)m(z1)

n1−1∏i=2

q(zi−1, dzi)m(zi)Qm1(zn1−1)

=

∫Dn1−2

µ(dz1)m(z1)

n−2∏i=2

q(zi−1, dzi)m(zi)

∫Dq(zn1−2, dzn1−1)m(zn1−1)Qm1(zn1−1). (A.33)

Then by defining b2 := Eµ [mQm1], m2 := mQm1 − b2, and using (A.33), we can represent an1

144

Page 155: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

as the following in a similar way as in (A.32):

an1 = an1−1b1 + an1−2b2 +

∫Dn1−2

µ(dz1)m(z1)

n−2∏i=2

q(zi−1, dzi)m(zi)Qm2(zn1−2). (A.34)

Continuing in this way – defining constant values bi := Eµ [mQmi−1] and mi := mQmi−1 − bifor i = 2, ..., n1, then splitting the integral as in (A.33) – we represent an1 recursively as

an1 =∑n1

i=1 bian1−i.

Again, our goal is to provide an upper bound for an1 which we can establish through the

recursive relationship an1 =∑n1

i=1 bian1−i if we can upper bound b1, ..., bn1 . First consider b1.

Let Z ∼ µ.

b1 = E [exprg(Z)] = E

[limn→∞

n∑k=0

rk

k!(g(Z))k

].

Consider the partial sum∑n

k=0rk

k! (g(Z))k. Moreover, notice that

supz∈D|g(z)| = sup

x∈Edsupx∈Ed

|f(x)− f(x)|(a)

≤ supx∈Ed

supx∈Ed

L(1 + ‖x‖+ ‖x‖)‖x− x‖

(b)

≤ L(1 + 2√dM)(2

√dM),

where step (a) holds since f(·) is pseudo-Lipschitz with constant L and step (b) due to ‖x−x‖ ≤‖x‖+ ‖x‖ and the boundedness of Ed: ‖x‖ ≤ M

√d for some constant M > 0 and all x ∈ Ed.

Let Mg = L(1 + 2√dM)(2

√dM). Then for each n,

n∑k=0

rk

k!(g(Z))k ≤ sup

z∈D

n∑k=0

rk

k!|g(z)|k ≤

n∑k=0

rk

k!Mkg ≤

∞∑k=0

rk

k!Mkg = exprMg.

Since the constant exprMg is integrable w.r.t. any proper probability measure, we have

b1 = E

[limn→∞

n∑k=0

rk

k!(g(Z))k

](a)= lim

n→∞

n∑k=0

rk

k!E[(g(Z))k]

(b)

≤ 1 + E[(g(Z))2]∞∑k=2

rkMk−2g

k!= 1 +

r2E[(g(Z))2]

2

∞∑k=2

(rMg)k−2

k!/2

(c)

≤ 1 +r2E[(g(Z))2]

2

∞∑k=2

(rMg)k−2

(k − 2)!= 1 +

r2E[(g(Z))2]

2exprMg,

(A.35)

145

Page 156: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

where step (a) follows the dominated convergence theorem, step (b) holds since E[g(Z)] = 0

and E[(g(Z))k] ≤Mk−2g E[(g(Z))2], and step (c) holds since (k − 2)! = k!/(k(k − 1)) ≤ k!/2 for

k ≥ 2 with the convention 0! = 1.

Next we’ll bound bi for i = 2, 3, . . .. To do this we first establish an upper bound on ‖mi‖2,µwith the norm defined in (A.28),

‖mi‖2,µ = ‖mQmi−1 − bi‖2,µ =√‖mQmi−1‖22,µ − b2i ≤ ‖mQmi−1‖2,µ

(a)

≤ exprMg‖Qmi−1‖2,µ(b)

≤ exprMgβQ‖mi−1‖2,µ.

Step (a) holds since supz∈Dm(z) = supz∈D exprg(z) ≤ exprMg. Step (b) holds since

Eµmi = 0, for all i = 1, ..., n by construction, and so ‖Qmi‖2,µ ≤ βQ‖mi‖2,µ by (A.28). Hence,

extending the above result recursively, we find

‖mi‖2,µ ≤ (exprMgβQ)i−1‖m1‖2,µ. (A.36)

Let 〈f1, f2〉µ =∫f1(z)f2(z)µ(dz). We use this to bound bi in the following by noting that

bi = Eµ[mQmi−1] = 〈m,Qmi−1〉µ = 〈m1+b1, Qmi−1〉µ = 〈m1, Qmi−1〉µ, where the last equality

holds because

〈b1, Qmi−1〉µ = b1

∫z∈D

Qmi−1(z)µ(dz) = b1

∫z∈D

∫z′∈D

mi−1(z′)q(z, dz′)µ(dz)

(a)= b1

∫z′∈D

mi−1(z′)

∫z∈D

q(z, dz′)µ(dz)

(b)= b1

∫z∈D

mi−1(z′)µ(dz′)(c)= 0.

In the above, step (a) follows from Fubini’s Theorem, step (b) follows from the fact that µ is

the stationary distribution of q(z, dz′), and step (c) follows from the construction of mi’s, which

says that Eµmi = 0, for i = 2, 3, . . .. Then,

bi = 〈m1, Qmi−1〉µ(c)

≤ ‖m1‖2,µ‖Qmi−1‖2,µ(d)

≤ βQ(βQerMg)i−2‖m1‖22,µ, (A.37)

where step (c) follows Cauchy-Schwarz inequality and step (d) follows from the fact that

‖Qmi−1‖2,µ ≤ βQ‖mi−1‖2,µ by (A.28) and (A.36). Now let Z ∼ µ and we bound ‖m1‖22,µ

146

Page 157: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

as follows

‖m1‖22,µ = E[e2rg(Z)]− (E[erg(Z)])2(f)

≤ 1 + 2r2E[(g(Z))2]e2rMg − e2rE[g(Z)]

(g)= 2r2E[(g(Z))2]e2rMg , (A.38)

where step (f) uses similar approach to that used to bound b1 in (A.35) and Jensen’s inequality,

and step (g) follows since E[g(Z)] = 0.

Therefore, from (A.35), (A.37), and (A.38) we have

b1 ≤ 1 +r2E[(g(Z))2]

2exprMg and bi ≤ βQ(βQ exprMg)i−22r2E[(g(Z))2] exp2rMg.

(A.39)

Let X, X ∼ π independent. Notice that

E[(g(Z))2] = E[(f(X)− f(X))2](a)

≤ L2E[((1 + ‖X‖+ ‖X‖)‖X − X‖)2]

(b)

≤ 5L2(

2E[‖X‖2] + 2E[‖X‖4] + 4E[‖X‖2]E[‖X‖2])

(c)

≤ 10L2

(d∑i=1

E[X2i ] + d

d∑i=1

E[X4i ] + 2

(d∑i=1

E[X2i ]

)(d∑i=1

E[X2i ]

))(d)= 10L2

(dm2 + d2m4 + 2d2m2

2

),

where step (a) holds since f(·) is pseudo-Lipschitz with constant L > 0, step (b) uses ‖X−X‖ ≤‖X‖ + ‖X‖, Lemma A.2.5, and the fact that X and X are i.i.d., step (c) uses Lemma A.2.5,

and in step (d), m2 and m4 denote the second and fourth moment of γ, respectively. Because γ

is defined on a bounded state space, m2 and m4 are finite.

Let b2 = 10L2(dm2 + d2m4 + 2d2m2

2

), a = 1

2b2 exprMg, and α = βQ exprMg. Choose

r < (1 − βQ)/Mg, then we have 0 < α < 1 since 1 − βQ < − lnβQ. Using these bounds and

notation, (A.39) becomes

b1 ≤ 1 + ar2 and bi ≤ αi−14ar2. (A.40)

We now bound a1, ..., an1 by induction. We will show ai ≤ [φ(r)]i, where φ(r) = 1 +Cr2 for

some C ≥ 4a that is independent of i. For i = 1, a1 = b1 ≤ 1 + 4ar2. Hence, the hypothesis

147

Page 158: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

ai ≤ [φ(r)]i is true for i = 1. Suppose that the hypothesis is true for i ≤ n1 − 1, then

an1 = b1an1−1 +

n1∑i=2

bian1−i ≤ (1 + 4ar2)[φ(r)]n1−1 +

n1∑i=2

4ar2αi−1[φ(r)]n1−i, (A.41)

where the final inequality in the above follows by (A.40) and the inductive hypothesis. Consider

only the second term on the right side of (A.41),

n1∑i=2

4ar2αi−1[φ(r)]n1−i = 4ar2αn1−1n1∑i=2

[α−1φ(r)]n1−i

= 4ar2αn1−1

(1−

(φ(r)α−1

)n1−1

1− φ(r)α−1

)= 4ar2

(α[φ(r)]n1−1 − αn1

φ(r)− α

)≤ 4ar2α[φ(r)]n1−1

φ(r)− α,

where the final inequality follows since a, α > 0. Then plugging the above result into (A.41),

we find

an1 ≤ (1 + 4ar2)[φ(r)]n1−1 +4ar2α[φ(r)]n1−1

φ(r)− α

≤ [φ(r)]n1−1

(1 +

4ar2φ(r)

φ(r)− α

)≤ [φ(r)]n1−1

(1 +

4ar2

1− α

),

where the final inequality follows since φ(r) ≥ 1. Therefore, let C = 4a(1 − α)−1 > 4a, since

0 < α < 1, and so φ(r) = 1 + 4ar2(1− α)−1. It follows from the above then,

an1 ≤(

1 +4ar2

1− α

)n1

= en1 ln(1+4ar2(1−α)−1) ≤ en14ar2(1−α)−1, (A.42)

where the final inequality uses the fact that ln(1 + x) ≤ x for x ≥ 0.

Finally, from (A.25), (A.26), and the bound in (A.42),

P

(1

n1

n1∑i=1

f(X(1)i ) ≥ ε

)≤ exp

(−n1

(rε− 4ar2(1− α)−1

))(a)= exp

(−n1

(rε− 2b2r2erMg

1− βQerMg

)),

where step (a) follows from the fact that a = b2erMg/2 and α = βQerMg . Now let us consider the

term in the exponent in the above for the cases where (i) b2 ≥Mg and (ii) b2 < Mg separately,

and then combine the results in the two cases to obtain a desired bound for all ε ∈ (0, 1).

148

Page 159: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

First (i) b2 ≥ Mg. Notice for every 0 < ε < 4b2/Mg, if we let r = (1 − βQ)ε/(4b2), then

r < (1 − βQ)/Mg as required before. We show that whenever 0 < ε ≤ b2/Mg, we can obtain a

desired bound. Then the condition in the lemma statement, ε ∈ (0, 1), falls within this effective

region,

rε− 2b2r2erMg

1− βQerMg

(a)=

(1− βQ)ε2

4b2 −(1− βQ)2ε2

8b2 ·exp

((1−βQ)Mgε

4b2

)1− βQ exp

((1−βQ)Mgε

4b2

)=

(1− βQ)ε2

8b2

1−exp

((1−βQ)Mgε

4b2

)− 1

1− βQ exp(

(1−βQ)Mgε

4b2

)

(b)

≥(1− βQ)ε2

8b2

1−(1−βQ)Mgε

3b2

1− βQ(

1 +(1−βQ)Mgε

3b2

)

=(1− βQ)ε2

8b2

(1− Mgε

2b2 + (b2 − βQMgε)

)(c)

≥(1− βQ)ε2

8b2

(1− ε

2

), for 0 < ε ≤ b2/Mg.

In the above, step (a) holds by plugging in r = (1−βQ)ε/(4b2), step (b) holds since ex ≤ 1+4x/3

for x ≤ 1/2, and step (c) holds since ε ≤ b2/Mg, so (b2 − βQMgε) > 0, and b2 ≥Mg.

Next consider (ii) b2 < Mg. In this case, set r = (1− βQ)ε/(4Mg). Hence, r < (1− βQ)/Mg

for ε ∈ (0, 1), and then

rε− 2b2r2erMg

1− βQerMg

(a)> rε− 2Mgr

2erMg

1− βQerMg

(b)=

(1− βQ)ε2

4Mg−

(1− βQ)2ε2

8Mg·

exp(

(1−βQ)ε4

)1− βQ exp

((1−βQ)ε

4

)=

(1− βQ)ε2

8Mg

1−exp

((1−βQ)ε

4

)− 1

1− βQ exp(

(1−βQ)ε4

)

(c)

≥(1− βQ)ε2

8Mg

(1− ε

2

), for 0 < ε ≤ 1.

In the above, step (a) holds since b2 < Mg, step (b) by plugging in r = (1 − βQ)ε/(4Mg), and

step (c) follows similar calculation as in case (i).

Combining the results in the two cases, we conclude that for all ε ∈ (0, 1), the following is

149

Page 160: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

satisfied:

rε− 2b2r2erMg

1− βQerMg≥

(1− βQ)ε2

8 max(Mg, b2)

(1− ε

2

).

Hence, for ε ∈ (0, 1),

P

(1

n1

n1∑i=1

f(X(1)i ) ≥ ε

)≤ exp

(−(1− βQ)n1ε

2

8 max(Mg, b2)

(1− ε

2

))≤ exp

(−(1− βQ)n1ε

2

16 max(Mg, b2)

). (A.43)

Therefore, using (A.23) and the fact that we can show a similar result for each s = 2, 3, . . . , d,

we have for ε ∈ (0, 1),

P

(∣∣∣∣∣ 1nn∑i=1

f(Xi)− Eπf

∣∣∣∣∣ ≥ ε)≤

d∑s=1

P

(∣∣∣∣∣ 1

ns

ns∑i=1

f(X(s)i )− Eπf

∣∣∣∣∣ ≥ nε

dns

)(a)

≤d∑s=1

exp

(−(1− βQ)n2ε2

16ns max(Mg, b2)

)(b)

≤ d exp

(−(1− βQ)nε2

16dmax(Mg, b2)

), (A.44)

where step (a) follows (A.43) and step (b) holds since n/ns ≥ n/n1 = n/(bn/dc − 1) ≥ d, for

all s ∈ [d]. To complete the proof, we remind the reader that b2 = 10L2(dm2 + d2m4 + 2d2m2

2

)and Mg = L(1 + 2

√dM)(2

√dM).

150

Page 161: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Appendix B

Chapter 3 Appendix

B.1 Derivation of (3.4)

To simplify notation, we drop the superscript (l), which denotes the subsequence index, in

the following derivation. We follow the derivation in [41]. Denoting θ = αr, µr, σ2rRr=1, the

MML-based criterion is

L(s, θ) =m

2

∑r:αr>0

log(Nαr) +Rnz

2log(N)− log (f(s|θ)) , (B.1)

where m = 2 is the number of parameters per Gaussian component, and Rnz is the number

of components with nonzero mixing probability αr. The first term is the coding length of

µr, σ2rRnzr=1, because the expected number of data points that are from the rth component is

Nαr, hence the effective sample size for estimating µr, σ2r is Nαr. The second term is the

coding length of αr’s, because αr’s are estimated from N data points. The third term is the

coding length of the data sequence s. The complete data expression for log (p(s|θ)) is

log (p(s, x, z|θ)) =

N∑i=1

log(p(si, xi, z

(1)i , ..., z

(R)i |θ)

)=

N∑i=1

log(p(si, xi|z(1)

i , ..., z(R)i , θ)p(z

(1)i , ..., z

(R)i ))

=

N∑i=1

log

(R∏r=1

p(si, xi|µr, σ2r)z

(r)i

R∏r=1

αz

(r)ir

)=

N∑i=1

R∑r=1

z(r)i log

(αrp(si, xi|µr, σ2

r))

=

N∑i=1

R∑r=1

z(r)i log

(αrp(si|xi)p(xi|µr, σ2

r))

=N∑i=1

R∑r=1

z(r)i log

(αrN (si;xi, σ

2v)N (xi;µr, σ

2r )).

151

Page 162: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Replace log (p(s|θ)) in (B.1) with log (p(s, x, z|θ)):

L(s, x, z, θ) =m

2

∑r:αr>0

log(Nαr) +Rnz

2log(N)−

N∑i=1

R∑r=1

z(r)i log

(αrN (si;xi, σ

2v)N (xi;µr, σ

2r )).

Let θ(t) = (α1(t), ..., αR(t), µ1(t), ..., µR(t), σ21(t), ..., σ2

R(t)) be the estimate of θ at iteration t.

E[log (p(s, x, z|θ)) |s, θ(t)] = C +N∑i=1

R∑r=1

log (αr)E[Z

(r)i |s, θ(t)

]+

N∑i=1

R∑r=1

E[Z

(r)i

(−1

2log(2πσ2

v)−(Xi − si)2

2σ2v

)|s, θ(t)

]

+

N∑i=1

R∑r=1

E[Z

(r)i

(−1

2log(2πσ2

r )−(Xi − µr)2

2σ2r

)|s, θ(t)

]

=N∑i=1

R∑r=1

log (αr)E[Z

(r)i |s, θ(t)

]− 1

2

N∑i=1

R∑r=1

(log(2πσ2

r )E[Z

(r)i |s, θ(t)

]+

1

σ2r

E[Z

(r)i (Xi−µr)2|s, θ(t)

]),

where C is a constant that does not depend on θ.

E[Z

(r)i |s, θ(t)

]= P

(Z

(r)i = 1|s, θ(t)

)=

αr(t)N (si; µr(t), σ2v + σr(t)

2)∑Rm=1 αm(t)N (si; µm(t), σ2

v + σm(t)2),

E[Z

(r)i Xi|s, θ(t)

]= E

[Xi|Z(r)

i = 1, s, θ(t)]P(Z

(r)i = 1|s, θ(t)

)=

(σ2r (t)

σ2r (t) + σ2

v

(si − µr(t)) + µr(t)

)P(Z

(r)i = 1|s, θ(t)

),

E[Z

(r)i X2

i |s, θ(t)]

= E[X2i |Z

(r)i = 1, s, θ(t)

]P(Z

(r)i = 1|s, θ(t)

)=

(σ2v σ

2r (t)

σ2r (t) + σ2

v

+

(σ2r (t)

σ2r (t) + σ2

v

(si − µr(t)) + µr(t)

)2)P(Z

(r)i = 1|s, θ(t)

).

Denote w(r)i (t) = E

[Z

(r)i |s, θ(t)

], a

(r)i (t) = σ2

r(t)σ2r(t)+σ2

v(si − µr(t)) + µr(t), and v

(r)i (t) = σ2

vσ2r(t)

σ2r(t)+σ2

v.

Then

E[log (p(s, x, z|θ)) |s, θ(t)] = C +

N∑i=1

R∑r=1

log (αr)w(r)i (t)

− 1

2

N∑i=1

R∑r=1

(log(2πσ2

r )w(r)i (t) +

w(r)i (t)

σ2r

(v

(r)i (t) +

(a

(r)i (t)− µr(t)

)2))

.

152

Page 163: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Therefore,

E[L(s, X, Z, θ)|s, θ(t)] = C ′ +m

2

∑r:αr>0

log(αr)−N∑i=1

∑r:αr>0

log (αr)w(r)i

+1

2

N∑i=1

∑r:αr>0

log(2πσ2r )w

(r)i (t) +

1

2

N∑i=1

∑r:αr>0

w(r)i (t)

σ2r

(v

(r)i (t) +

(a

(r)i (t)− µr(t)

)2), (B.2)

where C ′ is a constant that does not depend on θ.

Denote Q(θ, θ(t)) = E[L(s, x, z, θ)|s, θ(t)]. Setting the partial derivative of Q(θ, θ(t)) w.r.t.

µr and σ2r , respectively, to zero, we obtain

µr(t+ 1) =

∑Ni=1w

(r)i (t)a

(r)i (t)∑N

i=1w(r)i (t)

,

σ2r (t+ 1) =

∑Ni=1w

(r)i (t)

(v

(r)i (t) +

(a

(r)i (t)− µr(t+ 1)

)2)

∑Ni=1w

(r)i (t)

.

To estimate αr, notice that we have the constraints 0 ≤ αr ≤ 1,∀r and∑

r=1 αr = 1.

Collecting the terms in (B.2) that contain αr, we have

∑r:αr>0

log(αr)

(m

2−

N∑i=1

w(r)i

)= − log

∏r:αr>0

α

N∑i=1

w(r)i −

m2

r

,

which is the negative log likelihood of a quantity that is proportional to a Dirichlet pdf of

(αr, ..., αRnz), and its mode appears at

αr =

N∑i=1

w(r)i −

m2

Rnz∑r=1

(N∑i=1

w(r)i −

m2

) , N∑i=1

w(r)i −

m

2> 0.

Hence,

αr(t+ 1) =

max N∑i=1

w(r)i (t)− m

2 , 0

Rnz∑r=1

max N∑i=1

w(r)i (t)− m

2 , 0 .

153

Page 164: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

Appendix C

Chapter 4 Appendix

C.1 Proof of Lemma 4.1.2

To simplify the notation, we drop the superscript t or t1 in the following proof. That is, in the

following, Ql = Qt1l , Yl = Yt1

l , Ml = Mtl , and Xl = Xt

l .

First let us consider projections of a deterministic matrix. Let Al be a deterministic matrix

that satisfies the linear constraints Yl = AlQl and Xl = A∗lMl, then we have

Al = ApQl(Q∗lQl)

−1Q∗l + Al(I−Ql(Q∗lQl)

−1Q∗l ),

Al = Ml(M∗lMl)

−1M∗l Al + (I−Ml(M

∗lMl)

−1M∗l )Al.

Combining the two equations above, as well as the two linear constraints, we can write

Al = Yl(Q∗lQl)

−1Q∗l + Ml(M∗lMl)

−1Xl −Ml(M∗lMl)

−1M∗lYl(Q

∗lQl)

−1Q∗l + P⊥MlAlP

⊥Ql.

(C.1)

We now demonstrate the conditional distribution of A1, ...,AL. Let S1, ..., SL be arbitrary

Borel sets on Rn×N1 ,...,Rn×NL , respectively.

P (A1 ∈ S1, ...,AL ∈ SL |AlQl = Yl,A∗lMl = Xl,∀l ∈ [L] )

(a)= P

(Et1,t1 + P⊥M1

A1P⊥Q1∈ S1, ...,E

t1,tL + P⊥ML

ALP⊥QL∈ SL |AlQl = Yl,A

∗lMl = Xl, ∀l ∈ [L]

)(b)= P

(Et1,t1 + P⊥M1

A1P⊥Q1∈ S1, ...,E

t1,tL + P⊥ML

ALP⊥QL∈ SL

)= P

(Et1,t1 + P⊥M1

A1P⊥Q1∈ S1

)...P

(Et1,tL + P⊥ML

ALP⊥QL∈ SL

), (C.2)

154

Page 165: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

which implies the desired result. In step (a),

Et1,tl = Yl(Q∗lQl)

−1Q∗l + Ml(M∗lMl)

−1X∗l −Ml(M∗lMl)

−1M∗lYl(Q

∗lQl)

−1Q∗l , l = 1, ..., L,

which follows from (C.1). Step (b) holds since P⊥MlAlP

⊥Ql

is independent of the conditioning.

The independence is demonstrated as follows. Notice that AlQl = AlP||Ql

Ql. In what follows,

we will show that A||l := AlP

||Ql

is independent of A⊥r := ArP⊥Qr

, for l, r = 1, ..., L. Then

similar approach can be used to demonstrate that P⊥MlAl is independent of P

||Mr

Ar. Together

they provide the justification for step (b). Note that A||l and A⊥r are jointly normal, hence it is

enough to show they are uncorrelated.

E

[A||l ]i,j [A

⊥r ]m,l

= E

(N∑k=1

[Al]i,k[P‖Ql

]k,j

)(N∑k=1

[Ar]m,k

(Ik,l − [P

‖Qr

]k,l

))(a)=

1

nδ0(i,m)δ0(l, r)

(N∑k=1

[P‖Ql

]k,j Ik,l −N∑k=1

[P‖Ql

]k,j [P‖Qt1r

]k,l

)(b)=

1

nδ0(i,m)δ0(l, r)

([P‖Ql

]l,j −N∑k=1

[P‖Ql

]k,j [P‖Qr

]l,k

)(c)= 0,

where δ0(i, j) is the Kronecker delta function. In the above, step (a) holds since the original

matrix A has N (0, 1/n) i.i.d. entries, step (b) holds since projectors are symmetric matrices,

and step (c) follows the property of projectors P2 = P.

155

Page 166: ABSTRACT - Nc State UniversityABSTRACT MA, YANTING. Solving Large-Scale Inverse Problems via Approximate Message Passing and Optimization. (Under the direction of Dror Baron.) This

C.2 Proof of Lemma 4.2.1

We will show that with the problem setting described in Section 4.2.2.1, y1 is asymptotically

independent in the limit of large N . The subscript that represents Part 1 is dropped in the

following analysis. Denote the characteristic function of x by Ψx(t) = E[eitx]. We will show

that for any constant m ≤ n,

limN→∞

Ψy1...ym(t1, ..., tm) = limN→∞

Ψy1(t1)...Ψym(tm), (C.3)

where Ψy1...ym(t1, ..., tm) = E[eit1y1+...+itcym

]is the joint characteristic function, and expecta-

tion is taken w.r.t. the joint probability density P (y1, ..., ym). The joint characteristic function

can be factorized as the product of the marginal characteristic functions as described in (C.3)

if and only if y1, ..., ym are independent [2].

To lighten the notation, we assume that the nonzero entries of the Bernoulli matrix $\mathbf{A}$ are ones (we adjusted the nonzero entries in Section 4.2 to make the input SNR in Parts 1 and 2 identical), and we ignore the i.i.d. measurement noise.¹ Under these simplifications, the signal model is
\[
\mathbf{y} = \mathbf{A}\mathbf{x} = \mathbf{A}_{*1}x_1 + \mathbf{A}_{*2}x_2 + \cdots + \mathbf{A}_{*N}x_N,
\]
where $\mathbf{A}_{*j}$ represents the $j$th column of $\mathbf{A}$. Define a sequence of random vectors $\mathbf{v}_j = \mathbf{A}_{*j}x_j$, $j\in[N]$. Notice that $\{\mathbf{v}_j\}_{j=1}^{N}$ are i.i.d. random vectors, and thus the characteristic function of the first $m$ entries of $\mathbf{y}$ is
\[
\Psi_{\mathbf{y}}(t_1,\ldots,t_m) = \left(\Psi_{\mathbf{v}_1}(t_1,\ldots,t_m)\right)^N. \tag{C.4}
\]

¹Note that if entries of $\mathbf{y}=\mathbf{A}\mathbf{x}$ are independent, then after adding an i.i.d. noise vector $\mathbf{w}$, entries of $\mathbf{y}'=\mathbf{y}+\mathbf{w}$ are still independent. Therefore, these simplifications do not affect the independence relation among entries of $\mathbf{y}$.

To establish (C.3), in the following we will show that
\[
\lim_{N\to\infty}\Psi_{y_1}(t_1)\cdots\Psi_{y_m}(t_m) = e^{d\left(e^{-\frac{1}{2}t_1^2}+\cdots+e^{-\frac{1}{2}t_m^2}-m\right)} \tag{C.5}
\]
and
\[
\lim_{N\to\infty}\Psi_{y_1\cdots y_m}(t_1,\ldots,t_m) = e^{d\left(e^{-\frac{1}{2}t_1^2}+\cdots+e^{-\frac{1}{2}t_m^2}-m\right)}. \tag{C.6}
\]

First, we show (C.5). For $m=1$, $\mathbf{v}_1$ is a scalar. Recall that the Bernoulli parameter of the Bernoulli matrix in Part 1 is $\frac{d}{sN}$ and the sparsity rate of $\mathbf{x}$ is $s$. Let $g(x)$ denote the pdf of a Gaussian random variable $x$ with mean 0 and variance 1. Denoting the probability distribution of $v_1$ by $P_{v_1}(u_1)$, we have
\[
P_{v_1}(u_1) = \frac{d}{N}\,g(u_1) + \left(1-\frac{d}{N}\right)\delta_0(u_1),
\quad\text{and}\quad
\Psi_{v_1}(t_1) = 1-\frac{d}{N}+\frac{d}{N}\,e^{-\frac{1}{2}t_1^2}.
\]

Therefore,
\[
\lim_{N\to\infty}\Psi_{y_1}(t_1) = \lim_{N\to\infty}\left(1+\frac{d}{N}\left(e^{-\frac{1}{2}t_1^2}-1\right)\right)^N = e^{d\left(e^{-\frac{1}{2}t_1^2}-1\right)},
\]
where the first equality follows from (C.4). Because $\lim_{N\to\infty}\Psi_{y_i}(t_i)$ exists for every $i\in[n]$, for any finite constant $m$ we have
\[
\lim_{N\to\infty}\Psi_{y_1}(t_1)\cdots\Psi_{y_m}(t_m) = \lim_{N\to\infty}\Psi_{y_1}(t_1)\cdots\lim_{N\to\infty}\Psi_{y_m}(t_m) = e^{d\left(e^{-\frac{1}{2}t_1^2}+\cdots+e^{-\frac{1}{2}t_m^2}-m\right)}.
\]
Hence, (C.5) is verified.
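The rate at which the finite-$N$ characteristic function approaches its limit can be seen directly by evaluating both expressions; a short Python sketch with illustrative values of $d$ and $t_1$ (not taken from the thesis):

import numpy as np

# Convergence of the finite-N characteristic function of y_1 to its limit
# (a sketch; d and t are illustrative values, not from the thesis).
d, t = 3.0, 1.2
limit = np.exp(d * (np.exp(-0.5 * t**2) - 1.0))
for N in [10, 100, 1000, 10000, 100000]:
    finite_N = (1.0 + (d / N) * (np.exp(-0.5 * t**2) - 1.0)) ** N
    print(N, abs(finite_N - limit))   # gap decays roughly like 1/N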

Next, we show (C.6). For $m=2$, $\mathbf{v}_1$ is a vector of length 2. Denoting the probability distribution of $\mathbf{v}_1$ by $P_{\mathbf{v}_1}(u_1,u_2)$, we have
\[
\begin{aligned}
P_{\mathbf{v}_1}(u_1,u_2) &= P_{\mathbf{v}_1}(u_1,u_2 \mid x_1=0)\,P(x_1=0) + P_{\mathbf{v}_1}(u_1,u_2\mid x_1\neq 0)\,P(x_1\neq 0)\\
&= \left(1+\frac{d^2}{sN^2}-\frac{2d}{N}\right)\delta_0(u_1)\delta_0(u_2)
+\left(\frac{d}{N}-\frac{d^2}{sN^2}\right)\big(\delta_0(u_1)g(u_2)+\delta_0(u_2)g(u_1)\big)\\
&\quad+\frac{d^2}{sN^2}\,\delta_0(u_2-u_1)\,g(u_1),
\end{aligned}
\]
and hence
\[
\Psi_{\mathbf{v}_1}(t_1,t_2) = 1+\frac{d}{N}\left(e^{-\frac{1}{2}t_1^2}+e^{-\frac{1}{2}t_2^2}-2\right)
+\frac{d^2}{sN^2}\left(e^{-\frac{1}{2}(t_1+t_2)^2}-e^{-\frac{1}{2}t_1^2}-e^{-\frac{1}{2}t_2^2}+1\right).
\]

Therefore, by (C.4),
\[
\lim_{N\to\infty}\Psi_{y_1y_2}(t_1,t_2) = \lim_{N\to\infty}\left(\Psi_{\mathbf{v}_1}(t_1,t_2)\right)^N = e^{d\left(e^{-\frac{1}{2}t_1^2}+e^{-\frac{1}{2}t_2^2}-2\right)}.
\]

Similarly, it can be shown for any $m\le n$ that
\[
\lim_{N\to\infty}\Psi_{y_1\cdots y_m}(t_1,\ldots,t_m) = \lim_{N\to\infty}\left(\Psi_{\mathbf{v}_1}(t_1,\ldots,t_m)\right)^N = e^{d\left(\sum_{i=1}^{m} e^{-\frac{1}{2}t_i^2}-m\right)}.
\]
Therefore, (C.6) is also verified, which establishes (C.3).

We conclude that in the limit of large $N$, for each $j\in[N]$, the indicator variables $\{I_{i,j}\}_{i=1,j=1}^{n,N}$ are independent along $i$. Therefore, $S_j = \sum_{i=1}^{n} I_{i,j}$ converges in distribution to a Binomial random variable $S_B$ [64], where $S_B\sim\text{Binomial}\left(n, P_{\epsilon,d}(x_j)\right)$.

C.3 Proof of Lemma 4.2.2

To simplify the notation, we drop the subscripts of $[\mathbf{A}_2]_{\mathrm{FA}}$ and $\mathbf{x}_{\mathrm{FA}}$, and let $\mathbf{A}$ represent the submatrix formed by the columns of $\mathbf{A}_2$ at the indices FA, and $\mathbf{x}$ represent the coordinates of $\mathbf{x}$ at the indices FA, where FA represents the false alarms defined in Section 4.2.2.2. Define a sequence of vectors $\mathbf{u}_j = \mathbf{A}_{*j}x_j$, $j\in[|\mathrm{FA}|]$.

We notice that $\mathbf{v}_{\mathrm{FA}}$ is a sum of i.i.d. random vectors, and the components in each vector are uncorrelated. That is,
\[
\mathbf{v}_{\mathrm{FA}} = \sum_{j=1}^{|\mathrm{FA}|}\mathbf{u}_j, \quad\text{and}\quad
\mathbb{E}\big[[\mathbf{u}_j]_t[\mathbf{u}_j]_s\big] = \mathbb{E}\big[(x_jA_{t,j})(x_jA_{s,j})\big] =
\begin{cases}
0, & \text{if } s\neq t,\\
\mathbb{E}\big[[\mathbf{x}_{\mathrm{FA}}]_j^2\big]/N, & \text{if } s=t.
\end{cases}
\]
The proof is completed by applying Lemma C.3.1.

Lemma C.3.1. [64, Multivariate Central Limit Theorem] Let $(\mathbf{X}_n)_{n\in\mathbb{N}}$ be i.i.d. random vectors on $\mathbb{R}^d$ with $\mathbb{E}[[\mathbf{X}_n]_i]=0$ and $\mathbb{E}[[\mathbf{X}_n]_i[\mathbf{X}_n]_j]=C_{i,j}$, $\forall i,j\in[d]$. Let $\mathbf{S}_n^* = \frac{\mathbf{X}_1+\cdots+\mathbf{X}_n}{\sqrt{n}}$. Then $P_{\mathbf{S}_n^*}\to\mathcal{N}(\mathbf{0},C)$ in distribution.

By Lemma C.3.1, the distribution of the vector $\mathbf{v}_{\mathrm{FA}}$ converges to $\mathcal{N}(\mathbf{0},\mathbf{C}_{vv})$, where $\mathbf{C}_{vv}$ is a diagonal covariance matrix with $\mathbb{E}[\|\mathbf{x}_{\mathrm{FA}}\|^2]/N$ on its diagonal. Therefore, $\mathbf{v}_{\mathrm{FA}}$ converges to an i.i.d. Gaussian random vector in distribution.
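A small Monte Carlo experiment illustrates this conclusion: summing columns of a matrix with i.i.d. $\mathcal{N}(0,1/N)$ entries, weighted by fixed coefficients, produces a vector whose sample covariance is approximately diagonal with entries close to $\|\mathbf{x}_{\mathrm{FA}}\|^2/N$. The sketch below uses assumed toy parameters and is not thesis code.

import numpy as np

# Monte Carlo check that v_FA = sum_j A_{*j} x_j has an (approximately) diagonal
# covariance when A has i.i.d. N(0, 1/N) entries (a sketch with toy parameters).
rng = np.random.default_rng(2)
N, n_FA, dim, trials = 100, 30, 4, 5000   # dim = number of retained measurements
x = rng.normal(size=n_FA)                 # fixed coefficients at the false alarms
samples = np.empty((trials, dim))
for s in range(trials):
    A = rng.normal(scale=1 / np.sqrt(N), size=(dim, n_FA))
    samples[s] = A @ x
C = np.cov(samples.T)
print(np.diag(C))                                 # each close to ||x||^2 / N
print(np.max(np.abs(C - np.diag(np.diag(C)))))    # off-diagonal entries near 0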


Appendix D

Chapter 5 Appendix

D.1 Proof of Proposition 5.3.1

The gradient of $\mathcal{D}(\cdot)$ defined in (5.17) is $\mathrm{Re}\{\mathbf{J}_{\mathcal{A}}^{\mathsf{H}}\mathbf{z}\}$, where $\mathbf{J}_{\mathcal{A}}$ is the Jacobian matrix of the nonlinear operator $\mathcal{A}(\cdot)$ defined in (5.15). Recall that $\mathbf{L} = (\mathbf{I}-\mathbf{G}\,\mathrm{diag}(\mathbf{x}))$ and $\mathbf{u} = \mathbf{L}^{-1}\mathbf{u}_{\mathrm{in}}$, hence both $\mathbf{L}$ and $\mathbf{u}$ are functions of $\mathbf{x}$. We omit such dependencies in our notation for brevity.

Following the chain rule of differentiation,
\[
\frac{\partial \mathcal{A}_b}{\partial x_a} = \frac{\partial}{\partial x_a}\sum_{i=1}^{N} H_{b,i}\,x_i\,u_i
= H_{b,a}\,u_a + \sum_{i=1}^{N}\left[\frac{\partial u_i}{\partial x_a}\right]H_{b,i}\,x_i.
\]

Using the definition $\mathbf{z} = \mathcal{A}(\mathbf{x})-\mathbf{y}$ and summing over $b=1,\ldots,n$,
\[
[\nabla\mathcal{D}(\mathbf{x})]_a = \sum_{b=1}^{n}\overline{\left[\frac{\partial\mathcal{A}_b}{\partial x_a}\right]}\,z_b
= \bar{u}_a\sum_{b=1}^{n}\bar{H}_{b,a}\,z_b + \sum_{i=1}^{N}\overline{\left[\frac{\partial u_i}{\partial x_a}\right]}\,\bar{x}_i\sum_{b=1}^{n}\bar{H}_{b,i}\,z_b
= \bar{u}_a\left[\mathbf{H}^{\mathsf{H}}\mathbf{z}\right]_a + \sum_{i=1}^{N}\overline{\left[\frac{\partial u_i}{\partial x_a}\right]}\,\bar{x}_i\left[\mathbf{H}^{\mathsf{H}}\mathbf{z}\right]_i, \tag{D.1}
\]

where $\bar{a}$ denotes the complex conjugate of $a\in\mathbb{C}$. Labeling the two terms in (D.1) as $T_1$ and $T_2$, we have
\[
T_1 = \left[\mathrm{diag}(\mathbf{u})^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}\right]_a, \tag{D.2}
\]
and
\[
\begin{aligned}
T_2 &\overset{(a)}{=} (\mathbf{u}_{\mathrm{in}})^{\mathsf{H}}\left[\frac{\partial\mathbf{L}^{-1}}{\partial x_a}\right]^{\mathsf{H}}\mathrm{diag}(\mathbf{x})^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}\\
&\overset{(b)}{=} -(\mathbf{u}_{\mathrm{in}})^{\mathsf{H}}\mathbf{L}^{-\mathsf{H}}\left[\frac{\partial\mathbf{L}}{\partial x_a}\right]^{\mathsf{H}}\mathbf{L}^{-\mathsf{H}}\mathrm{diag}(\mathbf{x})^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}\\
&\overset{(c)}{=} -\mathbf{u}^{\mathsf{H}}\left[\frac{\partial\mathbf{L}}{\partial x_a}\right]^{\mathsf{H}}\mathbf{v}\\
&\overset{(d)}{=} \left[\mathrm{diag}(\mathbf{u})^{\mathsf{H}}\mathbf{G}^{\mathsf{H}}\mathbf{v}\right]_a. \tag{D.3}
\end{aligned}
\]

In the above, step (a) holds by plugging in $u_i = \left[\mathbf{L}^{-1}\mathbf{u}_{\mathrm{in}}\right]_i$. Step (b) uses the identity
\[
\frac{\partial\mathbf{L}^{-1}}{\partial x_a} = -\mathbf{L}^{-1}\frac{\partial\mathbf{L}}{\partial x_a}\mathbf{L}^{-1}, \tag{D.4}
\]
which follows by differentiating both sides of $\mathbf{L}\mathbf{L}^{-1}=\mathbf{I}$,
\[
\frac{\partial\mathbf{L}}{\partial x_a}\mathbf{L}^{-1} + \mathbf{L}\frac{\partial\mathbf{L}^{-1}}{\partial x_a} = \mathbf{0}.
\]
From step (b) to step (c), we used the fact that $\mathbf{u}=\mathbf{L}^{-1}\mathbf{u}_{\mathrm{in}}$ and defined $\mathbf{v} := \mathbf{L}^{-\mathsf{H}}\mathrm{diag}(\mathbf{x})^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}$, which matches (5.25). Finally, step (d) follows by plugging in $\mathbf{L}=\mathbf{I}-\mathbf{G}\,\mathrm{diag}(\mathbf{x})$. Combining (D.1), (D.2), and (D.3), we have obtained the expression in (5.24).
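The gradient expression derived above is easy to validate numerically against finite differences of a data-fidelity cost of the form $\mathcal{D}(\mathbf{x})=\frac{1}{2}\|\mathcal{A}(\mathbf{x})-\mathbf{y}\|^2$ with $\mathcal{A}(\mathbf{x})=\mathbf{H}\,\mathrm{diag}(\mathbf{x})\mathbf{u}$ and $\mathbf{u}=\mathbf{L}^{-1}\mathbf{u}_{\mathrm{in}}$. The Python sketch below is not thesis code: the matrices $\mathbf{G}$, $\mathbf{H}$ and the fields are random stand-ins, and the quadratic data-fidelity form is an assumption. It checks that $\mathrm{Re}\{\mathrm{diag}(\mathbf{u})^{\mathsf{H}}(\mathbf{H}^{\mathsf{H}}\mathbf{z}+\mathbf{G}^{\mathsf{H}}\mathbf{v})\}$ agrees with a central finite-difference approximation.

import numpy as np

# Finite-difference check of the gradient derived in (D.1)-(D.3) for a toy model:
#   A(x) = H diag(x) u,  u = (I - G diag(x))^{-1} u_in,  D(x) = 0.5*||A(x) - y||^2,
#   grad D(x) = Re{ conj(u) * (H^H z + G^H v) },  v = L^{-H} diag(x)^H H^H z.
rng = np.random.default_rng(0)
N, n = 8, 6                                    # object voxels, measurements
G = 0.1 * (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N)))   # scaled so L stays well conditioned
H = rng.normal(size=(n, N)) + 1j * rng.normal(size=(n, N))
u_in = rng.normal(size=N) + 1j * rng.normal(size=N)
y = rng.normal(size=n) + 1j * rng.normal(size=n)
x = rng.normal(size=N)                         # real permittivity contrast

def forward(x):
    L = np.eye(N) - G @ np.diag(x)
    u = np.linalg.solve(L, u_in)               # total field u = L^{-1} u_in
    return H @ (x * u), u, L

def cost(x):
    Ax, _, _ = forward(x)
    return 0.5 * np.linalg.norm(Ax - y) ** 2

def gradient(x):
    Ax, u, L = forward(x)
    z = Ax - y
    Hz = H.conj().T @ z
    v = np.linalg.solve(L.conj().T, np.conj(x) * Hz)   # v = L^{-H} diag(x)^H H^H z
    return np.real(np.conj(u) * (Hz + G.conj().T @ v))

g = gradient(x)
eps = 1e-6
g_fd = np.array([(cost(x + eps * e) - cost(x - eps * e)) / (2 * eps) for e in np.eye(N)])
print(np.max(np.abs(g - g_fd)))                # small (finite-difference accuracy, ~1e-8)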

D.2 Proof of Proposition 5.3.2

The derivation is similar to the scalar field case derived in Appendix D.1. Let $\mathbf{H} = [\mathbf{H}^{(1)}\,|\,\mathbf{H}^{(2)}\,|\,\mathbf{H}^{(3)}]$, where $\mathbf{H}^{(i)}\in\mathbb{C}^{n\times N}$ for $i=1,2,3$. Following the chain rule of differentiation, we have
\[
\frac{\partial\mathcal{A}_b}{\partial x_a} = \sum_{k=1}^{3}H^{(k)}_{b,a}\,u^{(k)}_a + \sum_{k=1}^{3}\sum_{i=1}^{N}\left[\frac{\partial u^{(k)}_i}{\partial x_a}\right]H^{(k)}_{b,i}\,x_i.
\]

Using the definition $\mathbf{z} = \mathcal{A}(\mathbf{x})-\mathbf{y}$ and summing over $b=1,\ldots,n$,
\[
[\nabla\mathcal{D}(\mathbf{x})]_a = \sum_{b=1}^{n}\overline{\left[\frac{\partial\mathcal{A}_b}{\partial x_a}\right]}\,z_b
= \sum_{k=1}^{3}\bar{u}^{(k)}_a\left[(\mathbf{H}^{(k)})^{\mathsf{H}}\mathbf{z}\right]_a
+ \sum_{k=1}^{3}\sum_{i=1}^{N}\overline{\left[\frac{\partial u^{(k)}_i}{\partial x_a}\right]}\,\bar{x}_i\left[(\mathbf{H}^{(k)})^{\mathsf{H}}\mathbf{z}\right]_i. \tag{D.5}
\]

Labeling the two terms in (D.5) as $T_1$ and $T_2$, we have
\[
T_1 = \sum_{k=1}^{3}\left[\mathrm{diag}(\mathbf{u}^{(k)})^{\mathsf{H}}(\mathbf{H}^{(k)})^{\mathsf{H}}\mathbf{z}\right]_a, \tag{D.6}
\]
and
\[
\begin{aligned}
T_2 &\overset{(a)}{=} \sum_{k=1}^{3}\left(\left[\frac{\partial\mathbf{L}^{-1}}{\partial x_a}\mathbf{u}_{\mathrm{in}}\right]^{(k)}\right)^{\mathsf{H}}\mathrm{diag}(\mathbf{x})^{\mathsf{H}}(\mathbf{H}^{(k)})^{\mathsf{H}}\mathbf{z}\\
&\overset{(b)}{=} \sum_{k=1}^{3}\left(\left[-\mathbf{L}^{-1}\frac{\partial\mathbf{L}}{\partial x_a}\mathbf{u}\right]^{(k)}\right)^{\mathsf{H}}\mathrm{diag}(\mathbf{x})^{\mathsf{H}}(\mathbf{H}^{(k)})^{\mathsf{H}}\mathbf{z}\\
&\overset{(c)}{=} \sum_{k=1}^{3}\left(-\mathbf{u}^{\mathsf{H}}\left[\frac{\partial\mathbf{L}}{\partial x_a}\right]^{\mathsf{H}}\mathbf{v}\right)^{(k)}\\
&\overset{(d)}{=} \sum_{k=1}^{3}\left[\mathrm{diag}(\mathbf{u}^{(k)})^{\mathsf{H}}\mathbf{G}^{\mathsf{H}}\left[(k^2\mathbf{I}+\mathbf{D}^{\mathsf{H}})\mathbf{v}\right]^{(k)}\right]_a. \tag{D.7}
\end{aligned}
\]

In the above, step (a) follows from $u^{(k)}_i = \left[\mathbf{L}^{-1}\mathbf{u}_{\mathrm{in}}\right]^{(k)}_i$. Step (b) follows from (D.4). In step (c), we defined $\mathbf{v} := \mathbf{L}^{-\mathsf{H}}(\mathbf{I}_3\otimes\mathrm{diag}(\mathbf{x}))^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}$, which matches (5.27). Finally, step (d) follows by plugging in the definition of $\mathbf{L}$ in (5.14). Combining (D.5), (D.6), and (D.7), we have obtained the expression in (5.28).

D.3 Convergence Analysis

D.3.1 Definitions and Standard Results

We first introduce some definitions and notation, as well as several standard results that will

be used for the convergence analysis of relaxed FISTA. The proofs for the standard results are

provided for completeness.

Definition D.3.1. [101, Definition 8.3 (subgradients)] Consider a function $f:\mathbb{R}^m\to(-\infty,\infty]$ and a point $\bar{\mathbf{x}}\in\mathbb{R}^m$ where $f(\bar{\mathbf{x}})$ is finite. For a vector $\mathbf{d}\in\mathbb{R}^m$, we say that

1. $\mathbf{d}$ is a regular subgradient of $f$ at $\bar{\mathbf{x}}$, written as $\mathbf{d}\in\hat{\partial}f(\bar{\mathbf{x}})$, if
\[
\liminf_{\substack{\mathbf{x}\to\bar{\mathbf{x}}\\ \mathbf{x}\neq\bar{\mathbf{x}}}}\ \frac{f(\mathbf{x})-f(\bar{\mathbf{x}})-\langle\mathbf{d},\mathbf{x}-\bar{\mathbf{x}}\rangle}{\|\mathbf{x}-\bar{\mathbf{x}}\|}\ \ge\ 0.
\]

2. $\mathbf{d}$ is a (general) subgradient of $f$ at $\bar{\mathbf{x}}$, written as $\mathbf{d}\in\partial f(\bar{\mathbf{x}})$, if there exist $\mathbf{x}^k\to\bar{\mathbf{x}}$, $f(\mathbf{x}^k)\to f(\bar{\mathbf{x}})$, and $\mathbf{d}^k\in\hat{\partial}f(\mathbf{x}^k)$ with $\mathbf{d}^k\to\mathbf{d}$.

Remark D.3.1. The sets $\hat{\partial}f(\bar{\mathbf{x}})$ and $\partial f(\bar{\mathbf{x}})$ are also known as the Fréchet subdifferential and the limiting subdifferential of $f$ at $\bar{\mathbf{x}}$, respectively, and we have $\hat{\partial}f(\bar{\mathbf{x}})\subset\partial f(\bar{\mathbf{x}})$ for all $\bar{\mathbf{x}}\in\mathbb{R}^m$ such that $f(\bar{\mathbf{x}})<\infty$ [101, Theorem 8.6]. Note that if $f$ is further assumed to be convex, then
\[
\hat{\partial}f(\bar{\mathbf{x}}) = \partial f(\bar{\mathbf{x}}) = \left\{\mathbf{d}\in\mathbb{R}^m \,\middle|\, f(\mathbf{x})\ge f(\bar{\mathbf{x}})+\langle\mathbf{d},\mathbf{x}-\bar{\mathbf{x}}\rangle \text{ for all } \mathbf{x}\in\mathbb{R}^m\right\} \tag{D.8}
\]
for all $\bar{\mathbf{x}}\in\mathbb{R}^m$ such that $f(\bar{\mathbf{x}})<\infty$ [101, Proposition 8.12].

The gradient mapping defined below is frequently used in the analysis of the class of proximal gradient methods.

Definition D.3.2 (gradient mapping). Consider the composite cost function $\mathcal{F}$ defined in (5.16), where $\mathcal{R}$ is convex and lower semi-continuous, and $\mathcal{D}$ is differentiable. The gradient mapping at $\mathbf{x}$ is defined as
\[
\mathcal{G}_{\gamma}(\mathbf{x}) := \frac{\mathbf{x}-\mathrm{Prox}_{\gamma\mathcal{R}}\left(\mathbf{x}-\gamma\nabla\mathcal{D}(\mathbf{x})\right)}{\gamma}. \tag{D.9}
\]

Remark D.3.2. Using Lemma D.3.4, we see from (D.9) that
\[
\mathcal{G}_{\gamma}(\mathbf{x}) \in \nabla\mathcal{D}(\mathbf{x}) + \partial\mathcal{R}\left(\mathbf{x}-\gamma\mathcal{G}_{\gamma}(\mathbf{x})\right).
\]
Hence, $\mathcal{G}_{\gamma}(\mathbf{x})=\mathbf{0}$ implies
\[
\mathbf{0} \in \nabla\mathcal{D}(\mathbf{x}) + \partial\mathcal{R}(\mathbf{x}) \overset{(a)}{=} \partial\mathcal{F}(\mathbf{x}),
\]
which means $\mathbf{x}$ is a critical point of $\mathcal{F}$. In the above, step (a) is proved in Lemma D.3.1.
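To make Definition D.3.2 and Remark D.3.2 concrete, the sketch below evaluates the gradient mapping for a simple convex stand-in problem (a least-squares data term with an $\ell_1$ regularizer, whose proximal operator is soft-thresholding); none of this is thesis code. At a random point $\mathcal{G}_\gamma$ is nonzero, while after running the proximal-gradient iteration to convergence it is numerically zero, consistent with the critical-point characterization.

import numpy as np

# Gradient mapping G_gamma(x) = (x - Prox_{gamma R}(x - gamma*grad D(x))) / gamma,
# illustrated with stand-in choices D(x) = 0.5*||Ax - y||^2 and R(x) = lam*||x||_1.
rng = np.random.default_rng(3)
m, N, lam = 20, 10, 0.1
A = rng.normal(size=(m, N))
y = rng.normal(size=m)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2          # step size <= 1/L for this toy problem

grad_D = lambda x: A.T @ (A @ x - y)
prox_R = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # soft-threshold

def grad_map(x):
    return (x - prox_R(x - gamma * grad_D(x))) / gamma

x = rng.normal(size=N)
print(np.linalg.norm(grad_map(x)))               # generally nonzero at a random point

for _ in range(5000):                            # plain proximal-gradient iterations
    x = prox_R(x - gamma * grad_D(x))
print(np.linalg.norm(grad_map(x)))               # ~0 at a minimizer (critical point)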

The following lemma states a result about the linearity of the subgradient in a specific case.

Lemma D.3.1. Let $g:\mathbb{R}^m\to(-\infty,\infty]$ be differentiable and $f:\mathbb{R}^m\to(-\infty,\infty]$ be finite at $\mathbf{x}$. Define $F = g+f$. Then $\partial F(\mathbf{x}) = \nabla g(\mathbf{x}) + \partial f(\mathbf{x})$, where $\partial$ is the (general) subdifferential defined in Definition D.3.1.


Proof. First, we show
\[
\nabla g(\mathbf{x}) + \partial f(\mathbf{x}) \subset \partial F(\mathbf{x}). \tag{D.10}
\]
Take $\mathbf{d}\in\nabla g(\mathbf{x})+\partial f(\mathbf{x})$; then $\mathbf{d}-\nabla g(\mathbf{x})\in\partial f(\mathbf{x})$. By Definition D.3.1, there exists a sequence $\{\mathbf{x}^k\}_{k\in\mathbb{N}}$ that satisfies $\mathbf{x}^k\to\mathbf{x}$ and $f(\mathbf{x}^k)\to f(\mathbf{x})$ as $k\to\infty$, and a sequence $\{\mathbf{h}^k\}_{k\in\mathbb{N}}$ that satisfies $\mathbf{h}^k\in\hat{\partial}f(\mathbf{x}^k)$ for all $k\in\mathbb{N}$ and $\mathbf{h}^k\to\mathbf{d}-\nabla g(\mathbf{x})$ as $k\to\infty$.

Note that $\lim_{k\to\infty}F(\mathbf{x}^k) = \lim_{k\to\infty}\big(g(\mathbf{x}^k)+f(\mathbf{x}^k)\big) = g(\mathbf{x})+f(\mathbf{x}) = F(\mathbf{x})$, since $g$ is continuous and $f(\mathbf{x}^k)\to f(\mathbf{x})$ by the choice of the sequence $\{\mathbf{x}^k\}_{k\in\mathbb{N}}$. Hence, if we can show $\mathbf{h}^k+\nabla g(\mathbf{x}^k)\in\hat{\partial}F(\mathbf{x}^k)$ for all $k\in\mathbb{N}$, then we can conclude that $\lim_{k\to\infty}\mathbf{h}^k+\nabla g(\mathbf{x}^k) = \mathbf{d}\in\partial F(\mathbf{x})$. We now show $\mathbf{h}^k+\nabla g(\mathbf{x}^k)\in\hat{\partial}F(\mathbf{x}^k)$:
\[
\begin{aligned}
&\liminf_{\substack{\mathbf{x}\to\mathbf{x}^k\\ \mathbf{x}\neq\mathbf{x}^k}}\ \frac{F(\mathbf{x})-F(\mathbf{x}^k)-\langle\mathbf{h}^k+\nabla g(\mathbf{x}^k),\mathbf{x}-\mathbf{x}^k\rangle}{\|\mathbf{x}-\mathbf{x}^k\|}\\
&= \liminf_{\substack{\mathbf{x}\to\mathbf{x}^k\\ \mathbf{x}\neq\mathbf{x}^k}}\ \frac{g(\mathbf{x})-g(\mathbf{x}^k)-\langle\nabla g(\mathbf{x}^k),\mathbf{x}-\mathbf{x}^k\rangle+f(\mathbf{x})-f(\mathbf{x}^k)-\langle\mathbf{h}^k,\mathbf{x}-\mathbf{x}^k\rangle}{\|\mathbf{x}-\mathbf{x}^k\|}\\
&\ge \liminf_{\substack{\mathbf{x}\to\mathbf{x}^k\\ \mathbf{x}\neq\mathbf{x}^k}}\ \frac{g(\mathbf{x})-g(\mathbf{x}^k)-\langle\nabla g(\mathbf{x}^k),\mathbf{x}-\mathbf{x}^k\rangle}{\|\mathbf{x}-\mathbf{x}^k\|}
+\liminf_{\substack{\mathbf{x}\to\mathbf{x}^k\\ \mathbf{x}\neq\mathbf{x}^k}}\ \frac{f(\mathbf{x})-f(\mathbf{x}^k)-\langle\mathbf{h}^k,\mathbf{x}-\mathbf{x}^k\rangle}{\|\mathbf{x}-\mathbf{x}^k\|}
\ \ge\ 0,
\end{aligned}
\]
where the last inequality follows because $\hat{\partial}g(\mathbf{x}^k)=\{\nabla g(\mathbf{x}^k)\}$ and $\mathbf{h}^k\in\hat{\partial}f(\mathbf{x}^k)$.

Next, we show $\partial F(\mathbf{x})\subset\nabla g(\mathbf{x})+\partial f(\mathbf{x})$. Take $\mathbf{d}\in\partial F(\mathbf{x})$. Note that $f(\mathbf{x}) = (-g(\mathbf{x}))+F(\mathbf{x})$; hence, $\nabla(-g(\mathbf{x}))+\mathbf{d} \in \nabla(-g(\mathbf{x}))+\partial F(\mathbf{x}) \subset \partial f(\mathbf{x})$, where the last inclusion follows by applying the result established in (D.10). It follows that $\mathbf{d}\in-\nabla(-g(\mathbf{x}))+\partial f(\mathbf{x}) = \nabla g(\mathbf{x})+\partial f(\mathbf{x})$.

The following lemma states that the subgradient of a convex function is a monotone operator.

Lemma D.3.2. Let $f:\mathbb{R}^m\to(-\infty,\infty]$ be convex. Then $\partial f$ is a monotone operator, namely for all $\mathbf{d}_1\in\partial f(\mathbf{x}_1)$ and $\mathbf{d}_2\in\partial f(\mathbf{x}_2)$, we have $\langle\mathbf{d}_1-\mathbf{d}_2,\mathbf{x}_1-\mathbf{x}_2\rangle\ge 0$.

Proof. Note that $f$ is assumed to be convex in this result, hence the definition of the subgradient reduces to the simple form in (D.8). Using the definition, we have
\[
f(\mathbf{x}) \ge f(\mathbf{x}_1) + \langle\mathbf{d}_1,\mathbf{x}-\mathbf{x}_1\rangle, \quad \forall\mathbf{x}\in\mathbb{R}^m,
\qquad
f(\mathbf{y}) \ge f(\mathbf{x}_2) + \langle\mathbf{d}_2,\mathbf{y}-\mathbf{x}_2\rangle, \quad \forall\mathbf{y}\in\mathbb{R}^m.
\]
In the above, let $\mathbf{x}=\mathbf{x}_2$ and $\mathbf{y}=\mathbf{x}_1$. Adding the two inequalities, we have
\[
f(\mathbf{x}_1)+f(\mathbf{x}_2) \ge f(\mathbf{x}_1)+f(\mathbf{x}_2)+\langle\mathbf{d}_1-\mathbf{d}_2,\mathbf{x}_2-\mathbf{x}_1\rangle.
\]
Hence, $\langle\mathbf{d}_1-\mathbf{d}_2,\mathbf{x}_1-\mathbf{x}_2\rangle\ge 0$.

The following lemma states that $(\mathbf{I}+\gamma\partial f)^{-1}$ is single-valued, hence is a well-defined function.

Lemma D.3.3. Let $f:\mathbb{R}^m\to(-\infty,\infty]$ be convex. Then $(\mathbf{I}+\gamma\partial f)^{-1}$ is single-valued, meaning that for any fixed $\mathbf{y}$, if $\mathbf{x}_1,\mathbf{x}_2\in(\mathbf{I}+\gamma\partial f)^{-1}(\mathbf{y})$, then $\mathbf{x}_1=\mathbf{x}_2$.

Proof. By $\mathbf{x}_1,\mathbf{x}_2\in(\mathbf{I}+\gamma\partial f)^{-1}(\mathbf{y})$, we have
\[
\frac{\mathbf{y}-\mathbf{x}_1}{\gamma}\in\partial f(\mathbf{x}_1) \quad\text{and}\quad \frac{\mathbf{y}-\mathbf{x}_2}{\gamma}\in\partial f(\mathbf{x}_2).
\]
Using Lemma D.3.2,
\[
\left\langle\frac{\mathbf{y}-\mathbf{x}_1}{\gamma}-\frac{\mathbf{y}-\mathbf{x}_2}{\gamma},\,\mathbf{x}_1-\mathbf{x}_2\right\rangle = -\frac{1}{\gamma}\|\mathbf{x}_1-\mathbf{x}_2\|^2 \ge 0.
\]
Hence, $\mathbf{x}_1=\mathbf{x}_2$.

The following lemma relates the proximal operator of a convex function to its subgradient.

Lemma D.3.4. For $f$ defined in Definition 5.3.1, we have
\[
\mathrm{Prox}_{\gamma f}(\mathbf{y}) = (\mathbf{I}+\gamma\partial f)^{-1}(\mathbf{y}), \quad \forall\mathbf{y}\in\mathbb{R}^m,
\]
where $\mathbf{I}$ is the identity mapping.

Proof. Suppose $\hat{\mathbf{x}} = \arg\min_{\mathbf{x}\in\mathbb{R}^m} f(\mathbf{x})+\frac{1}{2\gamma}\|\mathbf{y}-\mathbf{x}\|^2$. Then
\[
\mathbf{0} \in \partial_{\mathbf{x}}\left(f(\mathbf{x})+\frac{1}{2\gamma}\|\mathbf{y}-\mathbf{x}\|^2\right)\bigg|_{\mathbf{x}=\hat{\mathbf{x}}} = \partial f(\hat{\mathbf{x}})+\frac{1}{\gamma}(\hat{\mathbf{x}}-\mathbf{y}),
\]
where the last equality follows by Lemma D.3.1. Reorganizing the terms in the above, we have $\mathbf{y}\in(\mathbf{I}+\gamma\partial f)(\hat{\mathbf{x}})$. It follows that $\hat{\mathbf{x}} = (\mathbf{I}+\gamma\partial f)^{-1}(\mathbf{y})$, where the uniqueness, meaning that it is an equality instead of an inclusion, follows by Lemma D.3.3.

The following lemma states that the proximal operator of a convex function is firmly nonexpansive, hence Lipschitz with constant 1.

Lemma D.3.5. Let $f$ be defined in Definition 5.3.1. Then $\mathrm{Prox}_{\gamma f}$ is firmly nonexpansive, namely for all $\mathbf{y}_1,\mathbf{y}_2\in\mathbb{R}^m$, we have
\[
\|\mathrm{Prox}_{\gamma f}(\mathbf{y}_1)-\mathrm{Prox}_{\gamma f}(\mathbf{y}_2)\|^2 \le \langle\mathrm{Prox}_{\gamma f}(\mathbf{y}_1)-\mathrm{Prox}_{\gamma f}(\mathbf{y}_2),\,\mathbf{y}_1-\mathbf{y}_2\rangle,
\]
which implies, by the Cauchy-Schwarz inequality, that $\mathrm{Prox}_{\gamma f}$ is Lipschitz with constant 1, namely for all $\mathbf{y}_1,\mathbf{y}_2\in\mathbb{R}^m$, we have
\[
\|\mathrm{Prox}_{\gamma f}(\mathbf{y}_1)-\mathrm{Prox}_{\gamma f}(\mathbf{y}_2)\| \le \|\mathbf{y}_1-\mathbf{y}_2\|.
\]

Proof. Let $\mathbf{x}_1 = \mathrm{Prox}_{\gamma f}(\mathbf{y}_1)$ and $\mathbf{x}_2 = \mathrm{Prox}_{\gamma f}(\mathbf{y}_2)$. Then by Lemma D.3.4, we have
\[
\frac{\mathbf{y}_1-\mathbf{x}_1}{\gamma}\in\partial f(\mathbf{x}_1) \quad\text{and}\quad \frac{\mathbf{y}_2-\mathbf{x}_2}{\gamma}\in\partial f(\mathbf{x}_2).
\]
By Lemma D.3.2,
\[
\left\langle\frac{\mathbf{y}_1-\mathbf{x}_1}{\gamma}-\frac{\mathbf{y}_2-\mathbf{x}_2}{\gamma},\,\mathbf{x}_1-\mathbf{x}_2\right\rangle \ge 0.
\]
Hence,
\[
\|\mathbf{x}_1-\mathbf{x}_2\|^2 \le \langle\mathbf{x}_1-\mathbf{x}_2,\,\mathbf{y}_1-\mathbf{y}_2\rangle \le \|\mathbf{x}_1-\mathbf{x}_2\|\,\|\mathbf{y}_1-\mathbf{y}_2\|.
\]
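Lemma D.3.5 can be spot-checked numerically for a concrete convex function: with $f=\lambda\|\cdot\|_1$, whose proximal operator is entrywise soft-thresholding, randomly drawn pairs $(\mathbf{y}_1,\mathbf{y}_2)$ always satisfy the firm-nonexpansiveness inequality. A short Python sketch (illustrative parameters only, not thesis code):

import numpy as np

# Spot-check of firm nonexpansiveness for Prox_{gamma f} with f = lam*||.||_1,
# whose proximal operator is entrywise soft-thresholding.
rng = np.random.default_rng(4)
lam, gamma, m = 0.7, 0.3, 6
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

worst = np.inf
for _ in range(10000):
    y1, y2 = rng.normal(size=m), rng.normal(size=m)
    p1, p2 = prox(y1), prox(y2)
    # firm nonexpansiveness: <p1 - p2, y1 - y2> - ||p1 - p2||^2 >= 0
    worst = min(worst, np.dot(p1 - p2, y1 - y2) - np.dot(p1 - p2, p1 - p2))
print(worst)                                     # nonnegative (up to rounding)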

D.3.2 Convergence of Relaxed FISTA

Although in this work we only use relaxed FISTA to solve the composite optimization problem for diffractive imaging, with the specific definitions of $\mathcal{D}$ and $\mathcal{R}$ in (5.17) and (5.18), respectively, our convergence analysis for relaxed FISTA applies to more general problems, as stated in the following proposition.

Proposition D.3.1. Consider a composite cost function of the form (5.16), where $\mathcal{D}$ has Lipschitz gradient with Lipschitz constant $L$ and $\mathcal{R}$ is proper, convex, and lower semi-continuous. Let $\{\mathbf{x}^t\}_{t\ge 0}$ be generated by relaxed FISTA (5.21)–(5.23). Choose $0<\gamma\le\frac{1-\alpha^2}{2L}$ for any fixed $\alpha\in[0,1)$. Then relaxed FISTA achieves critical points of the cost function $\mathcal{F}$ in the sense that
\[
\lim_{t\to\infty}\|\mathcal{G}_{\gamma}(\mathbf{x}^t)\| = 0. \tag{D.11}
\]
Moreover, if the sequence $\{\mathbf{x}^t\}_{t\ge 0}$ is constrained in a bounded set, then every accumulation point of $\{\mathbf{x}^t\}_{t\ge 0}$ is a critical point of $\mathcal{F}$, namely if $\hat{\mathbf{x}}$ is an accumulation point of $\{\mathbf{x}^t\}_{t\ge 0}$, then $\mathbf{0}\in\partial\mathcal{F}(\hat{\mathbf{x}})$.

Proof. First, we show that the condition that $\mathcal{D}$ has Lipschitz gradient implies that for all $\mathbf{x},\mathbf{y}\in\mathbb{R}^N$,
\[
\mathcal{D}(\mathbf{y})-\mathcal{D}(\mathbf{x})-\langle\nabla\mathcal{D}(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle \ge -\frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^2. \tag{D.12}
\]
Define a function $h:\mathbb{R}\to\mathbb{R}$ as $h(\lambda) := \mathcal{D}(\mathbf{x}+\lambda(\mathbf{y}-\mathbf{x}))$; then $h'(\lambda) = \langle\nabla\mathcal{D}(\mathbf{x}+\lambda(\mathbf{y}-\mathbf{x})),\mathbf{y}-\mathbf{x}\rangle$. Notice that $h(1)=\mathcal{D}(\mathbf{y})$ and $h(0)=\mathcal{D}(\mathbf{x})$. Using the equality $h(1) = h(0)+\int_0^1 h'(\lambda)\,d\lambda$, we have that
\[
\begin{aligned}
\mathcal{D}(\mathbf{y}) &= \mathcal{D}(\mathbf{x}) + \int_0^1\langle\nabla\mathcal{D}(\mathbf{x}+\lambda(\mathbf{y}-\mathbf{x})),\mathbf{y}-\mathbf{x}\rangle\,d\lambda\\
&= \mathcal{D}(\mathbf{x}) + \int_0^1\langle\nabla\mathcal{D}(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle\,d\lambda
+ \int_0^1\langle\nabla\mathcal{D}(\mathbf{x}+\lambda(\mathbf{y}-\mathbf{x}))-\nabla\mathcal{D}(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle\,d\lambda\\
&\overset{(a)}{\ge} \mathcal{D}(\mathbf{x}) - \int_0^1\lambda L\|\mathbf{y}-\mathbf{x}\|^2\,d\lambda + \langle\nabla\mathcal{D}(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle\\
&= \mathcal{D}(\mathbf{x}) - \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^2 + \langle\nabla\mathcal{D}(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle,
\end{aligned}
\]
where step (a) uses the Cauchy-Schwarz inequality and the Lipschitz gradient condition (D.20).

Next, by (5.21) and Lemma D.3.4, we have
\[
\frac{\mathbf{s}^t-\mathbf{x}^t}{\gamma}-\nabla\mathcal{D}(\mathbf{s}^t) \in \partial\mathcal{R}(\mathbf{x}^t). \tag{D.13}
\]
Hence, by the definition of the subgradient in Definition D.3.1 and the fact that $\mathcal{R}$ is convex, it follows that for all $\mathbf{x}\in\mathbb{R}^N$ and $k\ge 0$,
\[
\mathcal{R}(\mathbf{x}) \ge \mathcal{R}(\mathbf{x}^k) + \left\langle\frac{\mathbf{s}^k-\mathbf{x}^k}{\gamma}-\nabla\mathcal{D}(\mathbf{s}^k),\,\mathbf{x}-\mathbf{x}^k\right\rangle. \tag{D.14}
\]

Let $\mathbf{y}=\mathbf{x}^t$, $\mathbf{x}=\mathbf{x}^{t+1}$ in (D.12), and $\mathbf{x}=\mathbf{x}^t$, $k=t+1$ in (D.14). Adding (D.12) and (D.14), we have
\[
\begin{aligned}
\mathcal{F}(\mathbf{x}^{t+1})-\mathcal{F}(\mathbf{x}^t)
&\le \langle\nabla\mathcal{D}(\mathbf{x}^{t+1})-\nabla\mathcal{D}(\mathbf{s}^{t+1}),\mathbf{x}^{t+1}-\mathbf{x}^t\rangle
+ \frac{1}{\gamma}\langle\mathbf{s}^{t+1}-\mathbf{x}^{t+1},\mathbf{x}^{t+1}-\mathbf{x}^t\rangle
+ \frac{L}{2}\|\mathbf{x}^{t+1}-\mathbf{x}^t\|^2\\
&\overset{(a)}{\le} \frac{L}{2}\|\mathbf{s}^{t+1}-\mathbf{x}^{t+1}\|^2 + \frac{L}{2}\|\mathbf{x}^{t+1}-\mathbf{x}^t\|^2
+ \frac{1}{2\gamma}\|\mathbf{s}^{t+1}-\mathbf{x}^t\|^2
- \frac{1}{2\gamma}\|\mathbf{s}^{t+1}-\mathbf{x}^{t+1}\|^2
- \frac{1}{2\gamma}\|\mathbf{x}^{t+1}-\mathbf{x}^t\|^2
+ \frac{L}{2}\|\mathbf{x}^{t+1}-\mathbf{x}^t\|^2\\
&\overset{(b)}{\le} \left(\frac{1}{2\gamma}-L\right)\|\mathbf{x}^t-\mathbf{x}^{t-1}\|^2
- \left(\frac{1}{2\gamma}-L\right)\|\mathbf{x}^{t+1}-\mathbf{x}^t\|^2
- \left(\frac{1}{2\gamma}-\frac{L}{2}\right)\|\mathbf{s}^{t+1}-\mathbf{x}^{t+1}\|^2.
\end{aligned}
\]
In the above, step (a) uses the Cauchy-Schwarz inequality, Proposition D.3.2, as well as the facts that $2ab\le a^2+b^2$ and $2\langle \mathbf{a}-\mathbf{b},\mathbf{b}-\mathbf{c}\rangle=\|\mathbf{a}-\mathbf{c}\|^2-\|\mathbf{a}-\mathbf{b}\|^2-\|\mathbf{b}-\mathbf{c}\|^2$. Step (b) uses the condition in the proposition statement that $\gamma\le\frac{1-\alpha^2}{2L}$ and (5.23), which implies $\|\mathbf{s}^{t+1}-\mathbf{x}^t\|\le\alpha\frac{\theta_t-1}{\theta_{t+1}}\|\mathbf{x}^t-\mathbf{x}^{t-1}\|$, where we notice that $\frac{\theta_t-1}{\theta_{t+1}}\le 1$ by (5.22), and $\alpha<1$ by our assumption. Summing both sides from $t=0$ to $K-1$:
\[
\left(\frac{1}{2\gamma}-\frac{L}{2}\right)\sum_{t=0}^{K-1}\|\mathbf{s}^{t+1}-\mathbf{x}^{t+1}\|^2
\le \mathcal{F}(\mathbf{x}^0)-\mathcal{F}(\mathbf{x}^K)
+\left(\frac{1}{2\gamma}-L\right)\left(\|\mathbf{x}^0-\mathbf{x}^{-1}\|^2-\|\mathbf{x}^K-\mathbf{x}^{K-1}\|^2\right)
\le \mathcal{F}(\mathbf{x}^0)-\mathcal{F}^*, \tag{D.15}
\]
where $\mathcal{F}^*$ is the global minimum. The last step follows by letting $\mathbf{x}^{-1}=\mathbf{x}^0$, which satisfies (5.23) for the initialization $\mathbf{x}^1=\mathbf{x}^0$, and the fact that $\mathcal{F}^*\le\mathcal{F}(\mathbf{x}^K)$. By (D.15), we have $\lim_{K\to\infty}\sum_{t=0}^{K-1}\|\mathbf{s}^{t+1}-\mathbf{x}^{t+1}\|^2<\infty$. Note that (D.9) and (5.21) imply $\mathcal{G}_{\gamma}(\mathbf{s}^t)=\frac{\mathbf{s}^t-\mathbf{x}^t}{\gamma}$. Therefore, we have
\[
\lim_{t\to\infty}\|\mathbf{s}^t-\mathbf{x}^t\|=0 \quad\text{and}\quad \lim_{t\to\infty}\|\mathcal{G}_{\gamma}(\mathbf{s}^t)\|=0. \tag{D.16}
\]

Note that
\[
\begin{aligned}
\|\mathcal{G}_{\gamma}(\mathbf{x}^t)-\mathcal{G}_{\gamma}(\mathbf{s}^t)\|
&\overset{(a)}{\le} \frac{1}{\gamma}\|\mathbf{s}^t-\mathbf{x}^t\| + \frac{1}{\gamma}\|\mathrm{Prox}_{\gamma\mathcal{R}}(\mathbf{x}^t-\gamma\nabla\mathcal{D}(\mathbf{x}^t))-\mathrm{Prox}_{\gamma\mathcal{R}}(\mathbf{s}^t-\gamma\nabla\mathcal{D}(\mathbf{s}^t))\|\\
&\overset{(b)}{\le} \frac{1}{\gamma}\|\mathbf{s}^t-\mathbf{x}^t\| + \frac{1}{\gamma}\|\mathbf{x}^t-\gamma\nabla\mathcal{D}(\mathbf{x}^t)-\left(\mathbf{s}^t-\gamma\nabla\mathcal{D}(\mathbf{s}^t)\right)\|\\
&\overset{(c)}{\le} \frac{1}{\gamma}\|\mathbf{s}^t-\mathbf{x}^t\| + \frac{1}{\gamma}\|\mathbf{s}^t-\mathbf{x}^t\| + L\|\mathbf{s}^t-\mathbf{x}^t\|, \tag{D.17}
\end{aligned}
\]
where step (a) follows by (D.9) and the triangle inequality, step (b) follows by Lemma D.3.5, and step (c) follows by the triangle inequality and the condition that $\nabla\mathcal{D}$ is Lipschitz. Taking the limit as $t\to\infty$, we have $\lim_{t\to\infty}\|\mathcal{G}_{\gamma}(\mathbf{x}^t)-\mathcal{G}_{\gamma}(\mathbf{s}^t)\|=0$; hence, $\lim_{t\to\infty}\mathcal{G}_{\gamma}(\mathbf{x}^t)=\lim_{t\to\infty}\mathcal{G}_{\gamma}(\mathbf{s}^t)=\mathbf{0}$.

In the following, we assume $\{\mathbf{x}^t\}_{t\ge 0}$ is constrained in a bounded set, and establish that every accumulation point of $\{\mathbf{x}^t\}_{t\ge 0}$ is a critical point of the cost function $\mathcal{F}$. By (D.13), we have
\[
\frac{\mathbf{s}^t-\mathbf{x}^t}{\gamma}+\nabla\mathcal{D}(\mathbf{x}^t)-\nabla\mathcal{D}(\mathbf{s}^t) \in \nabla\mathcal{D}(\mathbf{x}^t)+\partial\mathcal{R}(\mathbf{x}^t) = \partial\mathcal{F}(\mathbf{x}^t),
\]
where the last equality follows from Lemma D.3.1. Let
\[
\mathbf{d}^t := \frac{\mathbf{s}^t-\mathbf{x}^t}{\gamma}+\nabla\mathcal{D}(\mathbf{x}^t)-\nabla\mathcal{D}(\mathbf{s}^t),
\]
hence $\mathbf{d}^t\in\partial\mathcal{F}(\mathbf{x}^t)$ for all $t\ge 0$. Notice that
\[
\|\mathbf{d}^t\| = \left\|\frac{\mathbf{s}^t-\mathbf{x}^t}{\gamma}+\nabla\mathcal{D}(\mathbf{x}^t)-\nabla\mathcal{D}(\mathbf{s}^t)\right\|
\le \frac{1}{\gamma}\|\mathbf{s}^t-\mathbf{x}^t\|+\|\nabla\mathcal{D}(\mathbf{x}^t)-\nabla\mathcal{D}(\mathbf{s}^t)\|
\le \frac{1}{\gamma}\|\mathbf{s}^t-\mathbf{x}^t\|+L\|\mathbf{x}^t-\mathbf{s}^t\|,
\]
where the last inequality follows by the Lipschitz property of $\nabla\mathcal{D}$. Taking the limit as $t\to\infty$, we have $\lim_{t\to\infty}\mathbf{d}^t=\mathbf{0}$.

Let $\{\mathbf{x}^{t_j}\}_{j\ge 0}$ be a convergent subsequence of $\{\mathbf{x}^t\}_{t\ge 0}$ and denote its accumulation point by $\hat{\mathbf{x}}$. Suppose that we can show $\lim_{j\to\infty}\mathcal{F}(\mathbf{x}^{t_j})=\mathcal{F}(\hat{\mathbf{x}})$; then we would have, by the definition of the subgradient in Definition D.3.1, that $\lim_{j\to\infty}\mathbf{d}^{t_j}\in\partial\mathcal{F}(\hat{\mathbf{x}})$, hence $\mathbf{0}\in\partial\mathcal{F}(\hat{\mathbf{x}})$, which is the desired result. In the following, we justify $\lim_{j\to\infty}\mathcal{F}(\mathbf{x}^{t_j})=\mathcal{F}(\hat{\mathbf{x}})$. Because $\mathcal{D}$ is continuous, we have $\lim_{j\to\infty}\mathcal{D}(\mathbf{x}^{t_j})=\mathcal{D}(\hat{\mathbf{x}})$. Hence, we are left to show $\lim_{j\to\infty}\mathcal{R}(\mathbf{x}^{t_j})=\mathcal{R}(\hat{\mathbf{x}})$. Using (D.14), we have
\[
\mathcal{R}(\hat{\mathbf{x}}) \ge \mathcal{R}(\mathbf{x}^{t_j}) + \left\langle\frac{\mathbf{s}^{t_j}-\mathbf{x}^{t_j}}{\gamma}-\nabla\mathcal{D}(\mathbf{s}^{t_j}),\,\hat{\mathbf{x}}-\mathbf{x}^{t_j}\right\rangle,
\]
which implies, since the inner product vanishes as $j\to\infty$ ($\hat{\mathbf{x}}-\mathbf{x}^{t_j}\to\mathbf{0}$ while the other factor remains bounded),
\[
\limsup_{j\to\infty}\mathcal{R}(\mathbf{x}^{t_j}) \le \mathcal{R}(\hat{\mathbf{x}}). \tag{D.18}
\]
Because $\mathcal{R}$ is also lower semi-continuous,
\[
\liminf_{j\to\infty}\mathcal{R}(\mathbf{x}^{t_j}) \ge \mathcal{R}(\hat{\mathbf{x}}). \tag{D.19}
\]
By (D.18) and (D.19), we have obtained $\lim_{j\to\infty}\mathcal{R}(\mathbf{x}^{t_j})=\mathcal{R}(\hat{\mathbf{x}})$.

For the specific $\mathcal{R}$ defined in (5.18), we have that $\{\mathbf{x}^t\}_{t\ge 0}$ is constrained in the bounded set $\mathcal{B}$. The next proposition, Proposition D.3.2, shows that the data-fidelity term $\mathcal{D}$ defined in (5.17) has Lipschitz gradient on a bounded set. In relaxed FISTA, the gradient is taken at $\mathbf{s}^t$, hence we need to check the boundedness of $\{\mathbf{s}^t\}_{t\ge 0}$ as well. Note that $\mathbf{s}^{t+1}$ is obtained from (5.23), hence it is a linear combination of $\mathbf{x}^t$ and $\mathbf{x}^{t-1}$, where the weight $\alpha\left(\frac{\theta_t-1}{\theta_{t+1}}\right)\in[0,1)$ since $\alpha\in[0,1)$ and $\frac{\theta_t-1}{\theta_{t+1}}\le 1$ by (5.22). Hence, $\{\mathbf{s}^t\}_{t\ge 0}$ is also bounded.

Proposition D.3.2. Let $\mathcal{D}$ be defined in (5.17), and let $\mathbb{U}\subset\mathbb{R}^N$ be a bounded set. Assume that $\|\mathbf{u}_{\mathrm{in}}\|<\infty$ and the matrices $\mathbf{L}$ and $\bar{\mathbf{L}}$ defined in (5.4) and (5.14), respectively, are non-singular for all $\mathbf{x}\in\mathbb{U}$. Then $\mathcal{D}(\mathbf{x})$ with either $\mathcal{A}$ or $\bar{\mathcal{A}}$ has Lipschitz gradient on $\mathbb{U}$. That is, there exists an $L\in(0,\infty)$ such that
\[
\|\nabla\mathcal{D}(\mathbf{x}_1)-\nabla\mathcal{D}(\mathbf{x}_2)\| \le L\|\mathbf{x}_1-\mathbf{x}_2\|, \quad \forall\mathbf{x}_1,\mathbf{x}_2\in\mathbb{U}. \tag{D.20}
\]

Proof. For a vector or matrix that is a function of $\mathbf{x}$, a subscript $i$ denotes its value evaluated at $\mathbf{x}_i$ (e.g., $\mathbf{L}_i$, $\mathbf{u}_i$, $\mathbf{z}_i$). Using this notation, we have
\[
\|\nabla\mathcal{D}(\mathbf{x}_1)-\nabla\mathcal{D}(\mathbf{x}_2)\|
\le \|\mathrm{diag}(\mathbf{u}_1)^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}_1-\mathrm{diag}(\mathbf{u}_2)^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}_2\|
+ \|\mathrm{diag}(\mathbf{u}_1)^{\mathsf{H}}\mathbf{G}^{\mathsf{H}}\mathbf{v}_1-\mathrm{diag}(\mathbf{u}_2)^{\mathsf{H}}\mathbf{G}^{\mathsf{H}}\mathbf{v}_2\|.
\]
Label the two terms on the RHS as $T_1$ and $T_2$. We will prove $T_1\le L_1\|\mathbf{x}_1-\mathbf{x}_2\|$ for some $L_1\in(0,\infty)$; $T_2\le L_2\|\mathbf{x}_1-\mathbf{x}_2\|$ can be proved in a similar way for some $L_2\in(0,\infty)$. The result (D.20) is then obtained by letting $L=L_1+L_2$. We have
\[
\begin{aligned}
T_1 &\le \|\mathrm{diag}(\mathbf{u}_1)^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}_1-\mathrm{diag}(\mathbf{u}_2)^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}_1\|
+ \|\mathrm{diag}(\mathbf{u}_2)^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}_1-\mathrm{diag}(\mathbf{u}_2)^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{z}_2\|\\
&\le \|\mathbf{u}_1-\mathbf{u}_2\|\,\|\mathbf{H}\|_{\mathrm{op}}\,\|\mathbf{z}_1\|
+ \|\mathbf{L}_2^{-1}\|_{\mathrm{op}}\,\|\mathbf{u}_{\mathrm{in}}\|\,\|\mathbf{H}\|_{\mathrm{op}}\,\|\mathbf{z}_1-\mathbf{z}_2\|,
\end{aligned}
\]
where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm and the last inequality uses the fact that $\|\mathrm{diag}(\mathbf{d})\|_{\mathrm{op}} = \max_{n\in[N]}|d_n| \le \|\mathbf{d}\|$. We now bound $\|\mathbf{u}_1-\mathbf{u}_2\|$ and $\|\mathbf{z}_1-\mathbf{z}_2\|$:
\[
\begin{aligned}
\|\mathbf{u}_1-\mathbf{u}_2\| &= \|\mathbf{L}_1^{-1}\mathbf{u}_{\mathrm{in}}-\mathbf{L}_2^{-1}\mathbf{u}_{\mathrm{in}}\|
\le \|\mathbf{L}_1^{-1}-\mathbf{L}_2^{-1}\|_{\mathrm{op}}\,\|\mathbf{u}_{\mathrm{in}}\|
= \|\mathbf{L}_1^{-1}(\mathbf{L}_2-\mathbf{L}_1)\mathbf{L}_2^{-1}\|_{\mathrm{op}}\,\|\mathbf{u}_{\mathrm{in}}\|\\
&\le \|\mathbf{L}_1^{-1}\|_{\mathrm{op}}\,\|\mathbf{G}\|_{\mathrm{op}}\,\|\mathbf{x}_1-\mathbf{x}_2\|\,\|\mathbf{L}_2^{-1}\|_{\mathrm{op}}\,\|\mathbf{u}_{\mathrm{in}}\|,\\
\|\mathbf{z}_1-\mathbf{z}_2\| &\le \|\mathbf{H}\,\mathrm{diag}(\mathbf{x}_1)\mathbf{u}_1-\mathbf{H}\,\mathrm{diag}(\mathbf{x}_1)\mathbf{u}_2\|
+ \|\mathbf{H}\,\mathrm{diag}(\mathbf{x}_1)\mathbf{u}_2-\mathbf{H}\,\mathrm{diag}(\mathbf{x}_2)\mathbf{u}_2\|\\
&\le \|\mathbf{H}\|_{\mathrm{op}}\,\|\mathbf{x}_1\|\,\|\mathbf{u}_1-\mathbf{u}_2\|
+ \|\mathbf{H}\|_{\mathrm{op}}\,\|\mathbf{x}_1-\mathbf{x}_2\|\,\|\mathbf{L}_2^{-1}\|_{\mathrm{op}}\,\|\mathbf{u}_{\mathrm{in}}\|.
\end{aligned}
\]
Then the result $T_1\le L_1\|\mathbf{x}_1-\mathbf{x}_2\|$ follows by noticing that $\|\mathbf{x}_1\|$, $\|\mathbf{u}_{\mathrm{in}}\|$, $\|\mathbf{G}\|_{\mathrm{op}}$, $\|\mathbf{H}\|_{\mathrm{op}}$, and $\|\mathbf{L}_i^{-1}\|_{\mathrm{op}}$ for $i=1,2$ are bounded, and the fact that $\|\mathbf{z}_1\| \le \|\mathbf{y}\|+\|\mathbf{H}\|_{\mathrm{op}}\,\|\mathbf{x}_1\|\,\|\mathbf{L}_1^{-1}\|_{\mathrm{op}}\,\|\mathbf{u}_{\mathrm{in}}\| < \infty$.

The Lipschitz property of the gradient (5.28) in the 3D case can be proved in a similar way; we omit the proof to avoid repetition.

Remark D.3.3. By Propositions D.3.1 and D.3.2, we conclude that if we use relaxed FISTA to solve our diffractive imaging problem, namely the nonconvex optimization problem (5.16) with $\mathcal{D}$ defined in (5.17) and $\mathcal{R}$ defined in (5.18), then every accumulation point of $\{\mathbf{x}^t\}_{t\ge 0}$ generated by relaxed FISTA is a critical point of $\mathcal{F}$ defined in (5.16).
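For intuition on how Proposition D.3.1 plays out in practice, the sketch below runs a relaxed-FISTA-style iteration on a convex toy problem (a small lasso instance), since the actual diffractive-imaging cost is defined in Chapter 5. The exact updates (5.21)–(5.23) are not restated in this appendix, so the sketch assumes the standard form suggested by the proof: a proximal-gradient step at $\mathbf{s}^t$, the usual FISTA sequence $\theta_t$, and a relaxation factor $\alpha\in[0,1)$ scaling the momentum, with step size $\gamma=\frac{1-\alpha^2}{2L}$. It is an illustration only, not the thesis implementation.

import numpy as np

# Relaxed-FISTA-style iteration on a toy lasso problem (a sketch under assumptions;
# the exact updates (5.21)-(5.23) and the imaging cost are defined in Chapter 5).
rng = np.random.default_rng(5)
m, N, lam, alpha = 40, 20, 0.1, 0.8
A = rng.normal(size=(m, N))
y = rng.normal(size=m)

L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad D
gamma = (1.0 - alpha ** 2) / (2.0 * L)           # step size from Proposition D.3.1
grad_D = lambda x: A.T @ (A @ x - y)
prox_R = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)
cost = lambda x: 0.5 * np.linalg.norm(A @ x - y) ** 2 + lam * np.sum(np.abs(x))

x_prev = np.zeros(N)
x = np.zeros(N)
s = np.zeros(N)
theta = 1.0
for t in range(500):
    x_prev, x = x, prox_R(s - gamma * grad_D(s))                  # proximal-gradient step at s
    theta_prev, theta = theta, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))
    s = x + alpha * ((theta_prev - 1.0) / theta) * (x - x_prev)   # relaxed momentum
print(cost(np.zeros(N)), cost(x))                # the cost decreases from the initialization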
