Bayesian Methods for Sparse
Data Decomposition and Blind
Source Separation
Evangelos Roussos
Wolfson College
University of Oxford
A thesis submitted for the degree of
Doctor of Philosophy
Trinity Term 2012
Contents
1 Introduction 1
2 Principles of Data Decomposition 12
2.1 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Probability Space . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Three Simple Rules . . . . . . . . . . . . . . . . . . . . 14
2.2 Second order decompositions: Singular value decomposition
and Principal Component Analysis . . . . . . . . . . . . . . . 16
2.3 Higher-order decompositions: Independent Component Analysis 18
2.3.1 Probabilistic Inference for ICA . . . . . . . . . . . . . . 24
2.4 Sparse Decompositions . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Parsimonious representation of data . . . . . . . . . . . 30
2.4.2 Sparse Decomposition of Data Matrices . . . . . . . . . 35
2.5 The Importance of using Appropriate Latent Signal Models . 41
2.6 Data modelling: A Bayesian approach . . . . . . . . . . . . . 46
3 An Iterative Thresholding Approach to Sparse Blind Source
Separation 50
3.1 Blind Source Separation as a Blind Inverse Problem . . . . . . 50
3.2 Blind Source Separation by Sparsity Maximization . . . . . . . 58
3.2.1 Exploiting Sparsity for Signal Reconstruction . . . . . 58
3.2.2 Exploiting Sparsity for Blind Source Separation: Sparse
Component Analysis . . . . . . . . . . . . . . . . . . . 60
3.3 Sparse Component Analysis by Iterative Thresholding . . . . . 76
3.3.1 Iterative Thresholding . . . . . . . . . . . . . . . . . . 76
3.3.2 The SCA-IT Algorithm . . . . . . . . . . . . . . . . . . 81
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 85
3.4.1 Natural Image Separation . . . . . . . . . . . . . . . . 86
3.4.2 Blind Separation of More Sources than Mixtures . . . . 92
3.4.3 Extracting Weak Biosignals in Noise . . . . . . . . . . 100
3.5 Setting the Threshold . . . . . . . . . . . . . . . . . . . . . . . 120
4 Learning to Solve Sparse Blind Inverse Problems using Bayesian
Iterative Thresholding 124
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2 Bayesian Inversion . . . . . . . . . . . . . . . . . . . . . . . . 129
4.3 Sparse Component Analysis Generative Model . . . . . . . . . 132
4.3.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . 132
4.3.2 Construction of a Hierarchical Graphical Model for
Blind Inverse Problems under Sparsity Constraints . . 133
4.3.3 Latent Signal Model: Wavelet-based Functional Prior . 142
4.4 Inference and Learning by Variational Lower Bound Maxi-
mization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.4.1 Variational Lower Bounding . . . . . . . . . . . . . . . 148
4.4.2 Estimation Equations . . . . . . . . . . . . . . . . . . . 154
4.4.3 Computing the SCA free energy . . . . . . . . . . . . . 166
4.4.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.5.1 Comparison of ordinary and Bayesian iterative thresh-
olding . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 171
4.6.1 BIT-SCA: Simulation Results . . . . . . . . . . . . . . 171
4.6.2 Extraction of weak biosignals in noise . . . . . . . . . . 177
5 Variational Bayesian Learning for Sparse Matrix Factoriza-
tion 180
5.1 Sparse Representations and Sparse Matrix Factorization . . . 180
5.2 Bayesian Sparse Decomposition Model . . . . . . . . . . . . . 190
5.2.1 Graphical modelling framework . . . . . . . . . . . . . 192
5.2.2 Modelling sparsity: Bayesian priors and wavelets . . . . 195
5.2.3 Hierarchical mixing model and Automatic Relevance
Determination. . . . . . . . . . . . . . . . . . . . . . . 200
5.2.4 Noise level model . . . . . . . . . . . . . . . . . . . . . 201
5.3 Approximate Inference by Free Energy Minimization . . . . . 201
5.3.1 Variational Bayesian Inference . . . . . . . . . . . . . . 201
5.3.2 Posteriors for the variational Bayesian sparse matrix
factorization model . . . . . . . . . . . . . . . . . . . . 204
5.3.3 Free Energy . . . . . . . . . . . . . . . . . . . . . . . . 210
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
5.4.1 Auditory stimulus data set . . . . . . . . . . . . . . . . 212
5.4.2 Auditory-visual data set . . . . . . . . . . . . . . . . . 214
5.4.3 Visual data set: An Experiment with Manifold Learning . . 214
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6 Epilogue 222
List of Figures
1.1 A sketch of brain activations captured by functional MRI . . . 5
2.1 Seeking structure in data. . . . . . . . . . . . . . . . . . . . . 20
2.2 The Olshausen and Field model as a network . . . . . . . . . . 31
2.3 Activities resulting from the model of Olshausen and Field . . 32
2.4 Learned receptive fields (filters) from the sparse coding algo-
rithm of Olshausen and Field. . . . . . . . . . . . . . . . . . . 33
2.5 Geometric interpretation of sparse representation. . . . . . . . 34
2.6 Clustering of a sparse set of points on the unit hypersphere. . 35
2.7 A geometric intuition into sparse priors. . . . . . . . . . . . . 38
2.8 Effect of an incorrect source model specification. . . . . . . . . 42
2.9 Effect of an incorrect source model specification for a blind
image separation problem. . . . . . . . . . . . . . . . . . . . . 43
2.10 ‘Cameraman’ standard test image. . . . . . . . . . . . . . . . 45
2.11 Density modelling using the MoG model. . . . . . . . . . . . . 45
3.1 Conceptual scheme for forward/inverse problems. . . . . . . . 51
3.2 Sparsity function and derivative of the model of Olshausen
and Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Wavelet coefficients of the ‘Cameraman’ image. . . . . . . . . 64
3.4 A signal exhibiting smooth areas and localized lower dimen-
sional singularities. . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5 A family of ℓp norms for various values of p. . . . . . . . . . . 70
3.6 Three distributions of wavelet coefficients with equal variance
but different kurtosis . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 How the ℓ1 penalty leads to sparse representations . . . . . . . 73
3.8 Soft thresholding function, corresponding to ℓ1 penalization. . 79
3.9 An illustration of the Iterative Thresholding (IT) algorithm as
a double mapping operation among signal spaces. . . . . . . . 82
3.10 Sparse Component Analysis model as a layered “feedforward”
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.11 ‘Cows & Butterfly’ dataset: original sources, downscaled to
64× 64 pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.12 IFA results: Separated image of (a) ‘Cows’ and (b) ‘Butterfly’.
While the source images are recovered, some noise remains. . . . 89
3.13 Scatterplot of the ‘Cows & Butterfly’ estimated images, after
having been unmixed by IFA. . . . . . . . . . . . . . . . . . . 89
3.14 Probability density functions of the sources as estimated from
the IFA algorithm. M = 4 Gaussian components were used for
each source model. The brown curve is the estimated mixture
density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.15 SCA-IT results after convergence of the algorithm: separated
images of ‘Cows’ and ‘Butterfly’. . . . . . . . . . . . . . . . . 90
3.16 Scatterplot of the original and SCA-IT unmixed images. . . . 91
3.17 The ‘Cows & Butterfly’ estimated source space, (s1,n, s2,n), n = 1, . . . , N,
after convergence of the SCA-IT algorithm. . . . . . . . . . . 91
3.18 Empirical histograms of the ‘Cows & Butterfly’ sources after
convergence of the SCA-IT algorithm. . . . . . . . . . . . . . 92
3.19 ‘Terrace’ (reflections) dataset: observations. . . . . . . . . . . 92
3.20 ‘Terrace’ dataset: Scatterplot of observations, (x1,n, x2,n). . . 93
3.21 ‘Terrace’ dataset: Estimated (unmixed) source images. . . . . 93
3.22 ‘Terrace’ dataset: Scatterplot of estimated sources after 100
iterations of SCA/IT. . . . . . . . . . . . . . . . . . . . . . . . 93
3.23 ‘Terrace’ dataset: Empirical histograms of estimated sources
after 100 iterations of SCA/IT. . . . . . . . . . . . . . . . . . 94
3.24 Overcomplete blind source separation: true sources, sl, l = 1, . . . , 3
(unknown). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.25 Overcomplete blind source separation: observed sensor signals,
xd, d = 1, 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.26 Overcomplete blind source separation: scatterplot of sensor
signals, (x1,n, x2,n). . . . . . . . . . . . . . . . . . . . . . . . 97
3.27 Overcomplete blind source separation: spectrogram, Xl(ωk, tm),
of the sensor signals. . . . . . . . . . . . . . . . . . . . . . . . 98
3.28 Overcomplete blind source separation: scatterplot of sensor
signals in the transform domain, (X1,n, X2,n), where n = n(ωk, tm)
is the single index that results from flattening each two-dimensional
spectrogram image into a long vector. . . . . . . . . . . . . . . 98
3.29 Overcomplete blind source separation: estimated sources, sl,
l = 1, . . . , 3, by the SCA/IT algorithm. . . . . . . . . . . . . . 99
3.30 Simulated fMRI data. Unmixing the 4 mixtures of rectangular
components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.31 Results of the SCA/IT model applied on the simulated data
of subsection 3.4.3, for Cases 1–4 . . . . . . . . . . . . . . . . 110
3.32 Spatial map from SPM . . . . . . . . . . . . . . . . . . . . . . 112
3.33 Timecourse from SPM . . . . . . . . . . . . . . . . . . . . . . 112
3.34 Thresholded spatial map from SPM . . . . . . . . . . . . . . . 113
3.35 Spatial maps s1, s2 from ICA. . . . . . . . . . . . . . . . . . . 114
3.36 Timecourses a1, a2 from ICA . . . . . . . . . . . . . . . . . . 114
3.37 Spatial map s1 from SCA . . . . . . . . . . . . . . . . . . . . 116
3.38 Timecourse a1 from SCA . . . . . . . . . . . . . . . . . . . . . 116
3.39 Correlation matching . . . . . . . . . . . . . . . . . . . . . . . 118
3.40 Receiver Operating Characteristic (ROC) curves . . . . . . . . 119
3.41 Schematic representation of the Iterative Thresholding algorithm
for blind signal separation as implemented in our system. . . . . 123
4.1 The Bayesian solution to an inverse problem is a posterior
distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.2 A directed acyclic graph (DAG) representation for generative
latent variable models. . . . . . . . . . . . . . . . . . . . . . . 133
4.3 Structural transformations of the standard component analy-
sis graphical model. . . . . . . . . . . . . . . . . . . . . . . . . 135
4.4 An image exhibiting smooth areas and structural features. . . 146
4.5 Heavy-tailed distributions for wavelet coefficients. . . . . . . . 147
4.6 Thresholding (shrinkage) functions. . . . . . . . . . . . . . . . 160
4.7 BIT-SCA: Source Reconstruction . . . . . . . . . . . . . . . . 173
4.8 BIT-SCA: Scatterplot . . . . . . . . . . . . . . . . . . . . . . . 173
4.9 BIT-SCA: Wavelet coefficients . . . . . . . . . . . . . . . . . . 174
4.10 BIT-SCA: Empirical histograms . . . . . . . . . . . . . . . . . 174
4.11 BIT-SCA: Evolution of the elements A . . . . . . . . . . . . . 175
4.12 Learning curve: Evolution of the Amari distance . . . . . . . . 176
4.13 Evolution of the NFE . . . . . . . . . . . . . . . . . . . . . . . 177
4.14 Spatial map s1 from BIT-SCA . . . . . . . . . . . . . . . . . . 178
4.15 Timecourse a1 from BIT-SCA . . . . . . . . . . . . . . . . . . 178
4.16 Receiver Operating Characteristic curve . . . . . . . . . . . . 179
5.1 Matrix factorization view of component analysis. . . . . . . . . 184
5.2 Geometric interpretation of sparse representation. . . . . . . . 188
5.3 The probabilistic graphical model for feature-space component
analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.4 Variational Bayesian Sparse Matrix Factorization graphical
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.5 Sparse mixture of Gaussians (SMoG) prior for the coefficients
cl,λ. Blue/green curves: Gaussian components; thick black
curve, mixture density, p(cl,λ). . . . . . . . . . . . . . . . . . . 197
5.6 A directed graphical model representation of a sparse mixture
of Gaussian (SMoG) densities. . . . . . . . . . . . . . . . . . . 198
5.7 Time courses and corresponding spatial maps from applying
the variational Bayesian sparse decomposition model to an
auditory fMRI data set. . . . . . . . . . . . . . . . . . . . . . 213
5.8 Time courses and corresponding spatial maps from apply-
ing the variational Bayesian sparse decomposition model to
a visual-auditory fMRI data set. . . . . . . . . . . . . . . . . . 215
5.9 Evolution of the negative free energy of the VB-SMF model . 219
5.10 Variances of the columns of the mixing matrix. . . . . . . . . . 220
5.11 Time course and corresponding spatial map from applying the
variational Bayesian sparse decomposition model to fMRI data
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
5.12 Receiver operating characteristic (ROC) curve for the LPP/VB-
SMF experiment . . . . . . . . . . . . . . . . . . . . . . . . . 221
Abstract
In an exploratory approach to data analysis, it is often useful to consider
the observations as generated from a set of latent generators or ‘sources’ via
a generally unknown mapping. Reconstructing sources from their mixtures is
an extremely ill-posed problem in general. However, solutions to such inverse
problems can, in many cases, be achieved by incorporating prior knowledge
about the problem, captured in the form of constraints.
This setting is a natural candidate for the application of the Bayesian
methodology, allowing us to incorporate “soft” constraints in a natural man-
ner. This Thesis proposes the use of sparse statistical decomposition methods
for exploratory analysis of datasets. We make use of the fact that many nat-
ural signals have a sparse representation in appropriate signal dictionaries.
The work described in this Thesis is mainly driven by problems in the analysis
of large datasets, such as those from functional magnetic resonance imaging
of the brain for the neuro-scientific goal of extracting relevant ‘maps’ from
the data.
We first propose Bayesian Iterative Thresholding, a general method for
solving blind linear inverse problems under sparsity constraints, and we ap-
ply it to the problem of blind source separation. The algorithm is derived
by maximizing a variational lower-bound on the likelihood. The algorithm
generalizes the recently proposed method of Iterative Thresholding. The
probabilistic view enables us to automatically estimate various hyperparam-
eters, such as those that control the shape of the prior and the threshold,
in a principled manner.
We then derive an efficient fully Bayesian sparse matrix factorization
model for exploratory analysis and modelling of spatio-temporal data such
as fMRI. We view sparse representation as a problem in Bayesian inference,
following a machine learning approach, and construct a structured genera-
tive latent-variable model employing adaptive sparsity-inducing priors. The
construction allows for automatic complexity control and regularization as
well as denoising.
The performance and utility of the proposed algorithms are demonstrated
on a variety of experiments using both simulated and real datasets. Exper-
imental results with benchmark datasets show that the proposed algorithm
outperforms state-of-the-art tools for model-free decompositions such as in-
dependent component analysis.
Acknowledgements
It has been quite a journey, from Geomatics Engineering, to Neural Networks and Machine Learning, to Neuroscience image analysis. During this trip I have been lucky enough to meet, interact with, and collaborate with many excellent scientists. First and foremost, I would like to thank my supervisor, Prof. Steven Roberts, a fantastic scientist, engineer, and teacher, for introducing me to the wonderful world of ICA and graphical models and for his constant advice, encouragement, and support. I would like to thank my advisor at Princeton University, Prof. Ingrid Daubechies, the mother of wavelets, for asking the important and difficult scientific questions, for introducing me to the method of Iterative Thresholding, and for teaching me all that I know about wavelets. I would like to thank my mentor, Prof. Michael Sakellariou, a true Renaissance Man, for sharing with me deep insights about Engineering in the 21st Century, for his constant support, and for discussing both science and life with me. I would like to thank Prof. Demetre Argialas for teaching me everything I know about Remote Sensing and Geomorphometry and for his kindness. I would like to thank all the people at Oxford's Pattern Analysis and Machine Learning group, and especially Dr. Iead Rezek for scientific discussions, his help and support, and his friendship, Dr. Michael Osborne for discussions and for being kind enough to share his Thesis style file, and Dr. Steve Reece for discussions and fun. I would also like to thank the people at Princeton's Program in Applied and Computational Mathematics and the Princeton Neuroscience Institute, and especially Professors Jim Haxby and Jonathan Cohen for insightful discussions about functional MRI and neuroscience in general, as well as Sylvain Takerkart and Michael Benharrosh for discussions and fun. Last, but not least, I would like to offer my deepest gratitude to my family and friends for their love and for making all this possible.
Chapter 1
Introduction
In an exploratory approach to data analysis, it is often useful to consider the
observations as intricate mixtures generated from a set of latent generators, or
‘sources’, via a generally unknown mapping [149]. In the neuroimaging field,
for example, McKeown et al. [125] proposed the decomposition of spatio-
temporal functional magnetic resonance imaging (fMRI) data of the brain
into independent spatial components, expressing the observations as a lin-
ear superposition of products of ‘spatial maps’ and time-courses. In weather
data mining, Basak et al. [14] proposed to model weather phenomena as mix-
tures of ‘stable activities’ where the variation in the observed variables can
be viewed as a mixture of several “independently occurring” spatio-temporal
signals with different strengths. Many other physical and biological phe-
nomena can be better studied in a ‘latent’ signal space that best reveals the
internal structure of the data.
The problem of recovering the sources from a mixture is an extremely ill-
posed one, especially for the noisy ‘overcomplete’ case, where we have more
sources than observations1. Solutions to such inverse problems can, in many
1 In the context of the problem of mixture separation, the term ‘overcomplete’ refers to the case where the number of hidden sources is greater than the number of sensors.
cases, be achieved by incorporating prior knowledge about the problem, cap-
tured in the form of constraints [45]. From a machine learning perspective,
such problems fall in the unsupervised learning class of problems, since only
the “responses” but not the “stimuli” are observed in any training set. In
such cases, an internal data organization criterion, such as Infomax [113],
can be used; this acts implicitly as a regularizer, making the problem well-
posed2. Problems that unsupervised learning is aimed at, and which are relevant
to this text, are (Ghahramani and Roweis, [74]):
• Finding hidden “causes” or sources in data
• Data density modelling
• Dimensionality reduction
• (Soft) Clustering
Component analysis (CA) is a data analysis framework that intends to
extract the underlying hidden factors, sources, or “causes”3 that generate
the observed data. CA can be seen as a class of methods that attempt to
solve the following complementary problems:
• Blind separation of hidden sources (BSS) under an unknown mixing—
an inverse problem. The cocktail party problem in acoustic signal pro-
cessing and systems and cognitive neuroscience is the prototypical ap-
plication of BSS. It refers to the ability of the auditory system to sep-
In the linear mixing scenario (see Sect. 2.3), this has the consequence that the number of columns in the mixing operator is greater than the number of its rows. The system of equations formed is, therefore, underdetermined; see Eq. 2.3 and Fig. 5.1 later.
2 Another example of a classical organization principle, coming from the pattern recognition field, is to minimize intra-class distances of patterns while maximizing inter-class ones. This principle is suitable for classification problems.
3 Where the term ‘causal’ is taken to mean the relation ‘is-influenced-by’ here. We do not try to model causality in the sense of Pearl [140], for example, in this text.
arate and extract “interesting” sources, such as the voice of one’s in-
terlocutor, from a mixture of conversations and background noises4.
• Representation of data in new coordinates such that the resulting
signals have certain desired properties, such as being maximally inde-
pendent or, in our case, maximally sparse. Indeed, geometrically, CA
can be seen as a method for finding an intrinsic coordinate frame in
data space (see Figs 2.1 and 2.5 later).
• Soft clustering. Miskin provides an additional view of component
analysis in [94], as an alternative to ‘hard’ multi-way clustering. Under
this view, the data vectors are decomposed as linear combinations of
hidden prototypical vectors (the columns of a “mixing” matrix) modu-
lated by a weight matrix containing the amount of contribution of each
basis to the mixture5 (corresponding to the “sources”). An example of
soft clustering by source separation can be found in [149].
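The soft-clustering reading above can be made concrete with a tiny numerical sketch. The prototype vectors and membership weights below are hypothetical values chosen purely for illustration: the columns of A play the role of the prototypes, and each column of S holds one data point's memberships (an indicator vector for hard clustering, fractional weights for soft clustering).

```python
import numpy as np

# Hypothetical prototype ("basis") vectors as columns of a mixing matrix A.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hard clustering: each source column is an indicator vector,
# so every data point copies exactly one prototype.
S_hard = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0]])
X_hard = A @ S_hard
assert np.allclose(X_hard[:, 0], A[:, 0])   # point 0 = prototype 0

# Soft clustering: fractional memberships blend the prototypes.
S_soft = np.array([[0.7, 0.2, 0.0],
                   [0.3, 0.8, 1.0]])
X_soft = A @ S_soft
assert np.allclose(X_soft[:, 0], 0.7 * A[:, 0] + 0.3 * A[:, 1])
```

In both cases the data matrix factors as X = A S; the views differ only in the constraints placed on the columns of S.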
There have been several algorithms for CA and BSS in the statistics and
machine learning literature, from information maximization [16], maximum
likelihood [118], and Bayesian [105], [36] points of view, as an energy-based
model [163], as well as several others from a signal processing perspective
(see for example [87]). This text proposes the use of sparse statistical de-
composition methods [136], [109], [76], [111] for the exploratory analysis of
data sets. We will consider blind inverse problems such as blind source
separation and data decomposition under an unsupervised machine learning
framework, for the purpose of learning the parameters and hyperparameters
of the models from data. In particular, we shall approach the above problems
4 While the brain has solved the problem for millions of years, developing computational methods that perform equally well is an open research problem.
5 The case of pure mixtures corresponds to source vectors of the form sn = (0, . . . , 0, 1, 0, . . . , 0) for the n-th data point, with 1 in the l-th position and zeroes in all other positions, picking the l-th prototypical vector. Soft clustering corresponds to the case where the elements sn,l may be non-integers.
from a hierarchical, probabilistic modelling perspective and the problem of
estimating the sources as an inference problem in the Bayesian framework
[147], [99].
Sparse Statistical Decomposition Methods
Problem and Motivating Application
The work described in this text is mainly motivated by problems in the anal-
ysis of high-dimensional spatiotemporal datasets, such as those obtained from
functional magnetic resonance imaging of the brain for the neuro-scientific
goal of extracting relevant ‘maps’ from the data. This can be stated as a
‘blind’ source separation problem [125]. Recent experiments in the field of
neuroscience [82], [46] show that these maps are sparse, in some appropriate
sense. Figure 1.1 provides a generic description of the sought-for maps in
a nutshell. A more detailed overview of the problem of decomposing fMRI
data will be given in Sect. 3.4.3. The separation problem can be formu-
lated as a bilinear decomposition model solved by component analysis-type
algorithms, viewed as techniques for seeking sparse components, provided
appropriate distributions for the sources have been assigned [46]. In contrast
to many classical CA algorithms, where an unmixing matrix is first learned
and from that the sources are deterministically estimated (see e.g. [15]),
the algorithms proposed here perform inference and learning of the mixing
operator and the sources simultaneously.
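The bilinear decomposition just described can be sketched numerically. The dimensions below (T time points, V voxels, L components) are made up for illustration and are not taken from any experiment in this thesis; the columns of A stand for component time-courses and the rows of S for spatially sparse maps, so the observed data is their superposition plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, L = 100, 500, 3              # time points, voxels, components (made up)

A = rng.normal(size=(T, L))        # columns: component time-courses
S = np.zeros((L, V))               # rows: spatial maps, sparse over voxels
S[0, :20] = 1.0
S[1, 200:230] = -1.5
S[2, 450:460] = 2.0

X = A @ S + 0.01 * rng.normal(size=(T, V))   # observations: mixture + noise

# Up to the small noise floor, X has rank L: only L singular values are large.
sv = np.linalg.svd(X, compute_uv=False)
assert sv[L - 1] > 10 * sv[L]
```

The decomposition problem is to recover A and S given only X; with the inference-and-learning approach taken here, both are estimated simultaneously rather than via a separately learned unmixing matrix.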
Inspired by problems in imaging neuroscience, namely the extraction
of activations from brain fMRI data, Daubechies et al. [46] conducted a
large-scale experiment designed to test the independence assumption as the
underlying reason for the apparent success of independent component anal-
Figure 1.1: A sketch of brain activations (red areas) as a result of environmental stimuli, captured by functional MRI, which is one of the patterns we want to extract with our models. In this figure, which is traced from Haxby et al. [82], a brain slice in the transverse (horizontal) plane is shown. Our aim is to capture the neuroscientifically desirable properties of smoothness and localization of the activations and the sparsity of the spatial maps with respect to appropriately chosen signal dictionaries. This will be done by utilizing ‘soft’ constraints as explained later in the text. Note that the models should also allow for distributed representations to be extracted.
ysis decompositions6 in fMRI. They tested various parameters, such as the
effect of localisation (overlap) of the sources in the quality of separation, the
size of the region of interest (ROI, i.e. “search window”), and the effect of the
source prior, in a well-controlled experiment, both simulated and in-vivo, for
two ‘classical’ ICA methods, InfoMax and FastICA7 (Benharrosh et al., [17];
Golden, [77]; Daubechies et al., [46]). It was shown there that, while ICA
is robust in general, rather surprisingly, both methods occasionally failed to
separate sources that were designed to be independent while, on the contrary,
they managed to separate sources that were dependent, casting doubt on the
validity of the independence approach for brain fMRI. Real fMRI data are
6 Independent component analysis (ICA) is a class of methods designed to separate multivariate signals into components that are assumed to be statistically independent. The ICA method will be formally introduced in Chapter 2.
7 More general ICA algorithms, which assume less about the components, can separate into independent components mixtures for which Infomax and FastICA fail [152], [151]. Nevertheless, these two are the most used ICA algorithms for brain fMRI.
highly unlikely to be generated by brain processes that are independent, in
the mathematical sense at least, as demonstrated by the recent interest in ex-
tracting networks of interconnected “modules”. However, Daubechies et al.
noted that ICA algorithms work well for brain fMRI decompositions if the
components have sparse distributions8. They concluded that independence
is not the right mathematical framework for blind source separation in fMRI
and proposed the development of new mathematical tools based on explicitly
optimizing for sparsity. An analogous observation was made by Donoho and
Flesia [52] in the context of image processing.
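An operational signature of the sparse, heavy-tailed component distributions noted above (generalized exponentials with α < 2) is positive excess kurtosis. The sanity check below is purely illustrative, with arbitrary sample sizes and thresholds; it contrasts a Laplacian draw (α = 1, sparse) with a Gaussian one (α = 2, not sparse).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def excess_kurtosis(x):
    """Sample excess kurtosis: 0 for a Gaussian, positive for heavy tails."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

gauss = rng.normal(size=n)      # alpha = 2: light tails, not sparse
laplace = rng.laplace(size=n)   # alpha = 1: heavy tails, sparse-friendly

assert abs(excess_kurtosis(gauss)) < 0.1
assert excess_kurtosis(laplace) > 2.0   # true value for a Laplacian is 3
```

Distributions with mass concentrated near zero plus heavy tails are exactly the shape a sparsity-enforcing prior assigns to the expansion coefficients later in this text.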
The intuition gained from the above problems is applicable to a much
broader range of other scientific data-sets. Sparsity is a desirable property,
aiding interpretation of results. Sparsity-inducing constraints help us form
compressed representations of large, redundant, and noisy datasets by retaining
relevant information and discarding the irrelevant as “noise”. Sparsity
also facilitates model interpretability by revealing the most selective dimen-
sions in the data. In a machine learning context, this estimation can be done
in a data-driven fashion. The work presented here continues on this theme
of sparse data decompositions.
Bayesian inference provides a powerful methodology for data analysis
[156] by providing a principled method for using our domain knowledge,
making all our assumptions explicit [97]. The above setting is a natural
candidate for the application of the Bayesian methodology, allowing us to
incorporate ‘soft’ constraints in a natural manner, formulated as prior distri-
butions. Moreover, its probabilistic foundation allows for building hierarchi-
cal models that describe the data generation process, where hard decisions
about certain parameter settings can be deferred to higher levels of inference
and/or learned from the data itself. In a full Bayesian approach, parameters
8 In particular, ‘generalized exponential’ distributions, of the form p(s) ∝ exp(−β|s|^α) with α ≠ 2 — see the Introduction of ref. [46].
whose value is unknown are ‘integrated out’, a process which is referred to as
marginalization. Bayesian modelling allows the incorporation of uncertainty
in both the data and the model into the estimation process, thus prevent-
ing ‘overfitting’. It essentially builds families of models for learning and
inference, thus abstracting the model space. The outputs of fully Bayesian
inference are uncertainty estimates over all variables in the model, given the
data, i.e. posterior distributions.
Proposed approach
The general theme we are going to discuss in this text will be based on
choosing appropriate priors for the latent variables (i.e. the unknown signals)
that explicitly reflect our domain knowledge, prior beliefs, or expectations
about the solution. In particular, the main idea can be stated as
Statement 1 For sparse sources, sparsity, rather than independence, is the
right constraint to be imposed.
This will be captured in appropriate optimization functionals and in specially
constructed algorithms. Another central idea is that
Statement 2 Descriptive features lead to sparse representations.
In many cases sparsity is induced by transforming the physical-domain signals
into a feature-space with “maximum representability”. We note here that
sparsity is a relative property, with respect to a given ‘dictionary’ of signals.
This will become more concrete in section 3.2.2. In particular, in that section
we will answer the question of which dictionary is likely to result in sparse
representations of our particular data in question. The above statements
form the underlying hypothesis of this Thesis. To test the above ideas we
will develop models and algorithms for Bayesian data decomposition under
sparsity constraints.
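Statement 2, and the remark that sparsity is relative to a dictionary, can be made concrete with a toy signal (hypothetical, chosen only for illustration): it is dense in the sample ("physical") domain yet carried by just two coefficients in a Fourier dictionary.

```python
import numpy as np

n = 256
t = np.arange(n)
# A signal that is dense in the sample domain: two sinusoids, periodic over n.
x = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.sin(2 * np.pi * 12 * t / n)
dense_frac = np.mean(np.abs(x) > 1e-8)    # nearly every sample is nonzero

# ...but sparse in a Fourier dictionary: only two coefficients are active.
X = np.fft.rfft(x)
sparse_frac = np.mean(np.abs(X) > 1e-6 * np.abs(X).max())

assert dense_frac > 0.9
assert sparse_frac < 0.05                 # 2 active bins out of n//2 + 1
```

The same signal is thus "sparse" or not depending on the dictionary used to represent it, which is why choosing a dictionary suited to the data (Sect. 3.2.2) matters.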
In particular, in the third and fourth chapters of this Thesis, we propose
Bayesian Iterative Thresholding, a method for solving blind inverse problems
under sparsity constraints, with emphasis on the multivariate problem of
blind source separation. In contrast to optimizing for maximally independent
components, as in independent component analysis, we capture this more
natural constraint by explicitly optimizing for maximally sparse components.
We present two, complementary, formulations of the method. In the
third chapter, the problem is initially formulated in an energy minimization
framework such that it enforces sparseness and smoothness of the inferred
signals and a novel algorithm for blind source separation, iterative thresh-
olding (Daubechies et al., [45]), is presented. By exploiting the sparseness of
the components in appropriate dictionaries of functions and the choice of a
sparsity-enforcing prior on the associated expansion coefficients, we can ob-
tain very sparse representations of signals. However, this leads to a difficult
optimization problem. In order to solve it we use a variational formulation
that leads to an iterative algorithm that alternates between estimating the
sources and learning the mixing matrix. The formulation of the problem
using functional criteria, as an inverse problem, was first suggested by Ingrid
Daubechies while the author was a Visiting Fellow at Princeton University’s
Program of Applied and Computational Mathematics. It was inspired by
the needs of neuroscientists working with fMRI data, and in particular it
is the result of discussions with the members of the Princeton Neuroscience
Institute, for the problem of designing better algorithms for extracting mean-
ingful activations. It was given its energy interpretation and current form by
the author. In particular, I contributed the solution, implementation, and
experimental results. In addition, I proposed a new way of estimating the
critical threshold parameter.
We then employ Bayesian inversion ideas in the fourth chapter and show
how the Bayesian formulation generalizes the IT algorithm of Daubechies et
al. [45]. We cast blind inverse problems as statistical inference problems in
a Bayesian framework, and show how these two formulations, deterministic
and probabilistic, are related. The central modelling idea is that of structural
priors, in the sense of a ‘parametric free-form’ solution (Sivia, [156]). The
structural prior used captures the desired properties of localization, smooth-
ness, and sparseness of the unknown signals and is adaptively learned from
the data. We then derive a variational bound on the data likelihood and
propose a (structured) mean-field variational posterior. This enables us to
derive analytic formulas for the estimation of the latent signals. Under this
formulation, a generalized form of iterative thresholding emerges as a subrou-
tine of the full set of estimation equations of the unknowns of the problem.
The probabilistic view enables us to automatically estimate various hyperpa-
rameters, such as those that control the shape of the prior and the threshold,
in a principled manner; setting these manually can be a delicate and laborious
process.
In the fifth chapter, we propose a feature-space sparse matrix factoriza-
tion (SMF) model for spatiotemporal data, transforming the signals into a
domain where the modelling assumption of sparsity of their expansion coef-
ficients with respect to an auxiliary dictionary is natural [176], [91], [152].
Decomposition is performed completely in the wavelet domain for this model.
We will take the view of sparse representation as sparse matrix factorization
here. Blind sparse representation algorithms (i.e. sparse representations
in which both the dictionary and the coefficients of the representation are
unknown, sometimes called “model-free”) can be naturally formulated as
bilinear decomposition models, where the observations are modelled as a fac-
torization of a ‘mixing’ matrix times a ‘source’ matrix (see Fig. 5.1). That is,
the observations are sparsely represented, via the source coefficients, in the
base comprised of the mixing vectors. In this text we take the view of SMF
as a bilinear generative model, and the problem of estimating the factors as
a statistical inference problem in the Bayesian framework again [99].
An important goal of this model is to be a transparent alternative to
sparse representation models such as those of Li et al. [112], [111]. Our ap-
proach to achieving this is by using three ingredients: graphical probabilistic
models, automatic determination of each basis element’s relevance to the
representation, and Bayesian model selection. We follow a graphical mod-
elling formalism and use hierarchical source and mixing models. We then
apply Bayesian inference to the problem. This allows us to perform model
selection in order to infer the complexity of the decomposition, as well as
automatic denoising.
In many datasets, the intrinsic dimensionality of the data is much lower
than that of the ‘ambient’ observation space. For the case of Gaussian data
(i.e. data that can be described geometrically by a hyper-ellipsoid), the
principal axes, corresponding to the largest eigenvalues of the data covari-
ance matrix, contain the signal space, that is “most of the information”; the
rest are usually regarded as noise and can be discarded. For non-Gaussian
data, care has to be taken not to discard useful information that is of low
signal power. A prime example of this is in neuroimaging datasets, where
the ‘activations’ power is just a few percent (e.g. on the order of 1%–5%)
higher than the ‘baseline’ activity [106]. The model proposed here utilizes
a technique developed in the machine learning community that tries to pick
those “important” dimensions in the data, and is based on the principle of
parsimony.
The requirement of analyzing large and/or high-dimensional, real-world
datasets makes the use of sampling methods difficult and leads to the need
for an alternative, efficient Bayesian formulation. Since exact inference and
learning in such a model is intractable, we follow a structured variational
Bayesian approach in the conjugate-exponential family of distributions, for
efficient unsupervised learning in multi-dimensional settings. The algorithm
is intended to be used as an efficient fully Bayesian method for exploratory
analysis and modelling of scientific and statistical datasets and for decom-
posing multivariate timeseries, possibly embedded in a larger data analysis
system.
Using fMRI datasets as examples of complex, high-dimensional, spa-
tiotemporal data, we show that the proposed models outperform standard
tools for model-free decompositions such as independent component analysis.
Chapter 2
Principles of Data Decomposition
This chapter provides an introductory background on data decomposition
methods. We start from the classical algebraic method of singular value de-
composition and then introduce principal and independent component anal-
ysis. The chapter continues with the main subject of this Thesis, sparse
representation and decomposition. Finally an introduction to Bayesian in-
ference is given. Specific computational methods for Bayesian inference will
be introduced in the following chapters as needed.
2.1 Probability Theory
When modelling complex systems we are unavoidably faced with imperfect or
missing information, especially in the measurement and information sciences.
This may have several causes, but it is mainly due to
• The cost of obtaining and processing vast amounts of information,
• Inherent system complexity.
Probability theory is a conceptual and computational framework for rea-
soning under uncertainty. Probabilities model uncertainty regarding the oc-
curence of random events. Assigning probability measures on uncertain quan-
tities reflects precisely our lack of information about the quantities at hand.
According to Cox’s theorem [40], probability is the only consistent, universal
logic framework for quantitatively reasoning under uncertainty. Moreover,
probability theory offers a consistent framework for modelling and inference.
Jaynes [97] viewed probability theory as a unifying tool for plausible rea-
soning in the presence of uncertainty. From a modeler’s point of view, the
greatest practical advantage of probability theory is that it offers modular-
ity and extensibility: probability theory acts as “glue” for linking different
models together.
2.1.1 Probability Space
The axiomatic formulation of probability starts by defining a probability
space, which is a tuple (Ω, P ) that describes our idea about uncertainty with
respect to a random experiment. It defines:
• A sample space, Ω, of possible outcomes, ωi, of a random experiment
and
• A probability measure, P , which describes how likely an outcome is.
Now, let 𝒜 be a collection of subsets of Ω, called random events. Then for
A ∈ 𝒜 the two following conditions must hold:
• Probabilities must be non-negative, P (A) ≥ 0, and P (Ω) = 1,
• Probabilities must be additive: for two disjoint events, A, B,
P(A ∪ B) = P(A) + P(B) .
We also define the conditional probability, which can be thought of as "a
probability within a probability",
P(A|B) = P(A ∩ B) / P(B) ,   P(B) ≠ 0 .
Then random variables (r.v.'s) are defined as functions from Ω to a range,
R, e.g. a subset of ℝ or ℕ, etc. These can, inversely, define events as
A(x) = { ω ∈ Ω : [x(ω)] } ,
where [·] denotes a "predicate" (e.g. the event 'x > 2'), and therefore act
as "filters" of certain experimental outcomes.
Probability densities are defined as densities of probability measures:
p(x) = (d/dx) P(A(x)) |_x ,   with A(x) = { x′ ∈ [x, x + dx] } ,   x ∈ ℝ .
Finally, joint densities (for the case of two random variables X, Y) are
defined as
p_XY(x, y) = p({ ω : X(ω) = x ∧ Y(ω) = y }) .
Joint densities of more than two r.v.'s are defined analogously.
2.1.2 Three Simple Rules
Probability theory is a mathematically elegant theory. The whole construc-
tion can be based on the following three simple rules:
1. The Product rule, which gives the probability of the logical conjunction
of two events A and B,
P(A ∩ B) = P(A|B) P(B) .
This can be generalized to N events, giving the chain rule
P( ⋂_{i=1}^{N} A_i ) = ∏_{i=1}^{N} P( A_i | ⋂_{i′=1}^{i−1} A_{i′} ) ,
with the convention that the conditioning set is empty for i = 1.
This will be valuable for reasoning in Bayesian networks later.
2. Bayes' rule, which is a recipe that tells us how to update our knowledge
in the presence of new information, and can be derived directly from
the definition of conditional probability and the product rule,
P(A|B) = P(B|A) P(A) / P(B) ,   P(B) ≠ 0 .
3. Marginalization: given a joint density, p_XY(x, y), we obtain the marginal
density of X or Y by integration (i.e. we 'integrate out' the uncertainty
in one variable):
p_X(x) = ∫_{y ∈ 𝒴} p_XY(x, y) dy .
In principle, this is everything we need to know in order to perform prob-
abilistic modelling and inference.
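As a quick illustration (my own example, not part of the original text), the three rules can be exercised on a made-up discrete joint distribution; all the numbers below are invented for the sketch.

```python
import numpy as np

# Hypothetical joint P(X, Y) over two binary variables:
# X = disease status, Y = test outcome. Rows index X, columns index Y.
p_xy = np.array([[0.855, 0.095],   # X = 0 (healthy)
                 [0.005, 0.045]])  # X = 1 (disease)
assert np.isclose(p_xy.sum(), 1.0)

# Rule 3 (marginalization): sum out the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Rule 1 (product rule), rearranged: P(Y|X) = P(X, Y) / P(X).
p_y_given_x = p_xy / p_x[:, None]

# Rule 2 (Bayes' rule): P(X|Y) = P(Y|X) P(X) / P(Y).
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

print(p_x_given_y[1, 1])   # P(disease | positive test) ≈ 0.321
```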
2.2 Second order decompositions: Singular value decomposition and Principal Component Analysis
Let X be an M × N rectangular matrix,
X = [ x_{1,1} ⋯ x_{1,N}
        ⋮    ⋱    ⋮
      x_{M,1} ⋯ x_{M,N} ] ,
and assume without loss of generality that M ≥ N. The singular value
decomposition (SVD) is a factorization of the matrix X such that
X = U S Vᵀ ,   (2.1)
where the M × M orthogonal matrix U = [u_i] is called the left eigenvector
matrix of X, and the N × N orthogonal matrix Vᵀ = [v_iᵀ] is its right
eigenvector matrix. The square roots of the N eigenvalues of the covariance
matrices¹ XXᵀ and XᵀX are the singular values of X, σ_i = √λ_i, forming
the diagonal matrix S = diag(σ_i). The singular values are nonnegative and
sorted in decreasing order, σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_N, forming the
spectrum of X.
The singular value decomposition of X can also be written as
X = Σ_{i=1}^{r} σ_i u_i v_iᵀ ,   (2.2)
¹Note that XᵀX = V S² Vᵀ and X Xᵀ = U S² Uᵀ.
where u_i is the i-th eigenvector of XXᵀ and v_i is the i-th eigenvector of
XᵀX, as above, and r ≤ N is the rank of X. In other words, a matrix, X,
can be written as a linear superposition of its eigenimages, i.e. a sum of the
outer products of its left and right eigenvectors, u_i v_iᵀ, weighted by the square
roots of the eigenvalues, σ_i. The important fact here is that often relatively
few eigenvalues contain most of the 'energy' of the matrix X. If r < N, the
energy of a data matrix, X, can be captured with fewer than N variables,
since the relevant information is contained in a lower-dimensional subspace of
the measurement space. This is a form of dimensionality reduction. Note,
however, that due to the presence of noise in the data we may actually have
r = N. This also hints at a denoising scheme in which one regards the
smaller eigenvalues as corresponding to the noise and then forms a truncated
SVD, X_t = U_t S_t V_tᵀ, where t < r. X_t is the unique minimizer of ‖X − X_t‖
among all rank-t matrices under the Frobenius norm and also a minimizer
(perhaps not unique) under the 2-norm.
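The truncated-SVD idea is easy to verify numerically. The following sketch (an illustration of my own; the matrix sizes and the noise level are arbitrary) checks that the rank-t truncation attains exactly the energy of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, t = 8, 5, 2
# An approximately rank-2 matrix plus a little noise, so that r = N
# numerically but most of the 'energy' sits in two singular values.
X = rng.standard_normal((M, t)) @ rng.standard_normal((t, N)) \
    + 0.01 * rng.standard_normal((M, N))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.all(np.diff(s) <= 0)            # spectrum sorted decreasingly

X_t = (U[:, :t] * s[:t]) @ Vt[:t, :]      # truncated SVD, rank t
frob_err = np.linalg.norm(X - X_t, 'fro')
# The Frobenius error equals the energy in the discarded sigma_i.
assert np.isclose(frob_err, np.sqrt((s[t:] ** 2).sum()))
```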
Remark 1 For many important applications, such as fMRI and other biosig-
nals, the signal of interest represents only a small part of what is measured
(Lazar, [106]). Consequently, an optimization criterion that searches for
components with maximum signal power, such as PCA, will fail to recover
the signals we are looking for. Methods that exploit higher-order statistics in
the data are therefore needed. Second-order methods can still be very useful
as a preprocessing step, however, and are often used as such.
2.3 Higher-order decompositions: Independent Component Analysis
In this section we review the independent component analysis (ICA) ap-
proach to source separation, with an emphasis on the aspect of non-gaussianity.
Methodological and review literature includes [148], [87], [160], [75]. Addi-
tional resources are given below.
In the noiseless setting, the observation model for ICA is
x = As , (2.3)
where we have assumed that the observations have been de-meaned (i.e. we
have translated the coordinate system to the data centroid). ICA employs the
principle of redundancy reduction (Barlow, [11]) embodied in the requirement
of statistical independence among the components (Nadal and Parga, [130]).
In statistical language, this means that the joint density factorizes over
latent sources:
P(s) = ∏_{l=1}^{L} P_l(s_l) ,   (2.4)
where P(s) is the assumed distribution of the sources, s = (s_1, …, s_L),
regarded as stochastic variables, and the P_l(s_l) are appropriate non-Gaussian
priors.
Non-Gaussianity is the defining characteristic of the ICA family with respect
to PCA. We seek non-Gaussian sources for two, related, reasons:
• Identifiability,
• “Interestingness”.
Gaussians are not interesting since, by the central limit theorem, the superposition
of independent sources tends to be Gaussian. The concept of interestingness is directly exploited
in the related method of Projection Pursuit (Friedman and Tukey, [69]),
where the goal is to find the projection directions in a data set that show the
least Gaussian distributions. An important result relating to the probability
densities of the individual sources, due to Comon and based on the Darmois-
Skitovitch theorem2 [43], formalizes the above and states that for analysis
in independent components at most one source may be Gaussian, in order
for the model to be estimable [39]. Geometrically, this indeterminacy of
Gaussian point clouds is due to the rotational invariance of the Gaussian
distribution under orthogonal transformations (Hyvarinen, [90]). Gaussian
point clouds are optimally described in terms of the PCA decomposition
method (see Figure 2.1, left; the right panel follows Lewicki and Sejnowski).
This geometric view will be extended in section 5.2.
A related concept is that of linear structure (Rao, [145]; Beckmann and
Smith, [15]). A vector, x, is said to have a linear structure if it can be
decomposed as x = µ+ As, where s is a vector of statistically independent
random variables and the matrix A is of full column rank. Beckmann and
Smith use results from Rao [145] in order to ensure uniqueness of their ICA
decomposition. In particular, they use the fact that conditioned on knowing
the number of sources and the assumption of non-Gaussianity, there is no
non-equivalent decomposition into a pair (A, s), that is a decomposition with
mixing matrix that is not a rescaling and permutation of A.
Equation (2.4) is equivalent to minimizing the mutual information among
²The Darmois-Skitovitch theorem reads:
Theorem 1 (Darmois-Skitovitch) Let ξ_1, …, ξ_n be independent random variables and
let α_i and β_i, i = 1, …, n, be nonzero real numbers such that the random variables
Σ_{i=1}^{n} α_i ξ_i and Σ_{i=1}^{n} β_i ξ_i are independent. Then the ξ_i's are Gaussian.
See, for example, V. Bogachev, 'Gaussian measures' [19], p. 13.
Figure 2.1: Seeking structure in data. Component analysis can be viewed as a family of data representation methods. The challenging task is to find "informative" directions in data space. These correspond to the column vectors of the observation (transformation) matrix and form a new coordinate system. The directions are non-orthogonal in general. (Left) Rotational invariance of the distribution of independent Gaussian random variables. A scatterplot (point cloud) drawn from two such Gaussian sources illustrates the fact that there is not enough structure in the data to find characteristic directions in data space. Algebraically, we can only estimate the linear map up to an orthogonal transformation. (Center) Point cloud generated from a non-Gaussian distribution. (Right) The data cloud contains more structure in this case, which we want to exploit. In particular, the geometric shape of the point cloud of this figure is an example of a dataset that is sparse with respect to the coordinate axes shown by the two arrows.
the inferred sources³ [16],
min I(s_1, …, s_L) ,   where
I(s_1, …, s_L) = ∫ ds_1 ⋯ ∫ ds_L  p(s_1, …, s_L) log [ p(s_1, …, s_L) / ∏_l p(s_l) ] ,
³For two stochastic variables X and Y to be independent, it is necessary and sufficient
that their mutual information equals zero:
I(X, Y) = H(X) + H(Y) − H(X, Y)
= ∫ dX dY P_{X,Y}(X, Y) log P_{X,Y}(X, Y) − ∫ dX P_X(X) log P_X(X) − ∫ dY P_Y(Y) log P_Y(Y) = 0 ,
where the quantity H(Z) is the 'differential' entropy of the random variable Z.
or, equivalently, the distance between the distribution p(s) and the fully
factorized one, ∏_l p(s_l), measured in terms of the Kullback-Leibler divergence,
KL[ p(s) ‖ ∏_l p(s_l) ]. This enables us to separate statistically independent
sources, up to possible permutations and scalings of the components [39].
The mutual information ("redundancy") can be equivalently computed as
I(s_1, …, s_L) = Σ_{l=1}^{L} H(s_l) − H(s_1, …, s_L) ,
where the first term on the rhs is the sum of the entropies of the individual
sources and the second is the joint entropy of (s_1, …, s_L). As shown by Bell &
Sejnowski (1995), independence can lead to separation because the method
exploits higher-order statistics in the data, something that cannot be done
with methods such as PCA.
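The entropy identity above is easy to check numerically. The following sketch (my own illustration, with an arbitrary 2 × 2 joint distribution) also verifies the equivalence with the Kullback-Leibler form of the redundancy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])              # illustrative joint P(X, Y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# I(X, Y) = H(X) + H(Y) - H(X, Y) ...
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
# ... equals KL[ P(X, Y) || P(X) P(Y) ], the 'redundancy'.
kl = (p_xy * np.log(p_xy / np.outer(p_x, p_y))).sum()
assert np.isclose(mi, kl)
```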
In practice, many ICA algorithms minimize a variety of ‘proxy’ func-
tionals. Bell and Sejnowski’s ICA approach uses the InfoMax principle
(Linsker, [113]), maximizing information transfer in a network of nonlinear
units (Bell & Sejnowski, [16]). Based on this, Bell and Sejnowski derive their
very successful Infomax-ICA algorithm. The sources are estimated as
s = u = Wx , (2.5)
where W is the separating (unmixing) matrix that is iteratively learned by
the rule
W ← W + η ( I − E[φ(u) uᵀ] ) W ,   (2.6)
where we have incorporated Amari et al.'s natural gradient descent approach
[2]. The vector-valued map φ(u) = (φ_1(u_1), …, φ_L(u_L)) is an appropriate
nonlinear function of the output, u, such as a sigmoidal 'squashing' function,
applied component-wise. Popular choices are the logistic transfer function,
φ(u) = 1/(1 + e^{−u}), and the hyperbolic tangent, φ(u) = tanh(u). Bell and Sejnowski
show that optimal information transfer, that is, maximum mutual information
between inputs and outputs, or equivalently maximum entropy for the
output, is obtained when steeply sloping parts of the transfer function are
aligned with high-density parts of the probability density function of the input.
The expectation operator, E[·], is in practice approximated by an average over
samples. Finally, the factor η is an appropriate learning rate.
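Rule (2.6) can be sketched in a few lines of NumPy. This is an illustration only: the Laplacian sources, the 2 × 2 mixing matrix, the tanh nonlinearity, and the learning rate are all choices made for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000
S = rng.laplace(size=(2, N))          # two super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])            # square mixing matrix
X = A @ S

W = np.eye(2)                          # initial separating matrix
eta = 0.05
for _ in range(500):
    U = W @ X                          # current source estimates u = Wx
    phi_U = np.tanh(U)                 # component-wise nonlinearity
    # Natural-gradient Infomax rule: W <- W + eta (I - E[phi(u) u^T]) W,
    # with the expectation replaced by a sample average.
    W += eta * (np.eye(2) - phi_U @ U.T / N) @ W

# After learning, W A should be close to a rescaling and permutation
# of the identity, i.e. each row dominated by a single entry.
P = W @ A
```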
Hyvarinen chooses to focus explicitly on non-Gaussianity and derives a
fixed-point algorithm, dubbed FastICA [89]. Non-Gaussianity can be quantified
using the negentropy, J,
J(u) = H(u_Gauss) − H(u) ,
where u_Gauss is a Gaussian random variable with the same covariance as u.
The FastICA algorithm maximizes an approximation of J using the estimate
J(u_l) ≈ ( E[G(u_l)] − E[G(u_Gauss)] )² ,
where G(·) is an appropriate nonlinearity, such as the non-quadratic function
G(z) = z⁴, implicitly related to the source distributions (see below),
u_Gauss is a standardized Gaussian r.v., and u_1, …, u_l, …, u_L are also of
zero mean and unit variance. The unknown sources, {u_l}_{l=1}^{L}, are again estimated
using the projections u_l = w_lᵀ x, where w_l is the l-th separating vector
(column of W), found by the iteration
w ← E[ x g(wᵀx) ] − E[ g′(wᵀx) ] w ,
where g(·) is the derivative of G(·), and w is each time rescaled as
w ← w/‖w‖. For an application of the non-Gaussianity principle to fMRI see the
Probabilistic ICA algorithm of Beckmann and Smith [15].
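The one-unit fixed-point iteration can be sketched as follows (a toy illustration of my own, with g = tanh and synthetic Laplacian sources; the iteration operates on whitened data, as the text discusses below):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20000
S = rng.laplace(size=(2, N))               # super-Gaussian sources
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])
X = A @ S
X -= X.mean(axis=1, keepdims=True)

# Whiten with the symmetric map U Lambda^{-1/2} U^T, so that the
# whitened data have identity sample covariance.
C = X @ X.T / N
lam, V = np.linalg.eigh(C)
X_w = (V / np.sqrt(lam)) @ V.T @ X

# One-unit FastICA: w <- E[x g(w^T x)] - E[g'(w^T x)] w, renormalized.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(50):
    u = w @ X_w
    w = (X_w * np.tanh(u)).mean(axis=1) - (1 - np.tanh(u) ** 2).mean() * w
    w /= np.linalg.norm(w)

u = w @ X_w    # one recovered source, up to sign and scale
```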
The ICA decomposition can be written as a factorization of an "unfolding"
matrix times a rotation matrix. The former is usually implemented
by pre-whitening (pre-sphering) the observations, such that E[x̃x̃ᵀ] = I_D,
where x̃ denotes the whitened observations:
x̃ = W_sph x .
W_sph can be computed from the eigendecomposition of the data covariance
matrix, C_xx = E[xxᵀ] ≐ UΛUᵀ, where the matrix U is a unitary matrix
containing the eigenvectors of C_xx and Λ = diag(λ_1, …, λ_D) is the diagonal
matrix of eigenvalues. Then the decomposition problem can be written
(taking the "square root" and inverting) as
x̃ = Λ^{−1/2} Uᵀ A s = W_sph A s = Ã s ,   i.e.   A = W_sph^{−1} Ã .
That Λ^{−1/2}Uᵀ spheres the data can be seen by simply performing the operations
for E[x̃x̃ᵀ], taking into account that U is an orthogonal matrix [81].
The above whitening operation transforms the original data vectors to the
space of the eigenvectors and rescales the axes by the singular values. Alternatively,
one may use UΛ^{−1/2}Uᵀ for whitening, which maps the data back to
the original space. This often makes further processing easier. In any case,
since the whitening transformation removes any second-order statistics (correlations)
in the data, learning the ICA matrix Ã is equivalent to learning a
pure orthogonal rotation matrix:
E[x̃x̃ᵀ] = Ã E[ssᵀ] Ãᵀ = Ã Ãᵀ = I .
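Both whitening maps can be checked numerically. The sketch below (with an arbitrary 3 × 3 correlating map, purely for illustration) verifies that either choice of W_sph removes all second-order correlations:

```python
import numpy as np

rng = np.random.default_rng(3)
B = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 0.7]])            # arbitrary correlating map
X = B @ rng.standard_normal((3, 10000))
X -= X.mean(axis=1, keepdims=True)

C = X @ X.T / X.shape[1]                   # sample covariance C_xx
lam, U = np.linalg.eigh(C)                 # C = U diag(lam) U^T

W_pca = np.diag(lam ** -0.5) @ U.T         # Lambda^{-1/2} U^T
W_sym = U @ np.diag(lam ** -0.5) @ U.T     # U Lambda^{-1/2} U^T

for W_sph in (W_pca, W_sym):
    Z = W_sph @ X
    # Either choice yields identity sample covariance (sphered data).
    assert np.allclose(Z @ Z.T / Z.shape[1], np.eye(3), atol=1e-8)
```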
2.3.1 Probabilistic Inference for ICA
Many classical component analysis algorithms can be interpreted under the
same probabilistic framework, as generative models. The model we consider
here is the transformation
x = As + ε , (2.7)
where an L–dimensional vector of latent variables, s, is linearly related to a
D-dimensional vector of observations via the observation operator A. Obser-
vation noise, ε, may in general be added to the observations. In other words,
the observed data is ‘explained’ by the unobserved latent variables, while the
mismatch between the observations and the model predictions, x − As, is
explained by the additive noise. The fundamental equation of ICA, which
we write again below,
P(s) = ∏_{l=1}^{L} P_l(s_l) ,   (2.8)
can be seen, in another interpretation, as a modelling assumption (i.e. a
working hypothesis): the factorization of a multi-dimensional distribution into
a product of simpler one-dimensional distributions. Classical ICA models
such as Infomax ICA and FastICA assume noiseless and square mixing. This
restriction is removed in more recent algorithms.
Remark 2 The generative model of Eqns (2.7), (2.8) defines a constrained
probability distribution in data space. Referring back to Fig. 2.1, the “arms”
of the point-cloud are oriented along the directions of the “regressors”, which
are encoded in the column vectors of the mixing matrix. Thus, when defining
and learning a probabilistic ICA model, we are in fact defining at least
three things: the source distributions, the mixing matrix, and the noise model.
Later, this remark will give us an insight into why ICA algorithms are
so successful at decomposing certain types of data, such as fMRI.
In the general, noisy and non-square mixing case, one can formulate the
penalized optimization problem (see e.g. [118], [26], [141], [169], and [163]
for a nice concise review)
ŝ = argmax_s { −(1/(2σ²)) ‖x − As‖² + Σ_{l=1}^{L} log p_l(s_l) } ,   (2.9)
assuming, for example, spherical Gaussian noise, ε ∼ N(0, σ²I_D), in order to
reconstruct the sources from the inputs at their most probable value.
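For intuition (an illustration of my own, not a derivation from the thesis): with identity mixing, A = I, and a Laplacian prior p_l(s_l) ∝ exp(−λ|s_l|), problem (2.9) decouples per component and its maximizer is the soft-thresholding operator with threshold λσ², an operation that reappears in the iterative-thresholding chapters. A brute-force numerical check:

```python
import numpy as np

def map_source(x, lam, sigma2):
    """Grid-search argmin of (x - s)^2 / (2 sigma2) + lam |s| over s."""
    grid = np.linspace(-10.0, 10.0, 200001)
    cost = (x - grid) ** 2 / (2 * sigma2) + lam * np.abs(grid)
    return grid[np.argmin(cost)]

def soft_threshold(x, t):
    """Closed-form MAP solution: shrink x toward zero by t."""
    return np.sign(x) * max(abs(x) - t, 0.0)

lam, sigma2 = 1.0, 0.5                 # illustrative prior/noise values
for x in (-3.0, -0.2, 0.05, 1.7):
    # The two solutions agree up to the grid resolution.
    assert abs(map_source(x, lam, sigma2)
               - soft_threshold(x, lam * sigma2)) < 1e-3
```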
As shown by MacKay [118] and Pearlmutter and Parra [141], Infomax-ICA
can be interpreted as a maximum likelihood model. Assuming square
mixing (i.e. as many latent dimensions as observations, L = D) and invertibility
of the mixing matrix, the separating matrix is W = A^{−1}. We can
then immediately write down the probability of the data as
p(x) = |det(J)| p(s) ,
where J is the Jacobian matrix of the transformation, with J_{l,d} = ∂s_l/∂x_d. Using
the fundamental assumption of ICA, of mutual independence of the latent
variables, p(s) = ∏_{l=1}^{L} p(s_l), and under the linear model we have
p(x) = |det(W)| ∏_l p(s_l) .
The log-likelihood of an i.i.d. data set, X = {x_n}_{n=1}^{N}, can then be written
as
L(θ) ≝ log p(X|θ) = log ( ∏_{n=1}^{N} p(x_n|θ) )
= N log |det(W)| + Σ_{n=1}^{N} Σ_{l=1}^{L} log p_l( w_lᵀ x_n ) ,
where we have substituted s_{l,n} with u_{l,n} = w_lᵀ x_n = Σ_{d=1}^{D} w_{l,d} x_{d,n}. The
parameter vector, θ, here contains the mixing matrix, A, or equivalently the unmixing
one, W = A^{−1}, since these are uniquely related here.
We can now derive a maximum likelihood algorithm for ICA via gradient
ascent, in order to learn the separating matrix, W. Taking the derivative
of L(θ) with respect to W and using well-known derivative rules we finally
find the learning rule
∂L(θ)/∂W_{l,d} = [Aᵀ]_{l,d} − z_l x_d ,
where we have used the shorthand notation z_l = φ_l(u_l), the ICA
nonlinearity being the score function of the sources, φ_l(s_l) = −(∂/∂s_l) log p_l(s_l),
with p_l(s_l) the assumed source priors. Multiplying by WᵀW, to
make the algorithm covariant [118], we get exactly the Infomax-ICA update
rule, Eq. (2.6). Note that the above multiplication is equivalent to using the
'natural gradient' approach of Amari [2], a learning algorithm based on the
concept of information geometry.
The FastICA algorithm can be interpreted as an instance of the EM
algorithm [48]. Lappalainen [103] derives it as an algorithm that filters
Gaussian noise. The general idea of the EM algorithm is to estimate the
latent variables, Y, and model parameters, θ, of a probabilistic model (which
in this case are the sources, S, and mixing matrix, A, of the BSS problem), in
two alternating steps. The 'E' (expectation) step computes the expectation
of the log-likelihood with respect to the posterior distribution p(Y | X, θ^{(τ)})
using the current (τ-th) estimate of the parameters, giving the so-called 'Q'-
function,
Q(θ | θ^{(τ)}) = E_{Y|X,θ^{(τ)}}[ log L(θ; X, Y) ] ,
which is a function of θ only, and the 'M' (maximization) step, which computes
the model parameters that maximize the expected log-likelihood,
θ^{(τ+1)} = argmax_θ Q(θ | θ^{(τ)}) .
Applying the above generic EM recipe, we can compute the maximum
likelihood estimate of the mixing matrix of our ICA model as
A = ( X E[S]ᵀ ) ( E[SSᵀ] )^{−1} ,
where the expected sufficient statistics⁴ of the sources, E[S] and E[SSᵀ],
are computed with respect to their posterior⁵. In the low-noise (σ² → 0)
and square-mixing case of FastICA, Lappalainen approximates the posterior
mean as
ŝ = E[ s | A, x, σ²I_D ] ≈ s₀ + σ² (AᵀA)^{−1} φ(s₀) ,
where s₀ ≝ A^{−1}x and the function φ(·) is defined as before. For prewhitened
data, this expression simplifies even more, since A is orthogonal and therefore
(AᵀA)^{−1} = I_L. Now Lappalainen makes the crucial observation that
while the EM algorithm has not yet converged to the optimal values, the
sources, s₀, can be written as a mixture
s₀ = α s_opt + β s_G ,
where the "noise" s_G is mostly due to the other sources not having been
perfectly unmixed. (Close to the optimal solution, we have β ≈ 0
and α ≈ 1.) Using an argument based on the central limit theorem, as
the number of the other sources becomes large he approximates the mixing
matrix corresponding to those other sources as
a_G ≈ a + σ² X_G φ(s_{0G})ᵀ / L ,
⁴A sufficient statistic is the minimal statistic that provides sufficient information about
a statistical model. Typically, the sufficient statistic is a simple function of the data, e.g.
the sum of all the data points, the sum of squares of the data points, etc.
⁵These are relationships that will become useful later in the Thesis as well.
where X_G are Gaussian-distributed "sources" with the same covariance as
X, as is done in the standard FastICA algorithm, and the sources s_{0G} are
s_{0G} ≝ aᵀX_G. Then the update equation for the mixing matrix, normalized
to unity, is
a_new = (a − a_G) / ‖a − a_G‖ ≈ σ² [ X φ(s₀)ᵀ − X_G φ(s_{0G})ᵀ ] / ( L ‖a − a_G‖ ) .
The final step that will bring us to the standard FastICA is to note that the
term X_G φ(s_{0G})ᵀ/L is equal to a s_{0G} φ(s_{0G})ᵀ/L, where the factor s_{0G} φ(s_{0G})ᵀ/L
is constant, and therefore the numerator of the update equation becomes the
standard FastICA update, a − a_G = X φ(s₀)ᵀ − c a .
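The M-step formula for A above is easy to sanity-check in isolation. In the degenerate sketch below (my own example: no noise, and the posterior statistics replaced by the true sources), the formula must recover the mixing matrix exactly; all sizes and distributions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
L, D, N = 2, 3, 1000
S = rng.laplace(size=(L, N))               # 'true' sources
A_true = rng.standard_normal((D, L))
X = A_true @ S                             # noiseless observations

ES = S                                     # stand-in for E[S]
ESS = S @ S.T                              # stand-in for E[S S^T]

# M-step: A = (X E[S]^T) (E[S S^T])^{-1}
A_hat = (X @ ES.T) @ np.linalg.inv(ESS)
assert np.allclose(A_hat, A_true)
```

With noisy data and genuine posterior moments the recovery is only approximate, but the same one-line update applies.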
While Teh [163] computes the data likelihood in a maximum likelihood
framework, Knuth [99] uses a maximum a-posteriori framework. The latter
allows us to impose constraints on the model parameters as well. This was
further explored by Hyvarinen and Karthikesh [88] in order to impose
sparsity on the mixing matrix.
Up to now we have either assumed equal number of sources and sensors
or we have implicitly assumed that their number is somehow given. Roberts
[147] derives a Bayesian algorithm for ICA under the evidence framework that
estimates the most probable number of sources as a model order estimation
problem. The evidence framework makes a local Gaussian approximation to
the likelihood conditioned on the mixing matrix, but takes into account the
local curvature by estimating the Hessian. For computational reasons, the
Hessian is approximated here by a diagonal matrix, setting the off-diagonal
elements to zero. The noise width, regarded as a hyperparameter, is computed
at its maximum likelihood value.
Finally, Choudrey et al. [36] and Miskin and MacKay [128] propose a fully
Bayesian approach to ICA using a variational, ensemble learning approach
under a mean-field approximation. They use a flexible source model based
on mixtures of Gaussians and perform model order estimation using a variety
of techniques. The variational Bayesian approach to probabilistic inference
will be further elaborated upon in Chapter 5.
We can now select an appropriate functional form for the individual
marginal distributions, pl(sl,n), based on our prior knowledge about the prob-
lem, as was done in the original formulation of InfoMax ICA of Bell & Se-
jnowski for the separation of speech signals, for example. The source model
should model the real source distributions as accurately as possible. Many
natural signals exhibit characteristic amplitude distributions, which can pro-
vide some guidance and indeed should be exploited when possible. This
allows us to utilize fixed source models in our separation algorithms. Bell
and Sejnowski, for example, use several nonlinearities (recall that these are
uniquely related to the assumed PDFs of the sources), such as 1/(1 + e^{−u_i}),
tanh(u_i), e^{−u_i²}, etc., as well as propose general-purpose 'score functions' (see
Figure 2 of Ref. [16]) in their Infomax-ICA algorithm. FastICA uses nonlinearities
such as u_i³, tanh(αu_i), u_i e^{−αu_i²/2}, and u_i². However, this is not always
possible.
possible. The problems that can arise from an incorrect latent signal model
and possible solutions are discussed in section 2.5.
As noted by Cardoso [28], however, non-Gaussianity is not the only possi-
ble route to independent component analysis, and to blind source separation
in general; other possibilities also exist, including exploiting non-stationarity
and time-correlation in signals. Such a different paradigm, sparsity, in combination
with doing away with the assumption of independence, will be explored
next.
2.4 Sparse Decompositions
2.4.1 Parsimonious representation of data
ICA works well for a variety of blind source separation problems. However,
in order for the decomposition to make sense the true sources must them-
selves be (nearly) independent. This may make sense in the separation of
voice signals that are independently generated by people with no interaction
among them, for example. Recently, another paradigm for BSS, and for inverse
problems in general, has emerged: sparsity. Sparsity refers to the property
of a representation to form compact encodings of signals, data, or functions,
using a small number of basis functions. Those basis functions are used as
“building blocks” to build more complex signals.
There has been a variety of algorithms for sparse representation, or sparse
coding, originating from the neuroscience and neural networks communities
as well as several others from a signal processing perspective. Sparse decom-
position, and ways to impose sparsity constraints, has recently been a topic of
much research in the statistics and machine learning literature.
In the study of the visual system, Field [62] proposed sparsity as an
organization principle of the visual receptive field. He conjectured that pop-
ulations of neurons optimize the representation of their visual environment
by forming sparse representations of natural scenes, a hypothesis that has
high biological plausibility since it is based on the general idea of a system
using its available resources efficiently. According to his theory, the visual
system performs efficient coding of natural scenes in terms of natural scene
statistics by finding the sparse structure available in the input. Field’s theory
directly reflects the idea of redundancy reduction of Barlow [11], [12].
Olshausen and Field [137] further test the above theory, seeking experi-
mental evidence for sparsity in the primary visual cortex (V1) by building a
Figure 2.2: The Olshausen and Field model [137] as a network. The outputs of the network are the coefficients, a_i. The symbol r(x) is the residual image, r(x) = I(x) − Σ_i a_i φ_i(x).
predictive (mathematical) model of sparse coding. In their model, images are
formed as a linear combination of local basis functions with corresponding
activations that are as sparse as possible. These bases model the V1 receptive
fields and form overcomplete sets adapted to the statistics of natural images.
Formally, the model of Olshausen and Field is described by

x_p ≈ Σ_i a_{i,p} φ_i ,

where x_p is an image “patch” (i.e. a small image window) and φ_i are the
underlying basis elements. A network representation of their model is shown
in Fig. 2.2. They proposed the following objective:
I(Φ) = min_{a_{i,p}}  Σ_p { ‖x_p − Σ_i a_{i,p} φ_i‖² + λ Σ_i log(1 + a_{i,p}²) } ,
to be minimized over bases, Φ, learned by searching for a basis that opti-
mized the sparsity of the coefficients, ai,p, (subject to the appropriate scale
Figure 2.3: Activities resulting from the model of Olshausen and Field [137]. The input image on the left is reconstructed from learned bases using the algorithm of O. & F. Note how the coefficients, a_k, resulting from the model are highly sparse, compared to reconstructing the image patch using random bases or pixel (canonical) bases.
normalization of φi). In general, the basis set can be overcomplete. That
is, the number of bases in Φ, |Φ|, can be greater than the dimensionality of
the ‘input’ data space (see for example [138]). The reason for this is that
the ‘code’ can be more sparse if one allows an overcomplete basis set. See
also Asari, [3]. As shown in Fig. 2.3 this objective results in highly sparse
distributions for the coefficients. Astonishingly, the learned receptive fields
(filters) have properties that resemble the properties of natural simple-cell
receptive fields, that is they are spatially localized, oriented and bandpass,
i.e. selective to structure at different spatial scales (Fig. 2.4).
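The objective above can be minimized over the activations by plain gradient descent; below is a minimal NumPy sketch with illustrative dimensions and step size (Olshausen and Field used a more elaborate optimization scheme, and also adapted the basis Φ, which is held fixed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def of_objective(x, Phi, a, lam):
    # squared reconstruction error + log(1 + a^2) sparsity penalty
    r = x - Phi @ a
    return float(r @ r + lam * np.sum(np.log1p(a**2)))

def of_grad(x, Phi, a, lam):
    # gradient of the objective with respect to the coefficients a
    return -2.0 * Phi.T @ (x - Phi @ a) + lam * 2.0 * a / (1.0 + a**2)

D, L, lam, eta = 8, 16, 0.1, 0.05            # overcomplete: L > D
Phi = rng.standard_normal((D, L))
Phi /= np.linalg.norm(Phi, axis=0)           # scale-normalized basis
a_true = rng.standard_normal(L) * (rng.random(L) < 0.2)   # sparse activations
x = Phi @ a_true

a = np.zeros(L)
costs = [of_objective(x, Phi, a, lam)]
for _ in range(200):                          # plain gradient descent on a
    a -= eta * of_grad(x, Phi, a, lam)
    costs.append(of_objective(x, Phi, a, lam))
```

The log(1 + a²) term penalizes many small activations more than a few large ones, which is what pushes the inferred code towards sparsity.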
In the signal processing community, Mallat and Zhang [122] proposed
a greedy algorithm, analogous to projection pursuit in statistics, called
‘matching pursuit’, that iteratively finds the best matching projections of
signals onto a fixed overcomplete dictionary of time-frequency ‘atoms’. Lin-
ear combinations of those atoms form compact representations of the given
signal.
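The matching pursuit iteration is simple to sketch; the version below is a textbook greedy loop over a random unit-norm dictionary, not Mallat and Zhang's time-frequency implementation, and all names are illustrative:

```python
import numpy as np

def matching_pursuit(x, D, n_iter=10):
    """Greedy matching pursuit over a dictionary D with unit-norm columns.

    Repeatedly pick the atom most correlated with the residual and subtract
    its projection.
    """
    r = x.copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = D.T @ r                         # correlations with all atoms
        k = int(np.argmax(np.abs(c)))       # best-matching atom
        coeffs[k] += c[k]
        r = r - c[k] * D[:, k]              # deflate the residual
    return coeffs, r

rng = np.random.default_rng(1)
Dict = rng.standard_normal((32, 64))
Dict /= np.linalg.norm(Dict, axis=0)        # unit-norm atoms
x = 2.0 * Dict[:, 3] - 1.5 * Dict[:, 40]    # exactly two active atoms
coeffs, resid = matching_pursuit(x, Dict, n_iter=30)
```

Because each step removes the component of the residual along the chosen atom, the residual norm is non-increasing, and for a signal lying in the span of a few atoms it decays quickly.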
Figure 2.4: Learned receptive fields (filters) from the sparse coding algorithm of Olshausen and Field [137]. These filters exhibit properties of simple-cell receptive fields such as locality, orientation and spatial selectivity.
Remark 3 A geometric interpretation of sparse representation is depicted
in Fig. 2.5. Each data vector can be viewed as a point in a D–dimensional
vector space, the whole dataset forming a cloud of points. We now seek a
linear transformation of the dataset such that the inferred “projections” on
to the new coordinate system defined by the column vectors of the learned
transformation matrix, A = [a_l]_{l=1}^{L}, are as sparse as possible.
Note that it is the sparseness of the components (and the selection of a suit-
able model prior) that drives learning of the new representation (unmixing)
directions. This sparseness is reflected in the shape of the point-cloud: referring
to the above figure (where D = L = 2), sparse data mapped into the
latent space produce a highly-peaked and heavy-tailed distribution for both
axes (Fig. 2.5 (right)). This is indeed a result of the sparseness property of
the dataset: the two ‘arms’ of the sparse data cloud are tightly packed around
the directions of the unmixing vectors, al. Algebraically, this means that for
a particular point, n, either the coefficient s1,n (l = 1) or the coefficient s2,n
(l = 2) is almost zero, as the particular datum is well described by the a2 or
Figure 2.5: Geometric interpretation of sparse representation. State-spaces and projections of two datasets, one sparse (right) and the other non-sparse (left), are shown. Each dataset, plotted in the measurement coordinate system, (A, B) (‘state-space’ in the terminology of Field), produces a point cloud (top part of the figure); for visualization purposes, both observation and latent dimensionalities are equal to D = L = 2 in this figure. By projecting the point clouds on to each coordinate we can produce the corresponding empirical histograms of ‘state’ amplitudes. We now seek a (linear) transformation to a latent space, (A′, B′), that optimizes some suitable criterion (this is shown in the bottom part of the figure). Sparse data mapped in the latent space produce heavy-tailed distributions for both latent dimensions. (Figure from D. J. Field, Phil. Trans. R. Soc. Lond. A, 1999.)
the a1 “regressor”, respectively. On the contrary, non-sparse data will typi-
cally produce a projection that corresponds to a “fat” empirical histogram,
as shown in Fig. 2.5 (lower-left).
With respect to the soft clustering view of component analysis (Miskin,
[94]), discussed in the Introduction of the Thesis, if the data vectors are
sufficiently sparse, their images on the unit hypersphere S^{D−1} (i.e. the radial
sections of their position vectors with the unit hypersphere, mapped as
x_n ∈ E^D ↦ x̂_n ∈ S^{D−1}, where the operator û = u/‖u‖ maps vectors along
their radii) concentrate around the unit vectors {a_l}_{l=1}^{L}; see Fig. 2.6 [176]. While
Miskin did not use this property per se for sparse decomposition, one can
design separation algorithms that exploit it [111]. We shall revisit this in the
applications section of chapter 3.

Figure 2.6: Clustering of a sparse set of points on the unit hypersphere S^{D−1}. The points cluster around the direction vectors corresponding to the columns of the mixing matrix. (Figure was taken from http://cs.anu.edu.au/escience/lecture/cg/Illumination/.)
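The clustering picture of Fig. 2.6 can be sketched numerically: mix very sparse sources, map the data onto the unit sphere, and recover the mixing directions with a small k-means that clusters by absolute cosine similarity, absorbing the ±a_l sign ambiguity. All names and parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Very sparse sources: most samples have at most one active source, so the
# normalized data concentrate around the (sign-ambiguous) columns of A.
L, N = 2, 2000
S = rng.laplace(size=(L, N)) * (rng.random((L, N)) < 0.15)
A = np.array([[1.0, 0.6],
              [0.2, 1.0]])
A /= np.linalg.norm(A, axis=0)                  # unit-norm mixing directions
X = A @ S

keep = np.linalg.norm(X, axis=0) > 1e-6         # discard all-zero samples
U = X[:, keep] / np.linalg.norm(X[:, keep], axis=0)   # map onto S^{D-1}

# Tiny k-means on the sphere; clustering by |cosine| folds x and -x together.
C = np.eye(L)                                   # deterministic initial centres
for _ in range(50):
    labels = np.argmax(np.abs(C.T @ U), axis=0)
    for k in range(L):
        Uk = U[:, labels == k]
        signs = np.sign(C[:, k] @ Uk)           # fold onto a half-sphere
        m = (Uk * signs).mean(axis=1)
        C[:, k] = m / np.linalg.norm(m)
```

Up to permutation and sign, the learned centres C align closely with the columns of A, which is the geometric basis of the clustering-based estimation used by Li et al. [111].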
2.4.2 Sparse Decomposition of Data Matrices
Inspired by the model of Olshausen and Field, Donoho [50] comments on
the connection and differences between the two lines of research: indepen-
dent component analysis and sparse decompositions. He promotes the idea
of sparsity, overcompleteness, and optimal atomic decompositions as a better
goal than independence. He provides a rationale for why sparsity is a more
plausible principle, being “intrinsically important and fundamental”, due to
both biological and modelling reasons. Regarding the former, he too cites
the extremely efficient sparse representation achieved by the human visual
system, and its higher compression performance compared to the best engi-
neered systems. With respect to the latter, he notes that independence is
inherently a probabilistic assumption, and of unknown interpretability (with
respect to vision) because natural images are composed by occlusion. Occlu-
sion inevitably creates dependent components. He finally suggests that one of
the future challenges of ‘sparse components analysis’ would be to search over
spaces of objects of much larger scale than the image patches of Olshausen
and Field.
It turns out (see Olshausen, [138]) that the Infomax-ICA algorithm be-
comes, in fact, a special case of the sparse linear algorithm of Olshausen and
Field when there is an equal number of basis functions/latent dimensions
and inputs, the φis are linearly independent, and there is no observation
noise. In this case, there is a unique set of coefficients ai that is the root
of ‖X −Φa‖, and we can write a as a = WX, where W = Φ−1 (note that
based upon the above assumptions, Φ becomes invertible). If, in addition,
the ICA nonlinearity is chosen to be the cumulative density function of the
sparse components, then the sparse algorithm gives exactly the algorithm of
Bell and Sejonwski. The point here is actually to show that sparsity con-
straints can lead to separation. Indeed, as pointed out by Li, Cichocki and
Amari [111],
Remark 4 Sparse decompositions of data matrices can be used for the blind
source separation problem.
They provide various examples from simulations and EEG data analysis that
demonstrate the performance of sparse decompositions in signal separation.
Li, Cichocki and Amari performed a sophisticated mathematical analysis for
the case of sparse representation of data matrices under the ℓ1 prior, for given
basis matrices. They tackle the two-step decomposition problem of learning
the basis matrix first, via clustering, and then estimating the coefficients of
the decomposition. If X is a data matrix and A = [a_l] is a given basis, Li
et al. start from the mathematical model shown below:

min_S  S(S) = Σ_{l=1}^{L} Σ_{n=1}^{N} |s_{l,n}|   subject to   AS = X ,    (2.10)

with S(·) the sparsity function on the sources. This particular optimization
problem can then be solved using linear programming. While the
ℓ0–norm solution is the sparsest one in general, its optimization is a non-
trivial combinatorial problem. Li et al. show that, for sufficiently sparse
signals, the solutions to the problem of sparse representation of data matri-
ces that are obtained using the ℓ0 and ℓ1 norms are equivalent. This fact
was previously shown by Donoho and Elad [51] but Li et al. give a less strict
sparseness ratio (i.e. the ratio of zero versus non-zero elements).
Li et al. [111] also show that the above problem has a unique solution.
While in general there would be an infinite number of solutions for the un-
derdetermined system of equations
As = x ,
where the D × L matrix A (observation operator) with L > D maps the
unknown signal s into the observed signal x, the sparsity constraint makes
the particular linear inverse problem well-posed. A geometric interpretation
of why ℓ1–type sparsity regularization works well for signal recovery under
sparsity constraints is shown in Fig. 2.7. We want to find the optimal s as
the minimum ℓ1-norm vector that satisfies the constraint x = As, i.e. such that
the hyperplane does not intersect the interior of the ℓ1 ball. More generally, the problem
Figure 2.7: A geometric intuition into sparse priors. We seek the sparsest vector x ∈ R^N, under the ℓ1 norm in this case, that satisfies the linear constraint y = Φx, where Φ is a dictionary. The ℓ1 penalty corresponds geometrically to the rhomboid (‘ℓ1 ball’) and the linear constraint to the hyperplane. The shape of the rhomboid dictates the form of the solution. The optimal vector, x, is the one that touches the hyperplane, without the latter intersecting the rhombus. As can be seen from the figure, the ℓ1 norm necessarily drives all coordinates but one towards zero, leading to sparse solutions. (Figure from Baraniuk [9].)
can be stated (in the deterministic framework) as:

min_s ‖s‖₁  subject to  ‖As − x‖ < c
(Chen and Haykin, [33]), where x can be a “corrupted” (noisy, blurred, etc)
version of the original signal and c is a positive scalar constant that plays
a role similar to the noise variance in the probabilistic framework (Li et
al., [111]). In this case, the hyperplane becomes an orthotope (hyperrect-
angle), defining a “zone” in which the vertex of the ℓ1 ball must fall. Li et
al. [111] use k–means clustering to get an estimate of the basis, which is then
used in a linear programming algorithm in order to estimate the coefficients
of the representation.
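The second, linear-programming stage can be sketched with the standard split s = u − v, u, v ≥ 0, which turns the ℓ1 objective into a linear one. This is a generic basis-pursuit-style sketch using SciPy, not Li et al.'s implementation, and all names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def l1_decompose(A, x):
    """min ||s||_1 s.t. As = x, via the standard LP split s = u - v, u, v >= 0.

    The l1 objective becomes the linear objective sum(u) + sum(v).
    """
    D, L = A.shape
    c = np.ones(2 * L)
    A_eq = np.hstack([A, -A])                # encodes A(u - v) = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None))
    assert res.success
    u, v = res.x[:L], res.x[L:]
    return u - v

rng = np.random.default_rng(3)
D, L = 8, 12                                 # underdetermined: L > D
A = rng.standard_normal((D, L))
s_true = np.zeros(L)
s_true[[2, 7]] = [1.5, -2.0]                 # sufficiently sparse signal
x = A @ s_true
s_hat = l1_decompose(A, x)
```

By construction, the returned s_hat is feasible and has ℓ1 norm no larger than that of any other feasible vector, including the true sparse signal; for sufficiently sparse signals the two coincide, as discussed above.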
Lewicki and Sejnowski [109] introduce a probabilistic method for sparse
overcomplete representations. A Laplacian prior on the coefficients of the
basis was used, p(s_l) ∝ e^{−θ|s_l|}, enforcing parsimonious representations. They
then propose a gradient optimization scheme for maximum a-posteriori (MAP)
learning. For the linear model x = As + ε, with Gaussian observation noise
with variance σ2, we seek the most probable decomposition coefficients, s,
such that
ŝ = argmax_s p(x|A, s) p(s) .    (2.11)
The probability of a single data point is obtained by integrating out the
unknown signals, s:
p(x|A) = ∫ ds p(s) p(x|A, s) .
In order to derive a tractable algorithm, they make a Laplace approximation
to the data likelihood, by assuming that the posterior is Gaussian
around the posterior mode. This involves computing the Hessian

H = −∇_s∇_s log[p(s) p(x|A, s)] = (1/σ²) AᵀA − ∇_s∇_s log p(s) .

To make a smooth approximation of the derivative of the log-prior, and a
diagonal approximation to the Hessian, they then take p(s_l) ∝ cosh^{−θ/β}(βs_l),
which asymptotically approximates the Laplacian prior for β → ∞. Moreover,
a low noise level is assumed. The above approximations finally lead to the
gradient learning rule

ΔA = AAᵀ ∇_A log p(x|A) ≈ −A(I + zsᵀ) ,

where, again, z_l = ∂ log p(s_l)/∂s_l. Note that this has the same functional
form as the Infomax-ICA learning rule, however the basis matrix is generally
non-square here. In contrast to the standard ICA learning rule, where the
sources are estimated simply by s = Wx, where the unmixing matrix is
W = A−1, here we must use a nonlinear optimization algorithm in order to
estimate the coefficients, using Eq. (2.11). Due to the low-noise assumption,
the level of the observation noise is not estimated from the data and has
to be set manually. Lewicki and Sejnowski’s algorithm, however, is faster in
obtaining good approximate solutions than the linear programming method
and is more easily generalizable to other priors.
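The MAP estimation step of Eq. (2.11) can be sketched with a generic smooth optimizer, using the log cosh smoothing of the Laplacian prior mentioned above. Parameter values are illustrative, and Lewicki and Sejnowski use a tailored gradient scheme rather than the off-the-shelf optimizer below:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
D, L = 8, 16
A = rng.standard_normal((D, L))
A /= np.linalg.norm(A, axis=0)
sigma2, theta, beta = 1e-3, 1.0, 50.0        # illustrative values

s_true = np.zeros(L)
s_true[[1, 9, 14]] = [2.0, -1.0, 1.5]
x = A @ s_true + np.sqrt(sigma2) * rng.standard_normal(D)

def logcosh(z):
    # numerically stable log(cosh(z))
    z = np.abs(z)
    return z + np.log1p(np.exp(-2.0 * z)) - np.log(2.0)

def neg_log_post(s):
    # -log p(x|A,s) - log p(s): Gaussian likelihood + smoothed Laplacian prior
    r = x - A @ s
    return r @ r / (2.0 * sigma2) + (theta / beta) * np.sum(logcosh(beta * s))

res = minimize(neg_log_post, np.zeros(L), method="L-BFGS-B")
s_map = res.x
```

Since both the quadratic data term and the log cosh penalty are convex, the optimizer reaches the global MAP estimate; the expense of this inner optimization at every data point is exactly the cost the text contrasts with the simple s = Wx of square, noiseless ICA.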
Girolami [76] proposes a variational method for learning sparse representations.
In particular, his method offers a solution to the problem of analytically
integrating the data likelihood, for a range of heavy-tailed distributions.
Starting from the heavy-tailed distribution p(s) ∝ cosh^{−1/β}(βs), he derives a
variational approximation to the Laplacian prior by introducing a variational
parameter, ξ = (ξ_1, . . . , ξ_L), such that the prior p(s) = ∏_{l=1}^{L} exp(−|s_l|)
becomes p(s; ξ), with s|ξ ∼ N(s; 0, Λ) and Λ = diag(|ξ_l|). Then p(s) is the
supremum

p(s) = sup_ξ [ ∏_{l=1}^{L} ϕ(ξ_l) ] N(s; 0, Λ) ,

with ϕ(ξ) = exp(−|ξ|/2) √(2π|ξ|), as β → ∞. The above is derived using a
variational argument and using convex duality [95], [139]. In essence, what
this approximation means is that, at each point of its domain, the intractable
prior is lower-bounded tightly by a best-matching Gaussian with width pa-
rameter ξ, with this variational parameter being estimated by the algorithm
along with the model parameters. Using the above, the posterior takes a
Gaussian form. This enables him to derive an EM algorithm in order to infer
the sparse coefficients and learn the overcomplete basis vectors of the repre-
sentation. Girolami applies his sparse representation algorithm to the problem
of overcomplete source separation and achieves superior results compared to
the algorithm of Lewicki and Sejnowski.
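The variational bound can be checked numerically: every value of ξ gives a Gaussian lower bound ϕ(ξ)N(s; 0, |ξ|) on the Laplacian density e^{−|s|}, with the supremum attained at |ξ| = |s|. A small sketch:

```python
import numpy as np

def gauss(s, var):
    return np.exp(-s**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def phi(xi):
    # variational factor: phi(xi) = exp(-|xi|/2) * sqrt(2*pi*|xi|)
    return np.exp(-np.abs(xi) / 2.0) * np.sqrt(2.0 * np.pi * np.abs(xi))

s = 1.7
xis = np.linspace(1e-3, 10.0, 20000)
bounds = phi(xis) * gauss(s, xis)     # Gaussian lower bounds on exp(-|s|)
laplace_val = np.exp(-abs(s))
```

Algebraically this is just the bound |s| ≤ s²/(2ξ) + ξ/2, with equality at ξ = |s|; being able to replace the intractable prior by the best-matching Gaussian at each point is what makes the posterior Gaussian and the EM algorithm tractable.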
The problem of sparsely representing a data matrix described above is
a special case of the more general problem of recovering latent signals that
themselves have a sparse representation in a signal dictionary (Zibulevsky
et al., [176]). Many real-world signals have sparse representations in a
proper signal dictionary but not in the physical domain. The discussion
in Zibulevsky et al. is motivated by starting from the case of representing
sparse signals in the physical domain, depicted in Fig. 2.5, and then noting
that the intuition there carries over to the situation of sparsely recovering sig-
nals in a transform domain. The latter problem is one of the main problems
dealt with in this Thesis and is elaborated in Chapter 3.
2.5 The Importance of using Appropriate
Latent Signal Models
Many classical ICA algorithms, such as Infomax-ICA and FastICA, allow
the plug-in setting of the respective nonlinearity function in the system, as
mentioned above. For successful separation, the form of the nonlinearity
Figure 2.8: Effect of an incorrect source model specification. Left: true distribution; Middle: hypothesized distribution; Right: estimated distribution. (Figure adapted from Cardoso [27].)
must somehow match, as far as possible, the underlying (unknown) statistical
properties of the sources, such as their super- or sub-gaussianity. This was
first stated as “matching the neuron’s input-output function to the expected
distribution of the signals” in [16]. Since the estimating equations for the
mixing matrix and sources are coupled, the functional form of the nonlinear-
ity is critical for their correct estimation: an incorrect choice of nonlinearity
will lead to an incorrect estimation of the (un-)mixing matrix, which will map
the observations back to the source space incorrectly, etc. Cardoso [27] gives
a compelling example of how estimation can go wrong. Another example of
how classical ICA fails in separating sources in an image processing context
is given in Fig. 2.9 (from Tonazzini et al., [167]).
Remark 5 Tonazzini et al. use a Markov random field in order to impose
an image prior. However, the images of Fig. 2.9 (left) are actually also
prime examples of sparse sources. In [77] and [46], an extensive study of
how justified and robust ICA algorithms are for functional MR imaging of
the brain was conducted, and various simulations of fMRI “brain” activations
under well-controlled situations with shapes similar to those of ref. [167] were
performed that highlighted the need for alternative decomposition algorithms
that are effective for fMRI, based on sparsity. These are also studied in the
experimental section of Chapter 3 using a proposed new algorithm for sparse
BSS.
Figure 2.9: Effect of an incorrect source model specification for a blind image separation problem. Left: true sources (s_1, s_2); Middle: noisy mixtures (x_1, x_2); Right: estimated sources from ICA. The model clearly fails to recover the sources. In particular, one of the sources is not recovered at all.
It can be shown that the Infomax-ICA as well as the FastICA algorithms
are instances of maximum likelihood estimation [118], [141], [86], [103].
Under this interpretation, one can see that the nonlinearity, φ(·), is actually
the logarithmic derivative of the (hypothesized) probability density of the
sources (the ‘score’ function): for the l–th source, sl,
φ_l([Wx]_l) = − ∂ log p_l(s_l) / ∂s_l = − p_l′(s_l) / p_l(s_l) ,
where the symbol W denotes the separating operator from observation space
to source space and x is an observation. In other words, in a perfect match
the Infomax squashing nonlinearity is exactly the cumulative distribution function of the sources.
Of course we do not know the actual source PDFs, since the sources them-
selves are unobserved, but we may try to estimate them from the data. For
this purpose, we can employ a parameterized model source PDF, p_l(s_l; θ_{s_l}),
and learn, instead of fix, its parameters, θ_{s_l}, from the data. A flexible prior
that is at the same time mathematically tractable is a mixture distribution.
Lawrence and Bishop [105] use a MoG prior for ICA, albeit in a fixed form.
Attias [6] has used mixture of Gaussians (MoGs) as source models for blind
source separation under a maximum likelihood framework, leading to a flex-
ible algorithm dubbed ‘Independent Factor Analysis’ (IFA). Choudrey et al.
[36] and Lappalainen [102] use the same prior under an ensemble learning
approach, i.e. with a factorized posterior (the so-called ‘naive’ mean-field
method).
The l–th model source density, for an M_l–component model, can be expressed
mathematically as a linear combination of component densities,

p_l(s_{l,n} | θ_{s_l}) = Σ_{m_l=1}^{M_l} π_{l,m_l} N(s_{l,n}; μ_{l,m_l}, β_{l,m_l}^{−1}) ,    (2.12)

where (μ_{l,m_l}, β_{l,m_l}) are the mean and precision (inverse variance) parameters
of the m_l–th Gaussian component over s_{l,n} and {π_{l,m_l}}_{m_l=1}^{M_l} are mixing
proportions. Since the π_{l,m}’s are non-negative, the MoG is also a convex combination
of Gaussians. An example of a MoG distribution is shown in Fig. 2.11, which
models the natural image in Fig. 2.10.
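Eq. (2.12) is straightforward to evaluate numerically; the sketch below builds a three-component MoG on [0, 1]-scaled intensities (the parameter values are illustrative, not those actually fitted to ‘Cameraman’):

```python
import numpy as np

def mog_pdf(s, pis, mus, betas):
    """Mixture-of-Gaussians density as in Eq. (2.12); betas are precisions."""
    s = np.atleast_1d(np.asarray(s, dtype=float))[:, None]
    comps = np.sqrt(betas / (2.0 * np.pi)) * np.exp(-0.5 * betas * (s - mus)**2)
    return comps @ pis

# Three components on [0, 1]-scaled intensities (illustrative values):
pis   = np.array([0.5, 0.3, 0.2])     # mixing proportions, sum to 1
mus   = np.array([0.1, 0.5, 0.9])     # component means
betas = np.array([400.0, 100.0, 400.0])  # precisions (1 / variance)

grid = np.linspace(-1.0, 2.0, 3001)
p = mog_pdf(grid, pis, mus, betas)
```

With narrow components centred inside [0, 1], the density places practically zero mass outside that range, mirroring the behaviour described for Fig. 2.11.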
Each component in the mixture distribution6 may be interpreted as a
‘state’: for the n–th data instance, the state ξ_{l,n} = m_l ∈ {1, . . . , M_l} is
probabilistically chosen with prior probability p(ξ_{l,n} = m_l) = π_{l,m_l}. The
set θ_S = ⋃_l θ_{s_l} = {μ_{l,m_l}, β_{l,m_l}, π_{l,m_l}}_{m_l=1}^{M_l} are the source model parameters
that we shall learn. The collective state variable, ξ = (ξ_1, . . . , ξ_L), lives in
a state-space that is the Cartesian product of the state sets corresponding
to the individual sources: ξ ∈ Ξ = {1, . . . , M_1} × · · · × {1, . . . , M_L}, with
M = |Ξ| = ∏_l M_l. For high-dimensional problems this model may become
6Note the difference between a mixture of probability densities, p = π_1 p_1 + π_2 p_2 + . . ., and a mixture of random variables, x = a_1 s_1 + a_2 s_2 + . . ., in this context.
Figure 2.10: ‘Cameraman’ standard test image. The pixel values were scaled to be in the range [0, 1].
Figure 2.11: Density modelling using the MoG model. Empirical histogram (left) and estimated PDF by the MoG (right) of the real-space pixel values of the ‘Cameraman’ image. Three Gaussian components were used for this image, denoted by the dashed curves. The resulting mixture density is denoted by the thick black curve. The adaptive image prior correctly infers the PDF. In particular, notice how the MoG assigns practically zero probability measure outside the range [0, 1]. This model can be estimated by an EM algorithm, introduced in Sect. 2.3.1.
computationally demanding, since the size of the product space can be very
large, and in this case a variational approximation can be utilized [6].
Remark 6 Going back to the ‘Cameraman’ image and its MoG model, we
note that in general, the number of components, Ml, should ideally reflect
the number of “important” objects in the image. However, since the MoG
density model is a map from physical (geometric) space to intensity frequency
space, the geometric information is lost. Therefore, the above is feasible only
if the various objects have quite different radiometry. But see section 5.2.2,
where the MoG is used as a feature-space density model.
In addition to multimodal distributions, a MoG density can also be used
for modelling sparsity. This latent signal model will be elaborated in sec-
tion 5.2.2, where a fully Bayesian model for sparse matrix factorization is
developed.
2.6 Data modelling: A Bayesian approach
In order to obtain the distribution of the data given the model parameters
we must ‘integrate out’ the latent variables,
p(x|A, σ²) = ∫ ds p(s) p(x|s, A, σ²) ,    (2.13)

where the second factor in the integral, the likelihood of the data under the
model, captures the stochastic map s ↦ x given the parameters of the model,
(A, σ²). Unfortunately, this integration is in general intractable. There are
several ways to approximate p (x|A, σ2), for example by a Laplace approxi-
mation, introducing additional variational parameters, using mixture models,
or by sampling methods.
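The integral in Eq. (2.13) is easy to appreciate numerically. For a Gaussian prior the marginal happens to be available in closed form, which lets us check a plain Monte Carlo estimate against it; for the sparse priors of interest no such closed form exists, hence the approximations listed above. All dimensions and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
D, L, sigma2 = 2, 3, 0.5
A = rng.standard_normal((D, L))
x = rng.standard_normal(D)

# p(x | s, A, sigma^2): isotropic Gaussian noise model, evaluated for many s
def likelihoods(S):
    R = x - S @ A.T                       # residuals, one row per candidate s
    return np.exp(-np.sum(R**2, axis=1) / (2 * sigma2)) / (2 * np.pi * sigma2) ** (D / 2)

# Monte Carlo: average the likelihood over draws from the (Gaussian) prior
S = rng.standard_normal((100_000, L))
mc = likelihoods(S).mean()

# With a Gaussian prior the marginal is exactly N(x; 0, A A^T + sigma^2 I)
Sigma = A @ A.T + sigma2 * np.eye(D)
exact = np.exp(-0.5 * x @ np.linalg.solve(Sigma, x)) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))   # (2*pi)^{D/2} with D = 2
```

The sampling estimate converges to the exact marginal only slowly, which illustrates why naive Monte Carlo becomes impractical in the high-dimensional settings discussed below.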
Bayesian inference provides a powerful methodology for CA [147], [99]
by incorporating uncertainty in the model and the data into the estimation
process, thus preventing ‘overfitting’ and allowing a principled method for
using our domain knowledge. The outputs of fully Bayesian inference are
uncertainty estimates over all variables in the model, given the data.
Bayes’ theorem provides a “recipe” for updating our prior assignments in
order to incorporate the observed data:
p(M|D; I) = p(D|M; I) p(M|I) / p(D|I) ,    (2.14)
where M is a model, D is the data, and I is any background information
we may have. The probability p(M|I) is the prior for M, capturing our
knowledge or expectations about the problem and encoding our “weight”
assignments about the plausible range of values that variables in the model
can take prior to observing any data, p(D|M; I) is the likelihood, that is
the probability of the data under our model, and p(M|D; I) is the posterior,
that is the probability of the model after we have observed the data. In the
Bayesian approach to data modelling, a ‘model’ is the collection of likelihood
and prior specifications for all random variables in the model, along with any
functional relationships among them.
The denominator, p(D|I), is called the ‘evidence’, or marginal likelihood,
and it is a very important quantity: besides it being a normalizing constant,
it captures all the essential information of the probabilistic system7. This
can be better seen if we write it as
p(D|I) = ∫ dM p(M|I) p(D|M; I) .    (2.15)
7The Bayesian evidence is the equivalent of the ‘partition function’, usually denoted by Z, in Statistical Mechanics. From it several other quantities, such as the expected energy, can be derived.
Besides its conceptual elegance, in practical terms Bayes’ theorem offers us a
framework for learning from data as inductive inference. Bayesian modelling
offers the following advantages:
• It forces us to make all our modelling decisions explicit.
• It is a consistent framework for learning under uncertainty.
• It offers modularity, since, due to its probabilistic foundation, models
can be combined or extended in a hierarchical fashion.
• Bayesian models can be extended to learn various hyperparameters,
controlling the distributions of other quantities. Indeed, one may start
from a simpler model (e.g. one whose properties are well known) and
extend it afterwards.
The notion of sparsity also meshes well with the view of parsimony in the
Bayesian approach to data analysis, which automatically embodies the notion
of ‘Ockham’s razor’ (MacKay, [116]). This is analogous to the information
theoretic notion of ‘minimum description length’ (MDL) of Rissanen [146]
and offers a way of measuring and controlling complexity. This has effects
on modelling since, as pointed out by Neal [131], in the Bayesian framework
one should not artificially constrain analysis to simple models, as the use
of hierarchical modelling readily offers a mechanism for complexity control.
From a modeling point of view, these are also tools that can be used in order
to avoid ‘overfitting’ in the absence of detailed information (‘ignorance’) and
in the presence of noise.
Now, in Bayesian terms, inference is calculating the posterior probability
of the latent variables in a probabilistic model and learning is calculating the
posterior over the model parameters. Since in full Bayesian modelling both
latent variables and parameters get a prior distribution and are dealt with on an
equal footing, these two computational processes are essentially the same.
Bayesian inference and learning, however, requires performing integra-
tions in high-dimensional spaces that, if tackled by sampling methods like
Markov chain Monte Carlo, can be very time-consuming, or even practically
impossible, for large data sets. A mathematically elegant alternative, pro-
viding many of the benefits of sampling in terms of accuracy without the
computational overhead, is offered by the variational Bayesian method [7].
This is the inference method that we are going to use in this text.
This section provided a high-level overview of the philosophy and basic
principles of the Bayesian approach. More detailed explanations will be given
in the subsequent chapters and specific Bayesian methods will be introduced
on an as-needed basis.
Chapter 3
An Iterative Thresholding
Approach to Sparse Blind
Source Separation
3.1 Blind Source Separation as a Blind In-
verse Problem
Many problems in signal processing, neural computation, and machine learn-
ing can be stated as inverse problems, when the quantities of interest are not
directly observable but rather have to be inferred from the data [33]. Ex-
amples include signal deconvolution, density estimation, as well as many
problems in neuroimaging. Inversion procedures involve the estimation of
the unobserved generative variables (states, sources, or “causes”, in an ab-
stract sense) and/or parameters of a system from a set of measurements. If
we collectively denote the space of unknowns by X and the data space by
Y , we can symbolically write the process as shown in Fig. 3.1, where the
Forward problem
Inverse problem
Causes Effects
Figure 3.1: Conceptual scheme for forward/inverse problems.
rightwards arrow denotes the generative direction and the leftwards arrow
the direction of the inversion. (This is to be contrasted with computing
the data given their generators, which is called the ‘forward’ problem.) In
this Chapter we emphasize the fact that blind signal separation, the separa-
tion of mixtures into their constituent components, is inherently an inverse
problem, and that its solution can be derived from the general principles of
this class of problems. In particular, we will focus on the problem of blind
signal separation under sparsity constraints.
The most important characteristic of inverse problems is that they are
typically ill-posed (“naturally unstable”) [92]. A problem is said to be well
posed, in the Hadamard sense, if there is existence, uniqueness, and conti-
nuity (stability) of the solutions. If any of these conditions is not met, a
problem is called ill-posed. A critical consequence is that in inverse problems
even a minute amount of noise or error might get exceedingly amplified by
the inversion, if simple-minded, producing nonsensical results.
At first sight, solving an inverse problem might seem like a hopeless task
mathematically. However, a lot can be gained by appropriately constraining
it, as this forces the inversion algorithm to stay in meaningful regions of
the space of unknowns. What “meaningful” is depends, of course, on the
particular problem domain. Knowledge of special properties of the problem,
captured in the form of prior constraints, can be extremely helpful in these
analyses, as they lead to regularization, making the problem well-posed [25],
[33].
In classical regularization theory (Tikhonov and Arsenin, [165]), unique,
stable solutions are sought by requiring certain properties such as smooth-
ness and/or simplicity. The general idea there is to formulate optimization
functionals, also known as Lagrangians, of the form

I = ‖Ax − y‖² + λ ‖Bx‖² ≡ E_D + λ E_P ,    (3.1)

where the penalty term, ‖Bx‖², captures our prior knowledge or expectations
about the solution, x ∈ X , by penalizing unwanted properties of the solution,
controlled by the operator B, while the first term represents the data-fit.
The ‘forward’ linear operator, A, maps the unknowns to the observation
space. The λ factor is a Lagrange multiplier, balancing the effect of the two
terms on the solution1. This idea has found numerous applications in science
and engineering2. See also Chen and Haykin [33] for an excellent review of
regularization theory in the context of learning and neural networks.
1To see how regularization can stabilize the problem, rewrite the functional I of Eq. (3.1) as (see [25])

‖y − Ax‖² + λ‖Bx‖² = ‖ [y; 0] − [A; √λ B] x ‖² ,

where [y; 0] and [A; √λ B] denote vertical stacking. The optimizer of the functional is then the (least-squares) solution of the linear system

[A; √λ B] x = [y; 0] .

Under certain mild conditions on the matrix B (such as that the norm ‖Bx‖ is bounded), the augmentation of the matrix A by √λ B considerably reduces the ill-posedness of the problem.

2In computer vision, for example, Poggio [142], Blake and Zisserman [18], and many others have addressed problems in early vision, such as shape from shading, surface reconstruction, etc. [18], by formulating regularized functionals of the form of Eq. (3.1). Girosi et al. [59] pioneered the use of regularization theory in the machine learning community.
Blind Source Separation
In the blind source separation (BSS) problem, a set of sensor signals re-
sults from mixing a set of unobserved ‘sources’ and is possibly contaminated
by noise. The goal is to recover/reconstruct the sources (‘unmix’ the ob-
servations) without any, or very little, knowledge of the sources themselves,
the mapping from sources to observations, or the noise process [98]. By its
definition, BSS is therefore inherently a multidimensional inverse problem.
The ‘cocktail party problem’, regarding the separation of mixed speech
signals from people talking simultaneously in a room, described in the Intro-
duction, is the prototypical BSS problem. Many other interesting problems in
biomedical signal processing, financial time-series analysis, and telecommu-
nications can also be cast as BSS problems [87]. In functional neuroimaging,
McKeown et al. [125] proposed the decomposition of spatio-temporal fMRI
data into independent spatial components and their associated time-courses
of activation using BSS techniques. Since then, numerous studies in fMRI
have been conducted under this decomposition framework, with considerable
success. BSS offers a powerful alternative to ‘model-based’ approaches, such
as the general linear model (GLM), for exploratory (data-driven) data anal-
yses of complex data sets for which precise information about the temporal
or spatial structure of the sought-for processes is either not available or is
itself under investigation.
Mathematical Model of Source Separation
In order to make the above abstract description of BSS mathematically and
computationally tractable, one typically needs to formulate a model of the
process —a model is a formal way of “thinking about data”— as well as an
algorithm for its inversion. A simple observation model, which nevertheless
works surprisingly well in a wide range of cases, assumes that the data is a
linear mixture of underlying sources with an additional ‘noise’ term: for the
dth sensor and the nth data point,
∀d : x_d[n] = μ_d + ∑_{l=1}^{L} A_{dl} s_l[n] + ε_{x_d}[n], ∀n ,   (3.2)
where {x_d}_{d=1}^{D} are the D sensor signals, {s_l}_{l=1}^{L} the L unobserved sources,
μ_d is the mean of the dth observation, and ε_{x_d} is the observation noise, or
error, term. That is, the models we consider here assume a noisy linear
transformation but with the crucial difference that the observation operator,
A = [A_{dl}], called ‘mixing matrix’ in BSS, is also unknown (hence the
characterization “blind” for this particular class of inverse problems). Indeed, we
can think of problems such as these as multiple regression with missing in-
puts (the sources). Collecting all signals, for all sensors and sources, together
the linear BSS model can be written in matrix form as
X = μ 1_N^T + A S + E_X ,   (3.3)
with X ∈ RD×N , S ∈ RL×N , µ ∈ RD and A ∈ RD×L, where the indices d
and n now denote the rows and columns of the data matrix X, respectively.
We emphasize that we aim at estimating mixing matrices of arbitrary size
D × L, not just square ones, here. We will assume that the observations
have been de-meaned and fix μ at the empirical mean of the observations,
μ ≐ X̄ (assuming zero-mean sources; see next)3 in this chapter. The image of S
under the linear operator A can be thought of as the “clean” data, X̂, which
is subsequently corrupted by EX. Source separation algorithms in general
3 That is, X is replaced by X − μ1_N^T, and we shift the observation coordinate system to the data centroid. The mean can always be added back if required. From now on, the symbol X will be taken to implicitly mean the de-meaned observations, unless stated otherwise.
may, or may not, include an explicit model for the noise.
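As an illustration of the generative model of Eq. (3.3) (a toy sketch, not from the thesis; the dimensions, source distribution, and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
D, L, N = 4, 3, 1000                    # sensors, sources, samples (toy sizes)

S = rng.laplace(size=(L, N))            # zero-mean, heavy-tailed sources
A = rng.normal(size=(D, L))             # the (unknown; here simulated) mixing matrix
mu = rng.normal(size=(D, 1))            # per-sensor means
E_X = 0.05 * rng.normal(size=(D, N))    # additive observation noise

X = mu + A @ S + E_X                    # Eq. (3.3): X = mu 1_N^T + A S + E_X

# De-meaning, as assumed in the text: shift to the data centroid.
X_centered = X - X.mean(axis=1, keepdims=True)
assert np.allclose(X_centered.mean(axis=1), 0.0)
```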
The blind source separation problem is clearly ill-posed, since both fac-
tors, Adl and sl(t), in equation (3.2) are unknown, and its solution is non-
unique as it stands [79]. This is particularly true for the so-called ‘over-
complete’ case, where the number of sources is greater than the number of
observations. In this case, even if we knew the mixing matrix exactly, we
still wouldn’t be guaranteed a successful reconstruction, in purely algebraic
terms, since the system of equations to be solved is underdetermined. More-
over, noise could destabilize the inversion.
As in all inverse problems, the success of source separation hinges on
choosing an appropriate model for the unknown signals, the ‘source model’,
and on constructing an efficient inversion algorithm. It is well known (see e.g.
[148]) that classical, quadratic constraints cannot be used for blind source
separation. Quadratic constraints (or log–priors in a Bayesian formulation)
cannot lead to source separation but merely to decorrelation, since they only
exploit up-to second-order statistics in the data (see [16]); higher-order statis-
tics are needed for the BSS problem.
Optimizing for Independence vs Optimizing for Sparsity
Like many other learning problems, separating signals into their constituent
components can be formulated as an optimization problem, incorporating
a variety of appropriate penalties. Hochreiter and Schmidhuber [84] notice
that source separation can be obtained as a “by-product” of regularization
in a low-complexity autoencoder network, establishing a link between regu-
larization and unsupervised learning in the context of BSS. Indeed, most
approaches to blind source separation, such as the independent component
analysis (ICA) family of models, also make analogous prior assumptions,
implicitly or explicitly. ICA forces independence among the components,
P(s) = ∏_{l=1}^{L} P_l(s_l), where P(s) is the assumed distribution of the sources.
This enables it to separate statistically independent sources (up to permu-
tation and scaling). Often the separation is performed under a maximum
likelihood framework [118], [169].
Crucially, however, the source separation algorithms in the ICA family
work under the assumption of mathematical independence among the com-
ponents. This is a very strong condition. Independent random variables are
uncorrelated; the converse is not true. In many physical and biological sys-
tems, such as brain fMRI, there is no physical reason for the components
to correspond to different activity patterns with independent distributions.
Although one can always map the data in a coordinate system such that
the representation is least-dependent, this does not necessarily lead to in-
terpretable components. In many cases, other conditions or priors can be
more physically plausible or efficient than the condition of mathematical in-
dependence. A substantial body of experimental and theoretical research4
(Field, [62]; Olshausen and Field, [137]; Donoho, [50]; Zibulevsky and
Pearlmutter, [176]; Li, Cichocki and Amari, [111]) suggests that sparsity with
respect to appropriate ‘dictionaries’ of functions/signals might be an under-
lying property of many information processing systems. Moreover, recent
work in neuroscience (Asari, [3]) and imaging neuroscience (Daubechies et al.,
[46]) shows how sparse representations directly facilitate computations of
interest in biological sensory processing, and that activations inferred from
functional MRI data have sparse structure [46], [30]. For
a sketch of the sought-for brain activations measured by fMRI see Fig. 1.1
and the description of Golden [77]. In all of the works referenced above the
signals have a natural sparse structure5.
4 See section 2.4 for a more extensive literature review.
5 An important research question is whether sparsity represents something more fundamental, i.e. an organization principle, in physical and biological information processing systems. This is an area of active current research, but it is beyond the scope of this Thesis.
Remark 7 (The Sparseness Principle) Sparsity reflects the fact that the
basis functions used represent those natural signals most parsimoniously. In-
tuitively, the idea is that if the elements of the dictionary match the charac-
teristics or features of a signal, then we only need a few of them to describe
it. We can view these dictionary elements as “building blocks” that are used
to construct more complex signals.
Although in some cases the above separation problem can be solved by
independent component analysis, viewed as a technique for seeking sparse
components under appropriate source distributions, as shown by Daubechies
et al. in [46], in other cases this approach fails. We propose to
explicitly optimize for sparse (or minimum entropy, in the terminology of Bar-
low [12]) components in appropriate signal dictionaries, instead of optimizing
for independence. This is captured by formulating appropriate optimization
functionals, which are accordingly efficiently optimized by an algorithm spe-
cially constructed for this purpose. While classical regularization-theory con-
straints cannot themselves be used for BSS, for the reasons discussed above,
they provide a conceptual starting point in our search for regularizers, as
they are often related to the, more generally desirable, property of smooth-
ness. Smoothness is often a highly plausible constraint physically. Moreover,
inverse problems are usually ill-posed (in the sense of Hadamard), there-
fore sensitive to noise: a simple-minded inverse will amplify the noise and
swamp solutions. Consequently, some kind of smoothness constraint is com-
monly imposed. However, “plain” ℓ2 solutions, or solutions that are based
on smoothness only, such as Fourier bases or truncated SVD (see e.g. D.
O’Leary, [135]), may miss discontinuities and small features. A more general
definition of smoothness that allows for discontinuities and other singulari-
ties will be introduced next and a better basis, based on wavelets, will be
utilized in this text. As we shall see in subsection 3.2.2, one can go from that
requirement to a class of smoothness spaces, called Besov spaces, and finally
to wavelets, building blocks for signals offering sparse solutions.
3.2 Blind Source Separation by Sparsity Max-
imization
Before we elaborate on our approach for sparse BSS, we quickly give a very
concise summary of some key approaches to sparse signal reconstruction that
are highly relevant to our model next. The rationale of signal reconstruction
utilizing sparsity was introduced in Sect. 2.4.2. Here we make the connection
with some of the methods from the sparsity literature that can be regarded
as direct antecedents of some of the ideas in this Chapter. We then state the
goal of sparse component analysis (SCA).
3.2.1 Exploiting Sparsity for Signal Reconstruction
Unsupervised learning algorithms for sparse decompositions have received
much attention in recent years. In their seminal paper [137], Olshausen and
Field model natural images as a linear superposition of basis functions, φ_k,
modulated (multiplied) by coefficients, a_k, termed ‘activities’6:
I(x, y) ≈ ∑_k a_k φ_k(x, y) ,
where (x, y) is an image coordinate or picture element (pixel). The goal is to
6 Note that the activities, a, here correspond to the sources in the BSS problem and the bases, φ, to the columns of the mixing matrix.
Figure 3.2: Sparsity function, S(·), of the model of Olshausen and Field [137],
and its derivative, S′(·). Note that the shape of the derivative is almost a
sigmoidal function. The latter encourages their learning algorithm to search
for sparse solutions.
learn a sparse ‘code’, a = (a_1, …, a_k, …, a_{|Φ|}) (where |·| here denotes the size
of a collection of variables), for natural images, as well as adaptively learn the
bases φ_k themselves from the data. The bases may be non-orthogonal. This is
achieved by constructing a non-quadratic energy cost function, parameterized
by Φ = [φ_k],
E = E(a; Φ) = ‖I − Φa‖²₂ + λ · ∑_k S(a_k) ,   (3.4)
(the first term measuring information preservation, the second the sparseness of the ‘activities’),
where S : a 7→ S(a) is a sparsity function, penalizing non-sparsity of the
activities. The penalty term is then the sum of sparseness of the activities
vector, a, while the first term computes the reconstruction error, and en-
sures that information is maximally preserved. The activity vector records
the configuration of weights modulating the basis nodes, {φ_k}, for a
particular image instance. By putting a sparse prior on a we obtain a sparse
representation of I in Φ, as was shown in Fig. 2.3. Olshausen and Field
choose S(a) = log (1 + a2) as the sparsity function in their implementation
(see Fig. 3.2); however, there is a variety of other sparsity-inducing priors
that can be used (see section 3.2.2 later). A striking result of their model
is its biological plausibility: the learned basis functions emerging from the
model resemble the spatially localized and oriented filters that are found in
the mammalian primary visual cortex, V1 [137].
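A minimal numerical sketch of the energy (3.4), using the S(a) = log(1 + a²) sparsity function of Fig. 3.2, is given below. This is illustrative only: the basis Φ, the toy image, the step size, and the value of λ are ad hoc assumptions, and a plain gradient step stands in for the authors' actual learning procedure.

```python
import numpy as np

def S(a):                          # Olshausen & Field sparsity function
    return np.log(1.0 + a ** 2)

def S_prime(a):                    # its derivative: a bounded, sigmoid-like curve
    return 2.0 * a / (1.0 + a ** 2)

def energy(img, Phi, a, lam):      # Eq. (3.4): reconstruction error + sparsity
    r = img - Phi @ a
    return r @ r + lam * S(a).sum()

rng = np.random.default_rng(2)
Phi = rng.normal(size=(64, 128))                       # overcomplete basis (toy)
img = Phi @ (rng.random(128) < 0.05).astype(float)     # built from few active bases
a = np.zeros(128)
lam, eta = 0.1, 0.001

E0 = energy(img, Phi, a, lam)
for _ in range(200):               # plain gradient descent on the activities
    grad = -2.0 * Phi.T @ (img - Phi @ a) + lam * S_prime(a)
    a -= eta * grad
assert energy(img, Phi, a, lam) < E0   # descent reduces the energy
```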
While our approach is partially inspired by the work of Olshausen and
Field, our problem is different: our application is source separation, and we
aim at learning a sparse representation of whole images, rather than image
patches and filters. Moreover, in contrast to Olshausen and Field, we will
assume that it is the expansion coefficients of the sources with respect to
a signal dictionary, not (necessarily) the sources themselves, that are mostly
zero (inactive), for the reasons that will be explained shortly. This will lead
to a different, hierarchical optimization functional7.
Donoho [50] reviews the paper of Olshausen and Field and, based on
their findings, proposes a research programme for sparse decompositions of
objects that belong to certain classes with well-describable geometrical char-
acteristics. This way, precise control over the decomposition can be achieved.
Although his class of objects is rather artificial, the general idea can have
a wider applicability. In this text, we propose an analogous approach, by
utilising “soft constraints” on the properties of the sought-for latent signals.
This will become more precise in Sect. 3.2.2.
3.2.2 Exploiting Sparsity for Blind Source Separation:
Sparse Component Analysis
Sparse BSS is based on the following key observation:
Remark 8 In the context of source separation, the key observation is that
7 The model of Olshausen and Field can be considered the “baseline” case for sparsity, i.e. the case of a signal that is sparse in physical space (e.g. in the time domain). Here the dictionary, Φ, is the ‘canonical’ basis, B = {e_t}, where e_t = (…, 0, 1, 0, …) with 1 in the t-th position and 0 everywhere else. This is equivalent to saying that the sparseness prior is on the values of the signal itself, rather than on a representation in a dictionary.
the mixing process generates signals that are typically less sparse than the
original ones. This suggests that one could formulate a learning algorithm
for the inversion problem that seeks components that are maximally sparse,
in order to recover the sources.
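This observation is easy to check numerically. In the sketch below (illustrative; the Laplacian sources, the mixing matrix, and the ℓ1/ℓ2 ratio used as an inverse sparsity index are all ad hoc choices), each mixture comes out measurably denser than either source:

```python
import numpy as np

rng = np.random.default_rng(3)

def l1_l2_ratio(s):
    # Inverse sparsity index: smaller values indicate sparser signals.
    return np.abs(s).sum() / np.sqrt((s ** 2).sum())

sources = rng.laplace(size=(2, 20000))        # sparse, heavy-tailed sources
A = np.array([[1.0, 0.8], [0.6, 1.0]])        # an ad hoc mixing matrix
mixtures = A @ sources

# Every mixture is denser (larger l1/l2 ratio) than every original source.
for mix in mixtures:
    for src in sources:
        assert l1_l2_ratio(mix) > l1_l2_ratio(src)
```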
We emphasize that sparseness is a relative property, with respect to a basis
set. In the general case, signals will not be sparse in the original observation-
domain but in a transformed domain, with respect to more complex build-
ing blocks. More general bases, or ‘dictionaries’, than the canonical basis
will therefore be needed to efficiently describe the signals [176], [111]. We
elaborate on this in subsection 3.2.2, ‘Sparse representation of signals using
wavelets’. The goal of Sparse Component Analysis with dictionary-based
priors can now be stated in the following definition (adapted from Zibulevsky
and Pearlmutter, [176]) as
Definition 1 (Dictionary-based Sparse Component Analysis) Given
an observed data set, {x_d}_{d=1}^{D}, find the sources, {s_l}_{l=1}^{L}, and the mixing
matrix, A ∈ R^{D×L}, such that the sources are sparsest, with respect to a given
signal dictionary, under the noisy linear model x_d ≈ ∑_{l=1}^{L} A_{dl} s_l.
We will develop algorithms that exploit the sparsity of signals8 to perform
blind source separation and signal reconstruction.
Sparse Representation of Signals using Wavelets
In this section we answer the question “what kind of dictionary?”. In par-
ticular, this section aims to provide some insight into how the combination
of certain concepts that will be introduced next allows us to exploit sparsity
for signal reconstruction. We only give the essential information here; more
theoretical background on wavelets and smoothness spaces can be found in
e.g. [44].
8By the term “signals” we mean the unknown signals (sources) here.
It is the interplay of three ingredients that allows us to exploit sparsity
for signal reconstruction, namely that:
• Many signal classes contain certain types of structure that can be cap-
tured by mathematical models. These signal classes can be quite gen-
eral, such as the class of finite-energy signals, for example. We will
discuss such a class here.
• Appropriate dictionary elements can be used as building blocks for sig-
nals, capturing the salient features of those signals efficiently.
• Certain regularizers/priors can implement and enforce the sparsity con-
straint on the expansion coefficients of the signals in those dictionaries
(see also Fig. 2.7).
We review some of the basic properties of wavelets that are of importance
to our framework next. Wavelets, and other wavelet-like function systems9,
are a natural way to model sparsity and smoothness for a very broad class of
signals [121]. They form ‘families’ of localized, oscillating, bandpass functions
that are dilated and shifted versions of a special function called ‘mother
wavelet’, ψ. As such, they inherently contain a notion of scale, the whole
family spanning several scales from coarse to fine. This property allows them
to form multi-resolution decompositions of any finite-energy signal, f .
In the classical setting, wavelets together with a carefully chosen, lowpass
function, the ‘scaling function’, φ, can form orthonormal bases for L2(R).
Let us denote such a basis by D = {φ_k} ∪ (∪_j {ψ_{j,k}}), where j denotes
, where j denotes
the scale and k the shift. We can then perform wavelet expansions (inverse
9 Such as wavelet packets, local cosine bases, curvelets, ridgelets, and a variety of other bases.
wavelet transforms) of the form
c ↦ f s.t. f = ∑_k c^{(φ)}_k φ_k + ∑_j ∑_k c^{(ψ)}_{j,k} ψ_{j,k}, j, k ∈ Z,   (3.5)
where10 ϕ_λ ∈ {φ_k(t) ≡ φ(t − t_k), ψ_{j,k}(t) ≡ 2^{j/2} ψ(2^j t − t_k)}_{j,k} is an
element in D, and the scaling and wavelet coefficients,
c_λ ∈ C = {c^{(φ)}_k} ∪ (∪_j {c^{(ψ)}_{j,k}}),
known as the ‘smoothness’ (or ‘approximation’) and ‘detail’ coefficients,
respectively, are computed by taking inner products (measures of similarity)
of the corresponding bases11 with the signal:
∀λ : c_λ ≡ f_λ = 〈f, ϕ_λ〉 .   (3.6)
Wavelet bases in higher dimensions can be built by taking appropriate prod-
ucts of one-dimensional wavelets. An example of the wavelet decomposition
of an image is shown in Fig. 3.3. The above construction provides a multi-
scale representation of a signal in terms of a wavelet dictionary. There exist
computationally efficient, linear time complexity algorithms for performing
forward and inverse discrete wavelet transforms, utilizing pairs of perfect
reconstruction filters [120].
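The following toy one-level Haar analysis/synthesis pair (an illustrative stand-in for a proper filter-bank implementation; the test signal is an arbitrary choice) shows both the perfect-reconstruction property and the resulting sparsity of the detail coefficients for a piecewise-smooth signal:

```python
import numpy as np

def haar_analysis(f):
    """One level of the orthonormal Haar transform."""
    c = (f[0::2] + f[1::2]) / np.sqrt(2.0)   # smoothness (approximation) part
    d = (f[0::2] - f[1::2]) / np.sqrt(2.0)   # detail part
    return c, d

def haar_synthesis(c, d):
    f = np.empty(2 * len(c))
    f[0::2] = (c + d) / np.sqrt(2.0)
    f[1::2] = (c - d) / np.sqrt(2.0)
    return f

# A piecewise-smooth signal: a smooth oscillation plus one jump discontinuity.
t = np.linspace(0.0, 1.0, 256)
f = np.sin(2 * np.pi * t) + (t > 0.503)

c, d = haar_analysis(f)
assert np.allclose(haar_synthesis(c, d), f)     # perfect reconstruction
assert np.mean(np.abs(d) < 0.05) > 0.9          # almost all details are tiny...
assert np.abs(d).max() > 0.5                    # ...except near the jump
```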
An intuitive way of defining the smoothness of a signal is to use some
kind of generalized derivative and quantify roughness by the “jumps” of
the values of the function along an interval. Hölder regularity gives such
a measure: a signal, f, has Hölder regularity with exponent α = n + r,
with n ∈ N and 0 ≤ r < 1, if there exists a constant, C, such that
10 To unclutter notation, we will use a single generic index, λ, to index bases and their corresponding coefficients, and the symbol ϕ_λ for the bases, when appropriate, since the elements in D can be ordered as λ = λ(b, j, k) = 1, …, Λ = |D|. The symbol b indexes the ‘subband’, i.e. the subset of wavelets corresponding to a particular direction, in higher-dimensional wavelets; see Fig. 3.3, for example.
11 More precisely, the dual (analysis) wavelet, ϕ̃_λ, which, in the case of an orthonormal basis, is identical to ϕ_λ.
Figure 3.3: Wavelet coefficients of the ‘Cameraman’ image, c_λ, arranged in the Mallat ordering, λ ∈ M_J, for J = 4 levels of decomposition. (In the Mallat ordering, the coefficients of the multiscale decomposition are placed in a rectangular arrangement such that the upper left corner is occupied by the smoothness coefficients and the detail subbands are placed perimetrically around them, completing the square, starting from the largest scale and going outwards to the smallest scale.) The physical-space image of Fig. 2.10 is now represented/decomposed in a relatively small number of ‘primitives’ (building blocks), the wavelet and scaling bases. Wavelets act as multiscale edge detectors: the vertical, horizontal and diagonal elements of the image are clearly visible. We see that the physical-domain image is transformed into a space where far fewer bases are needed in order to describe the signal, as is evident from the ratio of zero (black) versus non-zero coefficients.
|f^{(n)}(x) − f^{(n)}(y)| ≤ C |x − y|^r, x, y ∈ R. In words, the difference in the
values of the n–th derivative in an interval, x − y, is smaller than a cer-
tain multiple of the interval in the domain of the function, raised to the
rth power. Functions with a large Hölder exponent will be smooth. Note
that the Hölder exponent can be non-integer, offering a fine-grained definition
of smoothness. The crucial link between Hölder regularity and wavelets can
be expressed as follows (Daubechies, [44]): if a signal is smooth, with local
Hölder regularity with exponent α, in a neighborhood (x_0 − ε, x_0 + ε) around
x_0 ∈ R, for some ε > 0, such that the neighborhood is covered by a set of
wavelets {ψ_{j,k}} with corresponding index-set I, (x_0 − ε, x_0 + ε) ⊂ supp(ψ_{j,k}),
where ‘supp’ denotes the support, then the largest wavelet coefficient at each
scale j will be smaller than a multiple of (2^{−j})^{α + 1/2}, for all (j, k) ∈ I. That
is to say, smooth signals lead to small values for the wavelet coefficients of
the expansion. The above helps us transfer our attention from the concept
of smoothness expressed as Hölder regularity in physical space to wavelets.
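The decay of the wavelet coefficients with scale for smooth signals can be observed directly. The sketch below (illustrative only; a toy recursive Haar analysis on a sinusoid, not code from the thesis) shows the largest detail coefficient shrinking as the scale gets finer:

```python
import numpy as np

def haar_details_by_scale(f, levels):
    """Recursive Haar analysis; returns detail coefficients per scale, fine to coarse."""
    details, c = [], f
    for _ in range(levels):
        details.append((c[0::2] - c[1::2]) / np.sqrt(2.0))
        c = (c[0::2] + c[1::2]) / np.sqrt(2.0)
    return details

t = np.linspace(0.0, 1.0, 512)
f = np.sin(2 * np.pi * t)                       # a globally smooth signal

details = haar_details_by_scale(f, 5)
peaks = [np.abs(d).max() for d in details]      # index 0 = finest scale

# The largest detail coefficient shrinks toward the finer scales, consistent
# with the (2^{-j})^{alpha + 1/2} bound for smooth signals.
assert all(peaks[i] < peaks[i + 1] for i in range(len(peaks) - 1))
```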
In terms of the above expansion, sparsity means that most coefficients
will be “small” and only a few of them12 will be significantly different than
zero13, resulting in very sparse representations of signals. In this sense, a
regularizer that enforces sparsity will therefore correspond to a ‘complexity’
prior 14 (Moulin and Liu, [129]) on the function f . Furthermore, in the stan-
12 These large coefficients will be typically located around discontinuities and small features.
13 This is in effect a definition of sparsity under a classical (‘frequentist’) interpretation of probability (relative frequency of occurrence over an ensemble of objects). Taking it literally, one could use an ℓ0, ‘counting’, norm, defined as the number of non-zero coefficients in the sequence, ‖c‖_0 = |{c_λ : c_λ ≠ 0}|, where | · | here denotes cardinality, in order to measure the sparsity of signals. However, the ℓ0 norm is unfortunately very sensitive to noise and its computational complexity is quite high, making it a less attractive option for a sparsity measure in practice. Therefore, other norms, with p > 0, will be used here.
14 The goal of simplicity leads to the striving for complexity control. That is, we invoke the principle of parsimony here: “simpler explanations are, other things being equal, generally better than more complex ones”. Furthermore, there is a natural connection between complexity of representation and entropy. This leads to the goal of efficiency via minimum entropy representations.
dard ‘signal plus noise’ model, decomposing data in an appropriate dictio-
nary will typically result in large coefficients modelling the signal and small
ones corresponding to noise. This is a reflection of the good approximation
properties of wavelets (Mallat, [121]).
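This large-coefficients/small-coefficients picture is what thresholding estimators exploit, and it underlies the iterative-thresholding approach of this chapter. A minimal sketch (illustrative; the coefficient sequence, noise level, and threshold value are ad hoc assumptions):

```python
import numpy as np

def soft_threshold(c, tau):
    """Soft-thresholding: shrink toward zero, zeroing coefficients below tau."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

rng = np.random.default_rng(4)
# Hypothetical coefficient sequence: a few large 'signal' coefficients
# on top of many small, noise-level ones.
clean = np.zeros(200)
clean[[3, 40, 77]] = [5.0, -4.0, 6.0]
noisy = clean + 0.1 * rng.normal(size=200)

denoised = soft_threshold(noisy, tau=0.4)
assert np.count_nonzero(denoised) <= 4                        # noise is zeroed out
assert (np.sign(denoised[[3, 40, 77]]) == [1, -1, 1]).all()   # signal survives
```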
Finally, there exists an important connection between wavelets and
smoothness characterization using Besov spaces. Besov spaces are smooth-
ness classes: a Besov space Bσp,q (R) is basically a space consisting of functions
with σ derivatives in Lp, with q providing some additional fine-tuning15 [45].
We say that an object belongs to a Besov space if it has a finite Besov norm.
Taken together,
B^σ_{p,q} = { f ∈ L_p(R) : ‖f‖_{B^σ_{p,q}} < ∞ },
where ‖·‖_{B^σ_{p,q}} is the Besov (semi-)norm16. The important fact for our purposes
is that wavelets form an unconditional basis for Besov spaces. This means
that the norm (“size”) of an object in Besov space can be equivalently com-
puted from a simple sequence norm (Chambolle et al., [31]), ‖f‖_{B^σ_{p,p}} ≍ ‖c‖_{s,p},
where
‖c‖_{s,p} = ( ∑_λ 2^{spj} |c_λ|^p )^{1/p} ,
where s = σ + 1/2− 1/p, j = j(λ) denotes the scale of the λ–th coefficient,
and ‘≍’ denotes norm equivalence. In other words, smoothness translates
to sparseness. This makes sense intuitively, because if there is a very high
degree of detail in a signal then we will need a large number of elements
from our dictionary of building blocks in order to accurately describe it. The
15 In particular, q < q′ ⇒ B^σ_{p,q} ⊂ B^σ_{p,q′}. Note also that the Hölder class is equivalent to C^σ = B^σ_{∞,∞} [54].
16 A convenient choice in practice, which we will adopt in this text, is to set q = p and drop this index from the equations.
above functional-analytic framework formalizes this concept.
The above norm is a special case of a weighted ℓp norm,
‖c‖_{ℓ_p,w} = ∑_λ w_λ |c_λ|^p ,
used in Daubechies et al. (2005) [45], with weights given by w_λ = 2^{spj}.
ℓp norms, in particular, have been widely used in applications for modelling
wavelet priors. More generally, a variety of relevant priors can be used under
a Bayesian framework (Moulin and Liu, [129]).
Based on this background knowledge, the main point of this subsection
can be stated as [45]:
Remark 9 If the objects to be reconstructed are expected to be mostly smooth,
with localized lower dimensional singularities, we can expect that their expan-
sions into wavelets will be sparse.
An example of such an object/signal is given in Fig. 3.4. We shall exploit the
property of wavelets to parsimoniously describe the above class of objects in
order to construct algorithms for sparse BSS.
Formulating the Sparse Objective
As discussed in the Introduction, without appropriate constraints blind
source separation is a very ill-posed problem. The key observation we want
to exploit and formalize here is that: (a) a wide class of signals are sparse
with respect to (i.e. can be sparsely represented in) appropriate dictionaries
[176] and (b) that mixing generates denser signals. An intuitive way to for-
mulate the optimization functional for sparse problems would then be one
that captures this fact explicitly. In particular, we want to model the notion
that the activities of the sources with respect to those signal dictionaries are
“rarely active”, i.e. non-zero [137]. Following Olshausen and Field [137], we
Figure 3.4: A signal exhibiting smooth areas and localized lower-dimensional singularities.
formulate the sparse problem in an energy minimization framework, instead
of working in a functional-analytic framework. Our model, however, assumes
sparse structure of the unknown sources in a dictionary, not (necessarily) of
the observations.
In order to formulate the sparse objective one may now consider di-
rectly using Besov priors in the wavelet domain (Leporini and Pesquet, [108];
Golden, [77]). The separation problem can then be formulated as one searching
for pairs of variables {(s_l, a_l)}_{l=1}^{L} minimizing the functional
E(S, A) = ‖X − AS‖²₂ + λ_S ∑_l ‖s_l‖_{B^σ_{p,p}} + λ_A ∑_l ‖a_l‖²₂ ,   (3.7)
where X̂ ≡ AS denotes the model’s reconstruction of the observations and
0 < λ_S, λ_A < ∞ are Lagrange multipliers, at our disposal (see below).
For spatio-temporal data, such as fMRI, we seek spatio-temporal compo-
nents, i.e. pairs of time-courses and spatial maps.
The three terms appearing in Eq. (3.7) can be explained as follows. The
first term in the above objective forces the solution to be such that X̂, the
image of a value from the model, S, mapped to the observation space by
A, is as close as possible to the observations, X, in an ℓ2 sense (e.g. in the
matrix 2–norm or the Frobenius norm, in a matrix formulation). Therefore,
this is a ‘fidelity’ term, which minimizes the residuals by forcing the columns
of the mixing matrix to span the input space, ensuring that “information is
preserved” [137]. It is the equivalent of the empirical χ² statistic for the
observations [116],
χ² = (X − AS)ᵀ Σ⁻¹ (X − AS) ,   (3.8)
where Σ is the data covariance.
The second and third terms are penalties on S and A, respectively. The
second term is a penalty on the “size” of the sources, sl, with respect to
the Besov norm Bσp,p. In doing this, we restrict the solution to belong to
certain classes of smoothness spaces, viz. Besov spaces. By writing the
wavelet expansion, s = ∑_{k∈Z} c_{−1,k} φ(t − k) + ∑_{j≥0} ∑_{k∈Z} c_{j,k} 2^{j/2} ψ(2^j t − k), and
using the equivalence between the norms ‖s‖_p + ‖s‖_{B^σ_{p,q}} and ‖c_{−1}‖_p + ‖c‖_{b^s_{p,q}},
in the Besov and wavelet-coefficient domains, respectively (see subsection 3.2.2),
this turns out to be a penalty on the coefficients of the decomposition of (the
source-part of) the solution with respect to a dictionary of functions17. That
is, sparsity is defined on an appropriate representation of the sources. The
second term is therefore a penalty that enforces the sparsity constraint.
To provide a little more flexibility than the above prior, we can use the
more general weighted ℓp norm proposed in Daubechies et al. [45],
‖c_l‖_{p,w} = ∑_λ w_{l,λ} |c_{l,λ}|^p , w_{l,λ} ∈ R, with 1 ≤ p < 2 .   (3.9)
An important special case is obtained if we let the exponent, p, be equal to
the smallest value in its range, i.e. let p be equal to 1. This gives the ℓ1
17 In a Bayesian formulation one can equivalently impose the prior p(c|β) ∝ β^{(2J−1)/α} exp(−β‖c‖^α_{b^s_{p,q}}), β > 0, α > 0, on the expansion coefficients, where J is the number of decomposition levels [108].
Figure 3.5: A family of ℓp norms for various values of p. The unit circles (‘balls’) of each norm are drawn here. These are curves of the form |x/a|^p + |y/b|^p = 1, a, b > 0, defining ‘superellipses’ in the rectangle −a ≤ x ≤ +a × −b ≤ y ≤ +b (Lamé curves); here a = b = 1. Curves for ℓ∞, ℓ4, ℓ2, ℓ1, ℓ1/2, ℓ1/4, and ℓ0 are shown. The ℓ1 norm “strikes a middle ground”, by being the sparsest one that is also convex.
prior, ‖c_l‖_1 = w_l ∑_λ |c_{l,λ}|, w_{l,λ} = w_l, which is related to the Besov norm
B^σ_{1,1}. This choice has the advantage of being the sparsest prior in this family
that maintains the convexity property for the penalty: see Fig. 3.5. The ℓ1
prior has also proven to be robust to noise [111].
We enforce the weighted ℓp sparsity prior on cl in combination with the
structural constraint on the sources
s_l = ∑_λ c_{l,λ} ϕ_λ , l = 1, …, L ,   (3.10)
where c_{l,λ} = 〈s_l, ϕ_λ〉 is the inner product of the source s_l with the dictionary
element ϕλ; these can be thought of as measures of “similarity” of these two
functions. In other words, we assume that the sources are synthesized as
linear combinations of our building blocks of choice, the wavelets ϕλ. We
essentially “project” the unknown signals into a Besov smoothness space (Choi
and Baraniuk, [35]).
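The structural constraint (3.10), and the recovery of the coefficients as inner products, can be sketched as follows (illustrative; a random orthonormal matrix stands in for an actual wavelet dictionary, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N, L = 64, 2
# Stand-in orthonormal dictionary: columns of Phi play the role of phi_lambda.
Phi, _ = np.linalg.qr(rng.normal(size=(N, N)))

# Sparse coefficient matrix C: only a few active atoms per source.
C = np.zeros((L, N))
for l in range(L):
    active = rng.choice(N, size=4, replace=False)
    C[l, active] = rng.normal(size=4)

S = C @ Phi.T                  # Eq. (3.10): s_l = sum_lambda c_{l,lambda} phi_lambda
coeffs = S @ Phi               # analysis: c_{l,lambda} = <s_l, phi_lambda>
assert np.allclose(coeffs, C)  # synthesis and analysis are consistent
```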
Finally, the third term is a “growth-limiting” penalty (Olshausen and
Field, [138]), which ensures that we avoid the unwanted effect of the mixing
vectors, al, growing without bound. Note that without this term we could
get a solution that maximizes the “likelihood” (fitting) of the observations
under the model18 without being a physically plausible one. Li et al. [111]
use an analogous constraint, namely that the vectors al be unit vectors:
their ℓ2 norms are rescaled to 1, with the sources rescaled accordingly.
Olshausen and Field make a similar adjustment in [138], rescaling the ℓ2
norms so that the output variance of each coefficient is held at a desired,
pre-fixed level.
Using the above, our optimization functional E can finally be written as:

    EΦ(C, A) = ‖X − A(ΦCᵀ)ᵀ‖²₂ + λC Σl Σλ wl,λ |cl,λ|^p + λA Σl ‖al‖²₂ ,
    S = (ΦCᵀ)ᵀ ,                                                        (3.11)
with 1 ≤ p < 2, where the linear operator Φ : cl ↦ sl represents the basis19
{ϕλ}. Putting the sparsity constraint on the activities of the sources leads
to a sparse representation of those sources in Φ, as will be demonstrated
later. To gain a little more intuition about the method, the optimization
functional can be viewed as a hierarchical "feedforward" network; this will
be shown in section 3.3.2, after the optimization algorithm is stated.
Remark 10 The energy function EΦ(C, A) prefers bases that allow sparse
representations. To see this, consider the penalty term on the activities
cλ, with p < 2 as above. For a fixed level of activity 'energy' (variance),
‖c‖²ℓ2 = Σλ c²λ (see Fig. 3.6), the value of the penalty is minimized if the
vector of coefficients has the fewest possible number of non-zero components.
18 The precise definition of the term 'likelihood' in the probabilistic context will be given in section 4.3.2.
19 Note that this multiplicative formulation is given only to provide the formal specification of the problem. In practice, the multiplication by Φ is never performed explicitly; wavelet analysis/synthesis is efficiently implemented by filter banks [121].
Figure 3.6: Three distributions of wavelet coefficients (Laplace, Gaussian, and uniform) with equal 'variance' (spread) but different 'kurtosis' (peakedness). We can also think of these curves as graphs representing the weights put on each coefficient according to its magnitude, or, conversely, as penalties. The highly-peaked one (solid curve) corresponds to a sparsity-inducing ℓ1 prior, as it assigns a much higher weight to small coefficients.
This can be also seen from Fig. 3.7, where the ℓ1 and ℓ2 ‘balls’ are plotted.
The amount of penalization is determined by the Lagrange multipliers λC
and λA, balancing these three different “forces” (the three terms of Eq. 3.11).
If these coefficients are large, then the last two terms will be small at the
minimizer, resulting in variables S and A that are smoother/smaller. On
the contrary, if these are small the algorithm will try to fit the data more
closely, at the expense of more "noise". Thus the parameters λC and λA offer
wide flexibility, but at the same time raise the issue of how best to set or
estimate them. In this thesis, we propose a method that empirically estimates
quantities related to λC and λA that are important for the operation of the
method; this will be made clearer in the sequel.
More generally, if SM : S → R is an appropriate sparsity function defined
on the space of solutions, S, penalizing non-sparse activities with respect to a
signal-generating “machine”,M, capturing our prior expectations about the
generative mechanism of the sought-for signals, we can give a more general
definition for SCA. Formally, and equivalently with definition 1 (given in
Figure 3.7: How the ℓ1 penalty leads to sparse representations, compared to an ℓ2 penalty, for a two-parameter problem. The problem is to find the optimal vector of coefficients in the (c1, c2)-space under the minimum ℓ2 and ℓ1 norms, represented by their corresponding balls, such that the linear constraint y = Ac is satisfied. The optimal solution under the ℓ1 norm, cs = (0, cs2), within the affine space of solutions of the linear system y = Ac, is sparse, since it zeroes one of the two components of the c-vector; the ℓ2 solution, cd = (cd1, cd2), is not. See also Fig. 2.7 and its corresponding explanation in the text.
subsection 3.2.2), we can now state the goal of SCA as:
Definition 2 (Generalized Dictionary-based Sparse Component Analysis)
Find the expansion coefficients, C, of the sources with respect to a machine,
M, and the mixing matrix, A, as the minimizers of the regularized energy
(objective) functional E ,
    (C⋆, A⋆) = argmin E ,  with  E = ‖X − AS‖² + λS SM(S) + λA ‖A‖² ,        (3.12)

such that C is as sparse as possible under the given sparsity function SM(·).
For the model discussed in this chapter, we just define SM(S) to be the
weighted ℓp norm on the wavelet coefficients of the sources, S ∈ S, as before20.
Optimization
The SCA functional, E , must be optimized with respect to (C,A) in order to
obtain an estimate of these values. In principle, one can now perform “direct”
optimization of this functional. Unfortunately, while the above formulation
provides a straightforward definition of the problem, as shown by Daubechies
et al. in [45], the variational equations derived directly from this functional,
expressed in the ϕλ basis,

    ∀l, ∀λ ∈ IΦ :  (AᵀA sl − Aᵀx)λ + (1/2) λC wλ p |cl,λ|^{p−1} sgn(cl,λ) = 0 ,        (3.13)

where (·)λ denotes the operation of picking the λ-th element of the wavelet
transform of the argument in the parentheses, cl,λ is the inner product ⟨sl, ϕλ⟩,
as before, and IΦ denotes the index-set of Φ, result in a coupled system of
nonlinear equations which is non-trivial to solve analytically21. Zibulevsky
and Pearlmutter [176] use a conjugate-gradient based approach to perform
optimization of Eq. (3.11); conjugate gradients is a classical "workhorse"
method for inverse problems with a large number of unknowns. However, as
discussed in their paper, gradient-based approaches result in slow learning22.
To circumvent these problems, we adapt a variational methodology for inverse
problems under sparsity constraints, recently developed by Daubechies,
Defrise and De Mol [45], to blind inverse problems, i.e. to problems where
the observation operator is unknown. The method, utilizing ‘surrogate’ func-
20 In Chapter 4 we show how the regularized energy functional E can be equivalently interpreted in the probabilistic framework, allowing various extensions.
21 By this we mean even the original (non-blind) regularization problem with ℓ1, or more generally ℓp, 1 ≤ p < 2, priors. The expression AᵀA sl couples all the equations together, and, since the equations are nonlinear, this makes the system difficult to work with.
22 Zibulevsky et al. propose alternative algorithms as well [177], one of which, clustering followed by separation, will be discussed in the applications.
tionals and leading to an ‘iterative thresholding’ algorithm, will be described
in more detail in section 3.3.1.
Remark 11 Iterative thresholding is essentially an algorithmic way to im-
pose and compute the sparseness constraint on the solution of a linear inverse
problem, which would otherwise be computationally much more difficult.
Remark 12 We note that blind inverse problems, such as blind separation
and sparse representation with a learned basis, lead to bilinear problems in
(S,A), and therefore the optimization functional E is not convex in general.
Nevertheless, the regularization idea can still be used.
Our solution will lead to an iterative algorithm that alternates between esti-
mating the sources and learning the mixing matrix:
• Sources, S: These are inferred using the Iterative Thresholding algo-
rithm, presented next, enforcing the sparse prior.
• Mixing matrix, A: This is learned using the corresponding estimation
  equations, obtained by solving ∂E(S, A)/∂A = 0.
Alternatively, one can use a two-stage, clustering–separation approach,
as in Zibulevsky et al. [177] and Li et al. [111]. We have experimented
with a simple spherical k-means algorithm [119], with encouraging
results23; these will be presented in section 4.6.
The next section will elaborate on the IT approach.
23Zibulevsky et al. use the fuzzy C-means algorithm.
3.3 Sparse Component Analysis by Iterative
Thresholding
3.3.1 Iterative Thresholding
Recently, Daubechies, Defrise, and De Mol [45], ‘DDD’ hereafter, introduced
an algorithm for linear inversion problems with sparsity constraints, the iter-
ative thresholding (IT) algorithm. The IT algorithm of DDD finds a unique
minimizer for functionals of the form
    I(s) = ‖As − x‖² + Σ_{λ∈IC} wλ |cλ|^p ,   where cλ = ⟨s, ϕλ⟩ ,        (3.14)
when the linear operator from the signals of interest to the data (i.e. the
observation operator), A, is known and the value of the exponent, p, of the
ℓp penalty is restricted to be greater than or equal to 1 (and less than 2).
We denote the index-set of the coefficients cλ by IC := {λ : cλ ∈ C}. This
notation allows for compact expression of structure in the set {cλ} by defining
appropriate predicates for the condition defining the set C.

Before we describe the final sparse component analysis algorithm, we first
review iterative thresholding. Our emphasis will be on a pictorial view
of the structure of the algorithm, rather than a rigorous functional-analytic
derivation, which can be found in [45].
To solve the optimization problem of Eq. (3.14), Daubechies et al. developed
a variational methodology in the deterministic framework. Their
approach to bypassing the coupling observed in the variational equations24
is to use surrogate functionals. In particular, using a technique from 'opti-
mization transfer' (see e.g. [85]), the optimization functional I is replaced
24 The variational equations are coupled because the observation operator is not diagonalizable in the ϕλ basis in general.
by a sequence of simpler surrogate functionals, {I(i)}. The algorithm then
makes use of three signal spaces (Fig. 3.9): the observation (data) space,
X, the solution space, S, and the wavelet space25, {ϕλ}, represented by the
corresponding wavelet expansion coefficients, C = {cλ}.
Remark 13 By making special assumptions/choices about the structure of
the solution space, S, such as it being a Besov smoothness space, one can
obtain/enforce the required properties for the solution, s, such as smoothness,
localization, and sparsity with respect to the dictionary, Φ.
The original functional is replaced by surrogate functionals, constructed
by adding to it a special term, ‖s − z‖² − ‖As − Az‖², chosen such that
certain terms cancel out26:

    I(s) ← I(s, z) = I(s) + ‖s − z‖² − ‖As − Az‖² ,        (3.15)
where z ∈ S is a "guess" value for the solution27. Provided that the operator
A is bounded, one can find a constant, C, such that ‖AᵀA‖ < C, so that
C IL − AᵀA is strictly positive28 and the added term is strictly convex in s
for any z. Then29 the surrogate I(s, z) is also strictly convex in s and has a
unique minimizer for any z; see Fig. 2.7. After some manipulation, the
surrogate functionals can

25 In fact, any suitable orthonormal basis can be used.
26 In particular, the terms in AᵀAs.
27 In theory, the IT algorithm works with any value for z. However, for faster convergence, it is better to use a "guess" value that is "close" to the solution, s⋆. In the absence of any meaningful (informed) first estimate, one can use, e.g., z = 0. For the subsequent iterations, z is set to the optimizer of the surrogate functional at the previous iteration.
28 If ‖A‖ < 1, we can set C = 1, without loss of generality.
29 In particular, the following theorem holds [174]:
Theorem 2 Let E(x) be an energy function with bounded Hessian ∂²E(x)/∂x∂x. Then we can always decompose it into the sum of a convex function and a concave function.
be written as

    I(s, z) = Σλ [ s²λ − 2 sλ (z + Aᵀ(x − Az))λ + wλ |sλ|^p ] + const. ,        (3.16)

where (·)λ and sλ are defined as in subsection 3.2.2.
Functional minimization then leads to an iterative thresholding algorithm
via a nonlinear Landweber-type iteration [101]. The nonlinearity is obtained
from the sparsity-promoting, non-quadratic penalty and results in a thresh-
olding operator on the values of the (wavelet) expansion coefficients of the
estimated signals in the orthonormal dictionary of our choice. It can be shown
(see30 [45]) that the minimizer of the original functional, I(s), is found by
the iteration s(i) 7→ s(i+1) given by
    s(i+1) = Σλ Sτ( ((IL − AᵀA) s(i) + Aᵀx)λ ) ϕλ ,        (3.17)

where the quantity being thresholded is the intermediate coefficient
c̄λ = ((IL − AᵀA) s(i) + Aᵀx)λ, and Sτ : c̄λ ↦ cλ is a thresholding operator
with threshold τ. This expression is obtained by functional minimization of
the surrogate functionals. For
the case of an ℓ1 prior, Sτ becomes the soft-thresholding function (Fig. 3.8),
    Sτ(c) =   c + τ ,   if c ≤ −τ ,
              0 ,       if |c| < τ ,
              c − τ ,   if c ≥ τ .        (3.18)
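A minimal implementation of the soft-thresholding operator of Eq. (3.18) (a sketch in NumPy; not the thesis code):

```python
import numpy as np

def soft_threshold(c, tau):
    """Soft-thresholding operator S_tau of Eq. (3.18), applied component-wise:
    coefficients in [-tau, tau] are set to zero; the rest are shrunk towards
    zero by tau."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)
```

This is the compact form c = sgn(c̄)(|c̄| − τ)·1{|c̄|>τ} that appears in the caption of Fig. 3.8.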
Remark 14 Note how the replacement of the functional I(s) with the func-
30 Note that we use a different notation from that of Daubechies et al. (2004) [45] here, in order to conform with the BSS nomenclature. The correspondence is: s ↔ f, x ↔ g, A ↔ K.
Figure 3.8: Soft-thresholding function, Sτ : c̄ ↦ c, with threshold τ, corresponding to ℓ1 penalization. Sτ has a simple geometric interpretation: "small" input coefficients, c̄λ, in the interval [−τ, τ], are set to zero, while all others are shifted by τ, according to the equation c = sgn(c̄)(|c̄| − τ)·1{|c̄|>τ}, where 1{|c̄|>τ} is the indicator function on the interval |c̄| > τ. Since, for a wide variety of natural signals, analysis in appropriate wavelet-type dictionaries results in a large proportion of small expansion coefficients, this process sparsifies their representation in wavelets even more.
tionals I(s, z) causes the estimating equation (3.17) to decouple as a sum
over individual λ. Also note that, using the iterative scheme, the IT algo-
rithm of DDD avoids inversion of the operator A. This is important for both
computational and numerical reasons.
The iterative thresholding algorithm of DDD can be broken down into
the algorithmic steps shown in Alg. 1. Their schematic representation can
be seen in Fig. 3.9.
The steps (2)–(6) of the above procedure can be seen as a single 'thresh-
olded Landweber' step that maps s(i) ↦ s(i+1) via the compound operator

    FIT = (Φ ∘ Sτ ∘ Φ⁻¹) ∘ L        (3.19)
(where '∘' denotes operator composition). Note that this operation is non-
linear, due to the nonlinearity of the thresholding operation [45], [31]. Equa-
Algorithm 1 Iterative Thresholding for Linear Inverse Problems (Daubechies, Defrise, De Mol)

0. Let the observed data be denoted by x ∈ X. Let i = 0 and pick an initial value, s(0) ∈ S, for the unknown signal s, where the upper index in parentheses denotes the iteration step. Then iterate the following steps:

1. The current estimate of the signal, s(i), is mapped into the observation space via the 'forward' linear operator, A, as x(i) = A s(i).

2. This leads to a residual ∆x(i) = x − x(i), which will be used as a "correction signal" to be fed back into the algorithm in order to adjust the current estimate of the sources.

3. Compute an intermediate value, s̄(i), via a Landweber-type operation

       L : s(i) ↦ s̄(i) = s(i) + ∆s(i) ,  where  ∆s(i) = Aᵀ∆x(i) = Aᵀ(x − A s(i)) .

   That is, the residual information is mapped back to the latent space.

4. This intermediate value is then decomposed in a basis {ϕλ} (e.g. a wavelet basis) via the corresponding wavelet analysis operator31, Φ⁻¹, as {c̄(i)λ}λ∈IC ∈ C, where it gets penalized for non-sparsity (because of the second term of the functional I) and soft-thresholded32 by the operator Sτ(·), applied component-wise (see Fig. 3.8). This results in a new wavelet sequence, {c(i)λ}λ∈IC:

       ∀λ : c(i)λ = Sτ( c̄(i)λ ) ,  where  c̄(i)λ = (Φ⁻¹ s̄(i))λ ,

   for all λ = 1, . . . , Λ = |IC|.

5. Finally, c(i) is mapped back to the unknown-signal space via the wavelet synthesis operator, Φ, giving

       s(i+1) = Φ c(i) .

   This is our new (current) estimate of the unknown signal.

6. The algorithm is then repeated from step (2) until convergence, or until a pre-defined maximum number of iterations is reached.
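The loop of Alg. 1 can be sketched as follows (a simplified illustration, not the thesis implementation: the operator A is assumed known with ‖A‖ < 1, the dictionary is stored as an explicit orthonormal matrix `Phi` so that Φ⁻¹ = Φᵀ, and the penalty is ℓ1, giving soft thresholding):

```python
import numpy as np

def iterative_thresholding(A, x, Phi, tau, n_iter=200):
    """Sketch of Alg. 1.
    A: known D x N observation operator with ||A|| < 1 (non-expansive).
    Phi: N x N orthonormal dictionary (columns phi_lambda), so Phi^{-1} = Phi^T.
    tau: threshold of the soft-thresholding operator S_tau (l1 penalty)."""
    s = np.zeros(A.shape[1])                       # step 0: initial estimate s^(0)
    for _ in range(n_iter):
        dx = x - A @ s                             # steps 1-2: forward map and residual
        s_bar = s + A.T @ dx                       # step 3: Landweber-type update
        c_bar = Phi.T @ s_bar                      # step 4: wavelet analysis
        c = np.sign(c_bar) * np.maximum(np.abs(c_bar) - tau, 0.0)  # soft threshold
        s = Phi @ c                                # step 5: wavelet synthesis
    return s
```

Note that the operator A is never inverted, in line with Remark 14.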
tion (3.17) then implements the operator FIT.
Daubechies, Defrise and De Mol prove that if A is non-expansive, which
we ensure here by restricting its norm in an appropriate way (shown next),
the iterative thresholding algorithm converges to the minimizer, s⋆, of the
original functional, I.
Interpretation as a Majorize-Minimize method. The step-by-step
analysis above was given for a better understanding of the structure of the
algorithm; the actual implementation may differ. It highlights, however, the
connection with the generative model that will be proposed in the next Chap-
ter. Indeed, the IT algorithm can be interpreted as a ‘majorize-minimize’
(MM) type algorithm (Hunter and Lange, [85]), an algorithm closely re-
lated to the expectation-maximization (EM) algorithm. Each step involves
the construction of a surrogate functional, I(s, z(i)), using principles such as
convexity and inequalities, that 'majorizes' an often complicated objective
function. This surrogate is then minimized in order to find a new point in
the domain of the objective, which in turn leads to the construction of a new
surrogate, etc. Our method of Ch. 4 derives a “soft” version of the iterative
thresholding algorithm of DDD from probabilistic principles by deriving a
variational EM–type algorithm. This enables its generalization such that crit-
ical parameters, such as the threshold value (see next), are learned/estimated
from the data.
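As an illustration only (the function names here are mine, not the thesis's), the MM scheme can be sketched generically: each iteration builds a surrogate that majorizes the objective at the current point and then minimizes it.

```python
def majorize_minimize(surrogate_minimizer, z0, n_iter=50):
    """Generic majorize-minimize skeleton. `surrogate_minimizer(z)` is assumed
    to return the minimizer of a surrogate that majorizes the objective and
    touches it at z; iterating drives the objective monotonically downwards."""
    z = z0
    for _ in range(n_iter):
        z = surrogate_minimizer(z)
    return z

# Toy instance: minimize f(x) = (x - 3)^2 using the surrogate
# g(x; z) = (x - 3)^2 + (x - z)^2 >= f(x), with equality at x = z;
# its minimizer is argmin_x g = (3 + z) / 2, so iterates converge to 3.
x_star = majorize_minimize(lambda z: (3.0 + z) / 2.0, z0=0.0)
```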
3.3.2 The SCA-IT Algorithm
The Sparse Component Analysis by Iterative Thresholding algorithm alter-
nates between reconstructing the unknown signals under the sparsity con-
straint, imposed iteratively by the IT algorithm, and learning the observation
(mixing) operator by solving the equation ∂E(S, A)/∂A = 0. The algorithm
Figure 3.9: An illustration of the Iterative Thresholding (IT) algorithm as a double mapping operation among signal spaces: the observation space, X, the solution space, S, and the wavelet space, C. The observed data is denoted by x ∈ X and the sought-for signal by s ∈ S. The numbered squares in the figure correspond to the algorithmic steps of Alg. 1. Starting from an initial value s(0), the IT algorithm performs a sequence of iterations s(i) ↦ s(i+1). The 'forward' mapping from the solution (unknown-signal) space, S, to the observation space, X, is represented by the linear operator A, and that from S to the wavelet (coefficient) space, C, i.e. the wavelet analysis, by Φ⁻¹. The iterative algorithm works by employing a Landweber-type operator, L : s ↦ s̄ (L is the composition of operations (1), (2), and (3)), acting on the current signal estimate, s(i) ∈ S, and adjusting for data misfit, in combination with a thresholding (shrinkage) operator, Sτ : c̄ ↦ c, acting on the coefficients of the wavelet expansion of the "intermediate" estimate s̄(i), denoted by c̄(i) = (c̄(i)λ). This latter operation embodies the sparsity prior. The optimization algorithm tries to minimize the residual, ‖∆x(i)‖ = ‖x − x(i)‖, where x(i) is a value generated from the model, while at the same time trying to find the sparsest signal s with respect to the basis D = {ϕλ} under an ℓp norm.
can be summarized as shown in Alg. 2.

Algorithm 2 Iterative Thresholding for Sparse Component Analysis (SCA-IT)

    Let i ← 0
    Initial estimates of S, A: let S ← 0, A ← I
    repeat
        Re-estimate the sources sl using the Iterative Thresholding algorithm (Alg. 1).
        Re-estimate the mixing matrix by A = (X Sᵀ) · (λA IL + S Sᵀ)⁻¹.
        Rescale A s.t. ‖al‖2 = 1, ∀l.
        if (∆E < ∆Etol or ∆A < ∆Atol) then
            return (S, A)
        end if
        Let i ← i + 1
    until i > maxniters

A crucial problem is the exact setting of the threshold, τ. An approach to
solving this problem is presented in Sect. 3.5.
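The closed-form A-update of Alg. 2 can be sketched as follows (a sketch under my own conventions, not the thesis code: X is D×N, S is L×N, and λA is the weight-decay parameter):

```python
import numpy as np

def update_mixing_matrix(X, S, lam_A=0.0):
    """Re-estimation step of Alg. 2: A = (X S^T)(lam_A I_L + S S^T)^{-1},
    followed by rescaling each mixing vector (column) a_l to unit l2 norm."""
    L = S.shape[0]
    A = (X @ S.T) @ np.linalg.inv(lam_A * np.eye(L) + S @ S.T)
    A /= np.linalg.norm(A, axis=0, keepdims=True)   # enforce ||a_l||_2 = 1
    return A
```

With λA = 0 and the true sources known, this reduces to the least-squares estimate of A, so the true (unit-norm) mixing matrix is recovered exactly.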
Hierarchical network representation
The model for SCA presented in subsection 3.2.2 can be also viewed as a
three-layer hierarchical “feedforward” network, according to a ‘signal flow’
representation. This is shown in Fig. 3.10. The three layers of the model
are the wavelet, source, and observation layers, respectively. The middle
and bottom layers taken as a pair, (sl, xd), interact via the connectionsAdl
, with the penalty term on the mixing matrix, A, modulating the
value of those connections in a ‘weight-decay’ fashion, with total connection
strength λA
∑
l
∑
dAdl2 (cf. the penalty term on A of Eq. (3.11)). The pair
of top and middle layers, (ϕλ, sl), interact via the connectionscl,λ,
with the penalty of the second term of Eq. (3.11), λC
∑
l
∑
λwl,λ|cl,λ|p, pe-
nalizing for non-sparsity, thus driving their values towards zero. All of the
above quantities also interact via the first, data-dependent term of Eq. (3.11),
Figure 3.10: Sparse Component Analysis model as a layered "feedforward" network. This graph emphasizes the generative relationships among the different signals in the model. In this representation, nodes represent signals (including noise "signals") and links represent functional dependence relations between signals. This network corresponds to the set of Eqns (3.8)–(3.11), along with the full model specification of that section. In particular, the nodes ϕ1, . . . , ϕΛ are dictionary elements, s1, . . . , sL are the unobserved sources, and x1, . . . , xD are the observations. Each set of connections in the network is annotated with its respective set of parameters, as shown. Observation noise EX with covariance Σ is assumed to be added at the sensors. The embedded figure shows the thresholding operation resulting from the ℓ1 prior on the wavelet coefficients of the sources, cl,λ. The network and the corresponding learning algorithm are designed such that the "activities" cl,λ are minimized under a sparse prior, leading to a sparse representation of the sources, {sl}Ll=1, in the dictionary {ϕλ}Λλ=1.
Σd Σn [ xd,n − Σl Adl Σλ cl,λ ϕλ,n ]² , which tries to fit the data as closely
as possible, given the constraints of the other two terms.
Effect of the sparse prior on the network. While the links between
the Φ and S layers in the network of Fig. 3.10 appear to suggest that the con-
nectivity pattern is full33, in reality, due to the sparse source model and the
33 The connectivity pattern between the Φ and S layers of the SCA network is captured by the matrix M^{Φ,S} = [M^{Φ,S}_{l,λ}], where M^{Φ,S}_{l,λ} ∈ {0, 1}, indexed by the index-set of the wavelet coefficient matrix C, IC = {1, . . . , L} × {1, . . . , Λ}. A full connectivity pattern means M^{Φ,S}_{l,λ} = 1, ∀(l, λ). This network representation is especially intuitive in applications such as sparse coding. Note that in this case the 'codes' are the al and the 'responses' are the sl,n.
resulting thresholding function, most bases will be effectively “switched off”
after learning, leading to very sparse connectivities. This can be seen by
observing the geometry of the soft-thresholding function of Fig. 3.8 in com-
bination with a typical histogram of cλ values, such as the one shown in
Fig. 4.5: a threshold τ will set to zero all values −τ < cλ < τ . The specifi-
cation of the model is such that the ‘activities’, cl,λ, are minimized, leading
to a sparse representation of the sources in D. This can be seen as an
application of the principle of parsimony.
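The "switching-off" effect can be illustrated numerically (a sketch; Laplacian-distributed coefficients are assumed here as a stand-in for the heavy-tailed histogram of Fig. 4.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_switched_off(c, tau):
    """Fraction of coefficients with |c_lambda| < tau, i.e. zeroed by the threshold."""
    return float(np.mean(np.abs(c) < tau))

# Heavy-tailed (Laplacian) coefficients: a threshold of one scale unit
# already zeroes about 1 - exp(-1) ~ 63% of them.
c = rng.laplace(loc=0.0, scale=1.0, size=100_000)
```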
We also note that, in common with other sparse learning models (see e.g.
Olshausen and Field [138], p. 3323), although the network generative equa-
tions, Eq. (3.11), are linear, the learning algorithm, which results from the
sparsity constraint, is nonlinear, leading to non-trivial interactions among
the bases. This is especially true for overcomplete (non-critically sampled)
bases34.
3.4 Experimental Results
The experimental results presented in this section are divided into three
classes. The first concerns the separation of natural images, for both arti-
ficial and natural mixing conditions. The second class considers the blind
separation of more sources than mixtures. Finally, the third deals with the
extraction of weak biosignals in noise. In particular, the decomposition of
spatio-temporal fMRI datasets is studied in these experiments. In the first
part, simulated activations drawn from two different source distributions and
two overlapping conditions are separated. In the second part, experiments
on real fMRI data are used to compare SCA-IT with ICA.
34 While the wavelet transform is known to largely decorrelate signals, some higher-order interactions remain.
3.4.1 Natural Image Separation
We apply the SCA-IT algorithm on two natural image datasets: the first is a
mixture of known images, the ‘Cows and Butterfly’ dataset35, and the second
is the dataset used by Adelson and Farid in [61], assessing the ability of blind
source separation algorithms in separating reflections from images. In both
datasets, we have a set of L = 2 unknown images to be estimated from a
set of D = 2 observed mixtures of them. We note that the second dataset
has proven to be a rather difficult task for many ICA algorithms. The usual
assumption of decomposing along one-dimensional independent signal spaces
has proven to be of marginal value here. The results, along with explanations,
are shown in Figs 3.11 to 3.23. Some comments on running SCA/IT on the
datasets are given next.
• For the 'Cows & Butterfly' dataset, we use additive noise generated
  as εxd = 0.1 std(xd) ε1, where ε1 ∼ N (0, 1), for the d-th sensor
  signal.
• In some cases where the operator A was known, a near-perfect recon-
struction was possible even in relatively high noise regimes. In this
case the task is to reconstruct the unknown signals and filter the noise.
Note that this is still an inverse problem, though not blind.
• The evolution of the solution was monitored by evaluating the objective
  function, J, at each iteration step, i; iterations were stopped when the
  change in J fell below a fixed limit (0.001).
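The stopping rule from the last point can be sketched as a generic driver (hypothetical names, not thesis code):

```python
def run_until_converged(step, objective, state, tol=1e-3, max_iters=1000):
    """Iterate `step` and stop when the change in the objective J between
    successive iterations falls below tol (0.001 in the experiments)."""
    J_prev = objective(state)
    for _ in range(max_iters):
        state = step(state)
        J_curr = objective(state)
        if abs(J_prev - J_curr) < tol:
            break
        J_prev = J_curr
    return state

# Toy usage: halving steps on J(x) = x^2 stop once J changes by less than 0.001.
x_final = run_until_converged(lambda x: 0.5 * x, lambda x: x * x, 1.0)
```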
As shown next, good separation was obtained from both datasets.
35These images can be found at http://www.cis.hut.fi/projects/ica/data/images/.
Artificial Mixtures
This experiment is used here mainly to demonstrate the robustness
of the SCA-IT algorithm to sources that are less sparse, and its potential
applicability to natural images. This can be seen by observing e.g. the
histograms of Fig. 3.18, which are far from being 'spiky', and the shape of
the scatterplot of Fig. 3.17.
We compare the SCA-IT algorithm with the noiseless IFA algorithm with
MoG source priors [6] using M = 4 Gaussian components36. We note that the
IFA algorithm should be suitable for this kind of task, since the multimodal
distributions exhibited in natural images can be modelled particularly well
with MoGs, as was shown in Fig. 2.11. (This fact was also exploited in [38],
using a mean-field Bayesian version of IFA.) We also note that Li et al. [111]
also use a sparse decomposition algorithm for separating images; however,
they only operate on edge-detected images, which, by their nature, are sparse.
Therefore, that task is (at least in theory) easier for a sparse algorithm.
The SCA-IT algorithm is parameterized by the threshold, τ. For this
experiment, τ was varied from τmin to τmax, and the optimal threshold, τ⋆,
was selected using a subjective, visual criterion, namely that of balancing
smoothness against crispness of the resulting images. A value of τ = 0.05
was found to give the best results.
Figure 3.11 shows the (unseen) latent images. These images were mixed
into two "observations" using the mixing matrix

    M = | 0.70  0.30 |
        | 0.55  0.45 | .
Figure 3.12 shows the IFA solution along with its scatterplot (Fig. 3.13)
and the estimated PDFs (Fig. 3.14). It can be seen that while the source
images are recovered, there is still some presence of noise.
36The noiseless version of IFA was selected as a competing algorithm because in thistask we want to evaluate the source modelling capability of SCA-IT. Noiseless IFA allowedus to effectively focus on that aspect only.
Figure 3.11: 'Cows & Butterfly' dataset: original sources, downscaled to 64 × 64 pixels.
Figures 3.15 to 3.18 show the solution from SCA-IT. In particular, Fig. 3.15
contains the estimated latent images after 750 iterations of the SCA/IT al-
gorithm. Fig. 3.16 shows their comparison with the known (in this case)
sources, in the form of scatterplots of the corresponding pixel intensities,
ŝ1 vs. s1 and ŝ2 vs. s2. In Fig. 3.17 we see the latent source space, i.e. the
scatterplot ŝ1 vs. ŝ2. Finally, Fig. 3.18 shows the empirical histograms of
the SCA/IT solution. The correlation coefficients between the original and
reconstructed sources were r1 = 0.996 and r2 = 0.998, showing a near-perfect
reconstruction.
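The reconstruction-quality score used above (the Pearson correlation between an original and a reconstructed source) can be computed as follows (a sketch; NumPy assumed):

```python
import numpy as np

def separation_quality(s_true, s_hat):
    """Pearson correlation coefficient between a true source and its
    reconstruction (the r1, r2 scores reported in the text)."""
    return float(np.corrcoef(s_true.ravel(), s_hat.ravel())[0, 1])
```

A score of 1 indicates perfect reconstruction up to an affine (scale and offset) ambiguity, which is inherent to BSS.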
Separating Reflections in Images
A demonstration of the SCA-IT algorithm for the separation of reflections
in an image is shown next. This is the ‘Terrace’ dataset, originally used by
Figure 3.12: IFA results: separated images of (a) 'Cows' and (b) 'Butterfly'. While the source images are recovered, there is some residual noise.

Figure 3.13: Scatterplot of the 'Cows & Butterfly' estimated images, after having been unmixed by IFA.
Adelson & Farid in [61]. A pair of images was acquired through a linear
polarizer in front of Renoir’s ‘On the Terrace’ painting framed behind glass
with a reflection of a mannequin. Note that the mixing is not artificial in
this case. The objective is to remove the reflection, so that the painting is
shown intact.
The observations are shown in Fig. 3.19 and the reconstructions in Fig. 3.21.
The threshold, τ , in this case was empirically set such that the resulting
components balance image reconstruction accuracy and smoothness: the fi-
nal reconstruction is a trade-off between noisy and over-blurred results.
Figure 3.14: Probability density functions of the sources as estimated from the IFA algorithm. M = 4 Gaussian components were used for each source model. The brown curve is the estimated mixture density.

Figure 3.15: SCA-IT results after convergence of the algorithm: separated images of 'Cows' and 'Butterfly'.
Figure 3.16: Scatterplots of the original and the SCA-IT unmixed images.
Figure 3.17: The 'Cows & Butterfly' estimated source space, {(ŝ1,n, ŝ2,n)}Nn=1, after convergence of the SCA-IT algorithm.
Figure 3.18: Empirical histograms of the 'Cows & Butterfly' sources after convergence of the SCA-IT algorithm.
Figure 3.19: ‘Terrace’ (reflections) dataset: observations.
In Figs 3.20 and 3.22 we can see their respective scatterplots. Finally, the
empirical histograms of the estimated sources are shown in Fig. 3.23. No-
tice how the structure in the data, appearing as two elongated clusters
along non-orthogonal directions in data space, emerges after the algorithm
has converged; in particular, compare Fig. 3.20 (before) with Fig. 3.22 (af-
ter). We finally note that the algorithm of Farid and Adelson was specially
constructed for this separation task, while our algorithm is a general-purpose
one.
3.4.2 Blind Separation of More Sources than Mixtures
The goal of this experiment is to reconstruct (‘unmix’) a set of observed sen-
Figure 3.20: ‘Terrace’ dataset: Scatterplot of observations, (x1,n, x2,n).
Figure 3.21: ‘Terrace’ dataset: Estimated (unmixed) source images.
Figure 3.22: 'Terrace' dataset: Scatterplot of estimated sources after 100 iterations of SCA/IT.
Figure 3.23: 'Terrace' dataset: Empirical histograms of estimated sources after 100 iterations of SCA/IT.
sor signals, xd, d = 1, . . . , D, into their constituent generators, the ‘sources’,
sl, l = 1, . . . , L. This is an instance of the blind inverse problem known as ‘the
cocktail party problem': we seek the "causes", {sl}Ll=1, that generated the
observations, but we do not know the generative mechanism in advance (hence
‘blind’). That is, we also have a system identification problem. In the par-
ticular case we are discussing, the signals are timeseries, xd = (xd,1, . . . , xd,T )
and sl = (sl,1, . . . , sl,T ), where the time-index runs from t = 1 to t = T .
We can think of the observed signals as generated from D “microphones”
placed at different positions, d = 1, . . . , D. If the number of unknown source
signals, L, is larger than the number of observed ones, D, the problem is
called ‘overcomplete’. It is of course a much more difficult problem, since the
number of unknowns is larger than the number of observations. To make the
problem well-posed, constraints must be imposed onto the learning system,
usually encoded in the form of priors.
In order to solve the inverse problem an appropriate mathematical model
of the generative mechanism must be posed, along with a model of the un-
known source signals37. We assume a linear mixing (superposition) model
37In the probabilistic framework the model is the assumed probability distribution of the source amplitudes, p(sl,t).
for the observed signals. That is, we assume, at each time-point t,

    xd,t = ∑l=1,...,L Adl sl,t + εd,t .

The term εd,t is a noise process. The ‘mixing’ (or ‘system’) matrix, A = [Adl], is also unknown. It is formed by all combinations
of indices d and l, i.e. for each ‘path’ l → d, and captures some essential
estimated by the algorithm. Note that, in the timeseries case, additional
modelling assumptions about the sources can be made; however no special
structure in the d indices is assumed. We assume, however, that the time-
series is sparsely representable in a wavelet basis. It turns out that, despite
its simplicity, the above formulation works surprisingly well for a wide variety
of problems. A representative result is shown next.
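As a concrete illustration of the generative model just described, the following toy sketch (not the thesis code) draws L = 3 sparse, Laplacian-distributed sources and mixes them into D = 2 noisy observations; the mixing matrix is the one used in the experiment below, and the noise level is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: L = 3 unknown sources, D = 2 observed mixtures (overcomplete).
L, T, D = 3, 1000, 2

# Sparse (Laplacian) source signals: most samples are close to zero.
S = rng.laplace(scale=0.1, size=(L, T))

# Mixing matrix with unit-norm columns (the 'mixing directions').
A = np.array([[1.0, 1 / np.sqrt(2),  1 / np.sqrt(2)],
              [0.0, 1 / np.sqrt(2), -1 / np.sqrt(2)]])

# Linear instantaneous mixing with additive sensor noise:
#   x_{d,t} = sum_l A_{dl} s_{l,t} + eps_{d,t}
noise = 0.01 * rng.standard_normal((D, T))
X = A @ S + noise

print(X.shape)  # (2, 1000)
```

The columns of A are the directions along which the data points concentrate in the scatterplots shown later.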
Experimental Results. We experimented with a set of three sound sources
(from Te-Won Lee et al. [107], downloaded from the authors’ Web site), shown in
Fig. 3.24, that were mixed into two observations using the mixing matrix

    A = [ 1    1/√2     1/√2
          0    1/√2    −1/√2 ] ,

as in [107], giving the mixed signals shown in Fig. 3.25. The scatterplot of
the two observations is shown in Fig. 3.26.
Figure 3.24: Overcomplete blind source separation: true sources, sl, l = 1, . . . , 3 (unknown).
Figure 3.25: Overcomplete blind source separation: observed sensor signals, xd, d = 1, 2.
Figure 3.26: Overcomplete blind source separation: scatterplot of sensor signals, (x1,n, x2,n).
Following Zibulevsky and Pearlmutter [176], and Li, Cichocki and Amari
[111], a first estimate of the mixing matrix, A, was obtained by clustering
in the transform domain38 using a spherical k-means algorithm39 [49]. In
particular, the spectrogram of the observations, Xl(ωk, tm), where ωk is the
k–th frequency bin and tm the m–th time frame, was computed40 and is shown in Fig. 3.27. The
scatterplot of the data points in spectrogram-space, Fig. 3.28, clearly reveals
much more structure than the corresponding one in the original observation
space. Note that the spectrogram was only used for the initialization of
the mixing matrix. The SCA-IT algorithm itself used the standard wavelet
transform as described in Alg. 2.
38Note that we have control over the data matrix, X, because it is observable. Therefore we can perform various exploratory data analyses on X and select the one that works best for the particular situation at hand.
39In a Bayesian formulation one could use e.g. mixtures of von Mises distributions [8], which are a good model for directional data.
40A Hamming window of length R = 256 was used for the short-time Fourier transform, with Nw = 512 points per slice and ‘hop’ distance Lhop = 128. The sampling rate of the signals was fs = 8000 Hz.
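The short-time Fourier transform of footnote 40 can be sketched with SciPy; the signal below is a random stand-in for one observed mixture, and the parameter names follow scipy.signal.stft, where the hop distance is nperseg − noverlap.

```python
import numpy as np
from scipy import signal

fs = 8000                        # sampling rate, as in footnote 40
rng = np.random.default_rng(1)
x = rng.standard_normal(25000)   # stand-in for one observed mixture x_d

# STFT with the parameters of footnote 40: Hamming window of length
# R = 256, Nw = 512 FFT points per slice, hop L_hop = 256 - 128 = 128.
f, t, Zxx = signal.stft(x, fs=fs, window='hamming',
                        nperseg=256, noverlap=256 - 128, nfft=512)

spec = np.abs(Zxx)       # magnitude spectrogram; columns are time slices
print(spec.shape[0])     # Nw/2 + 1 = 257 one-sided frequency bins
```

The clustering step described above is then run on the unit-normalized data points in this transform domain.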
Figure 3.27: Overcomplete blind source separation: spectrogram, Xl(ωk, tm), of the sensor signals.
Figure 3.28: Overcomplete blind source separation: scatterplot of sensor signals in the transform domain, (X1,n, X2,n), where n = n(ωk, tm) is the single index that results from flattening each two-dimensional spectrogram image into a long vector.
Clustering on the unit hypersphere SD−1 using the spherical k–means
algorithm results in the following direction angles for the mixing vectors:
θ1 = 34.9°, θ2 = 89.9°, θ3 = 144.1°. These were used to compute the initial
mixing matrix,

    Ainit = [ cos(θ1)  cos(θ2)  cos(θ3) ]  =  [ 0.82031  0.00083  −0.81029 ]
            [ sin(θ1)  sin(θ2)  sin(θ3) ]     [ 0.57192  1.00000   0.58603 ] .
Figure 3.29: Overcomplete blind source separation: estimated sources, sl, l = 1, . . . , 3, by the SCA/IT algorithm.
We also let the threshold vary such that the (subjective) criterion of
clarity of the separated signals was maximized. The “optimal” threshold
was set to τ⋆ = 10−3. The SCA-IT algorithm was then run for 1000+1000
iterations, first keeping the mixing matrix fixed to its initial value and then
unclamping it; the reconstructed sources are shown in Fig. 3.29. The
reconstruction accuracy as measured by the correlation between the true and
estimated sources was r11 = 0.860, r22 = 0.938, and r33 = 0.949, respectively.
For comparison, the corresponding results from Lee et al. [107] were r11 =
0.890, r22 = 0.927, r33 = 0.933. Girolami [76] reports results on the same
dataset that are better than Lee et al.’s, however41.
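The accuracy measure used here can be sketched as follows; the greedy matching of estimated to true sources (to resolve the usual sign and permutation ambiguities of BSS) is an illustrative simplification, not the evaluation code of the thesis.

```python
import numpy as np

def match_and_score(S_true, S_est):
    """Pair each true source with its best-matching estimate, up to the
    usual BSS sign and permutation ambiguities, and report |correlation|.
    (Greedy matching; a sketch, not an optimal assignment.)"""
    L = S_true.shape[0]
    # np.corrcoef stacks the two matrices; the off-diagonal block holds
    # the correlations between every (true, estimated) pair.
    C = np.abs(np.corrcoef(S_true, S_est)[:L, L:])
    order = C.argmax(axis=1)
    return order, C[np.arange(L), order]

# Toy check: estimates are sign-flipped, permuted, noisy copies of the truth.
rng = np.random.default_rng(2)
S = rng.laplace(size=(3, 5000))
S_hat = np.vstack([-S[2], S[0], S[1]]) + 0.05 * rng.standard_normal((3, 5000))
order, r = match_and_score(S, S_hat)
print(order)   # [1 2 0]: each true source is found among the estimates
```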
Despite the rather “crude” value for the threshold, the fact that the tem-
poral structure of the signals was not taken into account, and the fact that we
“ignored” sensor noise for this experiment, the results are good. We observe,
however, that the sparser the sources are (in the wavelet dictionary used),
the better the results of the algorithm. This can be seen from the scatterplot
of the observations, Fig. 3.26, where the source at angle 45° is very diffuse.
This source is estimated less well by the algorithm, because it cannot find the
corresponding separating direction as well as that of the other sources (recall
that estimation of the mixing matrix is performed in the original observation
domain by SCA-IT). In fact, the wavelet transform is not the best basis
for audio source separation. Zibulevsky et al. [177] used a short-time Fourier
transform as their basis, Φ, which according to their results proved to be the
most efficient representation for voice signals. Moreover, as noted by Li et
al. [111], if the signals overlap significantly (in the transform domain) perfect
reconstruction is not possible, even if we know the mixing operator exactly.
Experimenting with various bases is still an open research question.
3.4.3 Extracting Weak Biosignals in Noise
Many physical and biological phenomena can be better studied in a ‘hidden’
signal space that best reveals the internal structure of the data. In many cases
this space may be adaptively inferred from the data itself42 by searching for
a transformation from the observation space to that hidden space such that
the resulting (transformed) dataset has certain desired properties, captured
41Both Lee et al. [107] and Girolami [76] use a signal-to-noise ratio (SNR) measure to report their results.
42As opposed to using fixed transformations, such as wavelet analysis, for example.
by appropriate prior constraints. Often the data analysis problem can be
formulated as a decomposition problem. In the functional neuroimaging field,
for example, researchers have proposed the decomposition of spatio-temporal
functional MRI data into “independent” spatial components [125]. An fMRI
dataset is obtained by measuring the ‘haemodynamic response’ (change in
blood flow) of tiny volume elements (‘voxels’) of the brain43, denoted by v
and serving as space indices, to an external stimulus at a particular time,
t [104], [133]. Measurements are performed in a brain scanner specifically
designed for this purpose. The primary form of fMRI uses the blood oxygen
level-dependent (BOLD) contrast44 [134]. The aggregate of all voxels in
space, Ω = {v}, gives us a three-dimensional picture of the brain. This is a
particular “snapshot” at time t; repeating this process for a period of time,
discretized at 1, . . . , T , allows us to record the spatially-distributed dynamics
in the brain. The haemodynamic response is related to the neuronal activity
in the brain, and thus, indirectly, neuroscientists may capture certain aspects
of relevant brain processes. (This is an oversimplified description of the
process, however it is sufficient for our purposes.)
Mathematical Model of fMRI Source Separation. Mathematically,
the observations, X = {xt,v : t ∈ {t1, . . . , tT}, v ∈ Ω}, where Ω ⊂ E3 is a
volume in 3D space and {t1, . . . , tT} ⊂ R+ is a time interval sampled at ∆t
(called ‘repetition time’ and denoted by TR in functional neuroimaging), form
a three-dimensional scalar-valued45 time-varying field, (t, v) 7→ xt,v. It is of
interest to neuroscientists to decouple the time-evolution of voxels from the
43In a sense, this is equivalent to the Eulerian formulation in Physics, where a quantity is monitored inside a control volume fixed in space.
44Contrast agents are substances whose purpose is to alter the magnetic susceptibility of tissue or blood, leading to changes in MR signal.
45More precisely, the data are measured in k–space (Fourier space), so the measurements are actually complex; however, these are usually transformed to real space before further processing.
space-variation in the brain response. Therefore, the aim here is to factorize
this spatiotemporal field in terms of a factor capturing the spatial variation
times a factor capturing the dynamics. “Unfolding” the domain Ω into a
linear index set46, IX = {1, . . . , N}, where N is the total number of voxels,
N = |Ω|, and putting the observations at time t, xt, into the t–th row of
a T × N data matrix, X, the analysis of fMRI data into components can be
expressed as

    X ≈ A · S = (Sᵀ · Aᵀ)ᵀ ,    (3.20)

where X is T × N, A is T × L, and S is L × N (so that Sᵀ is N × L and Aᵀ is L × T).
A decomposition is sought such that the set of the column vectors of the “mixing”
matrix47 A, {al}, l = 1, . . . , L, contains the prototypical ‘time-courses’, {Al(t) :
t ∈ {t1, . . . , tT}}, l = 1, . . . , L, and the set of the rows of S contains the corresponding
‘spatial maps’, {Sl(v) : v ∈ Ω}, l = 1, . . . , L.
l=1. The spatial maps can be thought of
as an image of the response of the brain to the stimulus, at each voxel, v.
The observed time-course of voxel v, according to this model, is obtained by
modulating those prototypical timecourses, {al}, l = 1, . . . , L, by the weights sl,v:

    xv = ∑l=1,...,L sl,v al + εv ,    (3.21)
where the residuals, εv, can be thought of as “noise”. Under certain neuro-
physiological assumptions (see [71], [124], for example), a subset of those pro-
totypical timecourses should somehow “resemble” the stimulus timecourse.
46This unfolding operation is a bijection: we can always recover the 3–dimensional volume if we need to. In fact, after processing, the results are displayed in their original spatial form.
47In using the notation shown in Eq. (3.20), we adopt the ‘spatial’ formulation of component analysis (sCA) for the problem of decomposing fMRI timeseries. Strictly speaking, in traditional BSS terminology the matrix Sᵀ should be called the mixing matrix, since it is this matrix that contains the mixing coefficients, {sv,l}, l = 1, . . . , L, for the “regressors” {al}, l = 1, . . . , L, for each voxel, v (see Eq. (3.21) for the observed time-evolution of the voxel, xv). In fact, in the so-called ‘temporal component analysis’ (tCA) model, the dimensionalities are exactly reversed. We will keep the terminology used in the fMRI community here and use the term ‘mixing matrix’ for A.
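The shapes in Eq. (3.20) and the voxel-wise model of Eq. (3.21) can be illustrated on a tiny synthetic example (toy sizes and synthetic maps; not real fMRI data):

```python
import numpy as np

rng = np.random.default_rng(3)
T, L = 60, 2                      # scans and components
nx, ny = 8, 8                     # a tiny "slice"; N = |Omega| voxels
N = nx * ny

A = rng.standard_normal((T, L))   # columns: prototypical timecourses a_l
maps = np.zeros((L, nx, ny))      # spatial maps S_l(v) on the 2-D grid
maps[0, 1:4, 1:5] = 1.0           # one small "activated" rectangle per map
maps[1, 4:7, 3:7] = 1.0
S = maps.reshape(L, N)            # unfold Omega into the linear index set I_X

# Eq. (3.20): spatial formulation, X (T x N) ~= A (T x L) . S (L x N)
X = A @ S + 0.01 * rng.standard_normal((T, N))
print(X.shape)                    # (60, 64)

# Eq. (3.21): the timecourse of voxel v is sum_l s_{l,v} a_l (plus noise)
v = np.ravel_multi_index((2, 3), (nx, ny))
x_v = sum(S[l, v] * A[:, l] for l in range(L))
assert np.allclose(x_v, X[:, v], atol=0.1)

# The unfolding is a bijection: maps can always be re-folded for display.
assert np.array_equal(S.reshape(L, nx, ny), maps)
```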
Component analysis studies exploit this fact in practice to assess the success
of the decomposition, by testing whether some al resulting from the analysis
correlate well with the stimulus48 (and the resulting spatial maps show activated
regions that are neuroscientifically plausible, so that, for example,
if we have a visual stimulus, we expect (parts of) the visual cortex to activate;
complex interactions with other brain areas may also be present). The
hidden data space mentioned in the introductory paragraph is exactly the
S–space. Our task is then to estimate both {sl,v} and {al}.
Extracting useful neuroscientific information from fMRI datasets is far
from trivial, however. fMRI images are very noisy and the signals of interest,
the so-called ‘activations’, typically have very small signal power. To make
things worse, the changes we are looking for are similarly small in scale, on
the order of 3% (Lazar, [106] p. 54). Finally, the measurements are in a very
high dimensional space, making data analysis even more challenging.
The above issues have led to a long line of research on relevant methods.
Component analysis, especially ICA, has been successfully applied to 4D
spatiotemporal fMRI data in two different “flavors”: temporal ICA (tICA),
where each source is a timeseries, and spatial ICA (sICA), where each source
is an image [161]; see [23] for a nice overview. McKeown et al. [125] was
probably the first research paper to propose the decomposition of fMRI data
into independent spatial components. Since then, numerous studies have been
conducted under this decomposition framework in fMRI, with considerable
success. To address certain deficiencies of “classical” ICA, such as square-
only mixing and lack of dimensionality and noise estimation, Beckmann and
Smith [15] propose a probabilistic version of ICA (PICA). PICA extends
spatial ICA using a noise model in the form of a signal subspace identifica-
48More precisely, they are tested for correlation with the expected time-course, which is obtained by convolving the experimental paradigm with the haemodynamic response function.
tion method, utilizing singular value decomposition/probabilistic PCA; this
allows model order selection. The separation is obtained via the FastICA
algorithm. In addition, inference of the activated voxels is obtained, after
separation, using an (external) mixture of Gaussians model such that the
bulk of the voxels, which correspond to non-activation, is modelled by a sin-
gle ‘fat’ Gaussian, while the activated ones are modeled using the rest of the
Gaussian components. Calhoun et al. [24] studied the spatial and temporal
ICA decomposition of fMRI data-sets containing pairs of task-related wave-
forms. Four novel visual activation paradigms were designed for this study,
each consisting of two spatiotemporal components that were either spatially
dependent, temporally dependent, both spatially and temporally dependent,
or spatially and temporally uncorrelated, respectively. Spatial and tempo-
ral ICA was then used on these datasets. The general result was that each
ICA mode “failed” to separate the components that were not independent
according to its assumptions, i.e. spatially independent for sICA and tem-
porally independent for tICA, and “worked” otherwise. Inspired by this
work, Benharrosh et al. [17] reported results of applying ICA algorithms49
to fMRI from the point of view of relating the statistical and geometrical
characteristics of the data to the properties of the algorithm. In particular,
issues such as separatedness (overlapping sources), uncorrelatedness and
independence were discussed. Continuing on this theme, Daubechies et al.
[46] conducted a large-scale experimental and theoretical investigation of the
underlying principles of decomposing brain fMRI data using ICA. A new
visual-task block-design fMRI experiment, inspired by [24], using a super-
position of two spatiotemporal patterns was used. There were 17 runs for
each subject, each run resulting in 238 full brain volumes of 15 slices each.
This research led to the rather surprising result that, while ICA algorithms
are in theory built to seek independent components, their apparent success
is due to the sparsity of the sought-for sources, at least in the fMRI
decompositions. There is no biological reason one should strive for
independence in brain processes. Indeed, sparsity was identified as the main
contributing factor in the success of the method for fMRI. The paper therefore
concluded with the proposal of building models and algorithms that explicitly
optimize for sparsity.
49Using InfoMax as implemented in NISica and FastICA as implemented in Melodic.
While the present text is more ‘methods’-oriented, taking a machine
learning perspective on data analysis/BSS, this work was mainly inspired by
experiments of the neuroscience community in functional MRI [17], and it is
a continuation of the ideas suggested in Daubechies et al. [46].
Simulated Data
In this subsection we repeat the simulations of Daubechies et al. [46] us-
ing the SCA-IT model. This experiment is designed to assess the quality of
separation of various BSS algorithms in well-controlled situations. In par-
ticular, it is designed to help identify the regimes under which the original
BSS algorithm used in [46], that is ICA, may succeed or fail. As was shown
in [77] and [46], ICA surprisingly sometimes fails in situations that should,
by design, succeed, i.e. in some mixing situations where the components are
independent, or nearly independent, and, even more surprisingly perhaps,
manages to successfully separate mixtures which are far from independent.
The data were generated by simulating two spatial components and mix-
ing them into two “observations”, at times t1 and t2. Each component consisted
of an “activated” region (a rectangle) embedded in a noisy background.
We use the following notation: Sl, l = 1, 2, denotes the l–th component
(‘spatial map’) reconstructed by the algorithm, with domain Ω. Ωl ⊆ Ω is
the activated region, whose support appears as the lighter area. Bl are “background”
noise processes, whose supports are the complements in Ω with respect to
the activated regions, Bl = Ω \ Ωl. We will also find it useful to define B,
the complement in Ω with respect to the union of the activated regions,
B = Ω\ (Ωl∪Ωl′). The activated regions of the two components may overlap
in Ill′ = Ωl ∩ Ωl′ . Note that since the overlap, Ill′, differs among different
cases, |B| is not constant. Finally, Zl′ = Ωl′ \Ωl, where l′ = 2− l+ 1, is ‘the
other zone’, i.e. the “footprint” of the other activation in this source. The
sources were then defined as
    Sl(v) = χΩl(v) xl,v + [1 − χΩl(v)] yl,v ,    l = 1, 2 ,
where χA(v) is the characteristic function on A and where x1, x2, y1, y2 are
four independent random variables of which, as v ranges over Ω, the xl,v and
yl,v are independent realizations. Finally, the mixtures were generated by
X(t1, v) = 0.5S1(v) + 0.5S2(v)
X(t2, v) = 0.3S1(v) + 0.7S2(v).
Each activated spatial process had a cumulative distribution function (CDF)
of the form Φx(u) = 1/(1 + e^(2−u)) or Φx(u) = 1/(1 + e^(2(2−u))), depending on the example.
The background process had a CDF of the form Φy(u) = 1/(1 + e^(−1−u)). The
probability density functions of x and y are then ϕx(u) = Φ′x(u) and ϕy(u) =
Φ′y(u). The above functional forms were chosen so that, for the first form, the
ICA algorithms used provided “optimal detectability”, as the nonlinearity
used in the InfoMax ICA algorithm was 1/(1 + e^(−x)), adapted to heavy-tailed
components, while, for the second, there was a slight mismatch, as can be
expected in real applications. Finally, the non-linear function approximating
the negentropy in the FastICA algorithm was g(y) = y² and the “symmetric
approach” of the algorithm was used.
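This generative recipe can be sketched via inverse-CDF sampling of the logistic laws above; the 100 × 100 grid size assumed for Ω below is an illustration choice, not a value stated in this section.

```python
import numpy as np

rng = np.random.default_rng(4)

def logistic_ppf(p, loc, scale):
    """Inverse CDF of the logistic law Phi(u) = 1/(1 + exp(-(u - loc)/scale)).
    Phi_x(u) = 1/(1 + e^(2-u)) is logistic(loc=2, scale=1); the second form,
    1/(1 + e^(2(2-u))), is logistic(loc=2, scale=1/2); the background
    Phi_y(u) = 1/(1 + e^(-1-u)) is logistic(loc=-1, scale=1)."""
    return loc + scale * np.log(p / (1.0 - p))

shape = (100, 100)                    # assumed grid size for Omega

x1 = logistic_ppf(rng.uniform(size=shape), loc=2.0, scale=1.0)   # activation
y1 = logistic_ppf(rng.uniform(size=shape), loc=-1.0, scale=1.0)  # background

# Case 1 geometry: Omega_1 = {11,...,40} x {21,...,70} (1-based indices).
chi1 = np.zeros(shape)
chi1[10:40, 20:70] = 1.0              # characteristic function chi_{Omega_1}

# S_l(v) = chi_{Omega_l}(v) x_{l,v} + [1 - chi_{Omega_l}(v)] y_{l,v}
S1 = chi1 * x1 + (1.0 - chi1) * y1

# With a second component S2 built the same way, the mixtures would be
# X(t1) = 0.5 S1 + 0.5 S2 and X(t2) = 0.3 S1 + 0.7 S2.
```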
Each case is then defined by two parameters:
1. The overlap between activated regions, Ω1 and Ω2. This also defines
the spatial (in)dependence between S1 and S2, according to Golden [77]
and Daubechies et al. [46]; see below.
2. The functional form of the PDF (CDF) of the activations, Φ1,x(·) or
Φ2,x(·).
Geometrically, the objects to be defined for each case were then: Ωl, Zl′, Bl,
Ill′ , as defined above (note that l′ = 3 − l). Golden [77] and Daubechies et
al. [46] give the condition for spatial (in)dependence between S1 and S2 as:

    p12 = p1 p2  ⇔  |Ω1 ∩ Ω2| / |Ω| = (|Ω1| / |Ω|) · (|Ω2| / |Ω|) .    (3.22)
Now, the cases are described on p. 10418 (i.e. p. 4) of Daubechies et
al. [46]. In particular,
• Case 1: In this example, Ω1 = {11, . . . , 40} × {21, . . . , 70}, and Ω2 =
{31, . . . , 80} × {41, . . . , 80}. By Eq. 3.22, S1 and S2 are independent.
For the CDF, Φx(u), we choose Φ1,x(u) = 1/(1 + e^(2−u)).
• Case 2: All choices are identical to Case 1, except that the CDF, Φx,
is picked differently: Φx(u) = Φ2,x(u) = 1/(1 + e^(2(2−u))). The components
S1 and S2 are still independent.
• Case 3: A different variant of Case 1. The location of Ω2 is now shifted
to {43, . . . , 92} × {53, . . . , 92}. The new Ω2 and Ω1 do not intersect,
i.e., S1 and S2 are spatially separated, not independent. The PDFs of
S1 and S2 are unaffected.
• Case 4: The sets Ω1 and Ω2 are as in Case 3, but Φx is as in Case 2;
S1, S2 are not independent.
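The independence condition, Eq. (3.22), can be checked numerically for these geometries; the total domain size |Ω| is an assumption here (the case numbers are consistent with a 100 × 100 grid).

```python
def rect(rows, cols):
    """A rectangle {rows[0],...,rows[1]} x {cols[0],...,cols[1]} as a set of
    (row, col) index pairs."""
    return {(i, j) for i in range(rows[0], rows[1] + 1)
                   for j in range(cols[0], cols[1] + 1)}

n_omega = 100 * 100                    # assumed |Omega|
O1 = rect((11, 40), (21, 70))          # Omega_1 (all cases)
O2 = rect((31, 80), (41, 80))          # Omega_2 (Cases 1, 2)
O2_shifted = rect((43, 92), (53, 92))  # Omega_2 (Cases 3, 4)

def independent(A, B, n):
    """Eq. (3.22): |A ∩ B|/n = (|A|/n)(|B|/n), checked in exact integers."""
    return len(A & B) * n == len(A) * len(B)

print(independent(O1, O2, n_omega))          # True:  Cases 1, 2 independent
print(independent(O1, O2_shifted, n_omega))  # False: Cases 3, 4 not independent
```

With these sizes, |Ω1| = 1500, |Ω2| = 2000 and |Ω1 ∩ Ω2| = 300, so the condition 300/10000 = 0.15 · 0.2 holds exactly for Cases 1 and 2.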
Regarding the above four cases, we note that Cases 1 and 2 should be sepa-
rable by ICA, while Cases 3 and 4 need not be.
A source separation algorithm is deemed successful in separating the components
in this task if there is no remaining “shadow” of ‘the other zone’. This
is reflected in the histograms as a line-up of the corresponding areas. In
particular, if Zl′ in the source Sl cannot be differentiated from B, then the
algorithm was successful. For overlapping sources (that is, overlapping Ω1,
Ω2, Cases 1 and 2, where the overlap is Ill′ ≠ ∅), we must test for the additional
requirement that Ill′ is not discernible from Ωl.
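In this section the criterion is judged visually from the histograms; a hypothetical quantitative proxy (not used in the thesis) would be a two-sample Kolmogorov–Smirnov test between the ‘other zone’ and background samples:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)

def shadow_remains(other_zone, background, alpha=0.01):
    """Two-sample KS test as a proxy for the visual histogram criterion:
    if the samples from Z_l' and from B are statistically distinguishable,
    a 'shadow' of the other component remains in this source."""
    return ks_2samp(other_zone, background).pvalue < alpha

# Successful separation: both regions follow the same background law.
bg = rng.logistic(loc=-1.0, size=2000)
zone_clean = rng.logistic(loc=-1.0, size=1000)
# Failed separation: the other zone carries residual activation energy.
zone_shadow = rng.logistic(loc=-1.0, size=1000) \
    + 0.5 * rng.logistic(loc=2.0, size=1000)

print(shadow_remains(zone_shadow, bg))   # True: a shadow is detected
```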
The results of applying ICA to the data set, taken from [46], are shown
in Fig. 3.30. Both Infomax ICA and FastICA fail to separate Case 2, while
they manage to separate the non-independent Case 4. On the contrary, as
can be seen from Fig. 3.31, the SCA-IT model successfully separates all four
cases. The threshold τ was empirically set to 0.1 for all four cases for this
experiment and we ran 500 iterations of the algorithm.
Real Brain Data
We next apply the SCA-IT algorithm to a particular real-world fMRI ex-
periment. This is an fMRI timeseries used by Friston, Jezzard, and Turner
in [71]. It was chosen because it is small enough to illustrate the SCA/IT
model in an actual neuroimaging analysis situation without requiring special
programming techniques for handling very large datasets50. Results from
more complex fMRI data and comparisons with the state-of-the-art will be
presented in Chapter 5.
For this data set, brain scans were obtained every 3s from a single subject
using a 4.0–T scanner, resulting in 5mm-thick image slices of size 64 × 64
voxels through the coronal plane. Sixty scans were acquired. The stimula-
50Furthermore, it allows us to compare results with a stochastic (Monte Carlo) solution in the future.
Figure 3.30: Simulated fMRI data. Unmixing the 4 mixtures of rectangular components described in the text. (Left) PDFs of the original 2 components, and of the components as identified by InfoMax and by FastICA. Color coding: the whole component (blue), the active region (red), the ‘background’ (green); in the ICA outputs, the purple PDF corresponds to the zone associated to ‘the other component’; the background PDF is then for the area outside Ω1 and Ω2. Separation is completely successful only when the purple and green PDFs line up, i.e., in Cases 1, 3 and 4, but not in Case 2. (Right) False-color rendition of unmixed components for Cases 2 and 4, as obtained by InfoMax, FastICA, and a more sophisticated ICA algorithm that learns PDF distributions [6], [151]. InfoMax and FastICA do better in Case 4 (separated components) than in Case 2 (independent components); for the pdf-learner it is the converse. (Figure from ref. [46].)
Figure 3.31: Results of the SCA-IT model applied on the simulated data of subsection 3.4.3, for Cases 1–4. Each separated component is shown with its corresponding set of histograms for the areas mentioned in the text to its right. The color scheme is the following: component (Sl) blue, activated area (Ωl) red, background (Bl) green, ‘other zone’ (Zl′) magenta, and intersection (Ill′) yellow. The green vertical line has abscissa equal to the background mean, and the red one abscissa equal to the activation mean. The goal of the separation is to align the histograms of the background and ‘other zone’ so as to be indistinguishable, without ‘ghosting’ residuals; the same for the activation and intersection regions.
tion, provided by light-emitting goggles at 16Hz, followed a block design,
alternating between ‘off’ and ‘on’ states. In particular, it was off for the first
10 scans (30s), on for the next block of 10, off for the third, etc. See ref. [71]
for more neuroscience-specific details. The original data were interpolated
to 128× 128 voxels in [71], so we followed the same procedure, bringing the
data to the same resolution by simple bicubic interpolation, in order to be
able to compare results.
The results from a Statistical Parametric Mapping (SPM) analysis of
the data from Friston, Jezzard, and Turner [71] are shown first. Figure 3.32
shows the extracted spatial map and Fig. 3.33 the corresponding timecourse.
Due to the nature of this method, which essentially uses correlation in order
to extract activated voxels, large “blobs” of voxels spanning large brain areas
appear to be activated; we see a typical result of that method in the spatial
map of Fig. 3.32. The thresholded spatial map (Fig. 3.34) gives a clearer
picture, showing regionally specific activations. A p–value of 0.05 was used,
resulting in a threshold of 3.97. Figure 3.34 shows the resulting ‘excursion
set’. We will use this semi-supervised result as our ‘gold standard’ to
assess the quality of the separation of SCA for this dataset.
We compared this result with the result from an Independent Component
Analysis (ICA) of the same data. ICA is widely used as one of the major
unsupervised learning methods in fMRI analysis and is now regarded as the
standard for ‘model-free’ analyses. As in Daubechies et al. [46], for each of
the separated components we computed the correlation coefficient, r, between
the associated timecourse, al(t), and the ‘expected timecourse’, that is the
time-paradigm, ϕ(t), convolved with the haemodynamic response function,
h(t). The component with the highest value of r was identified as the CTR
component map, C(v).
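The CTR-selection step can be sketched as follows; the gamma-shaped haemodynamic response function below is a crude stand-in (the actual HRF model is not specified in this section), and the component matrix is synthetic.

```python
import numpy as np

T, TR = 60, 3.0                          # 60 scans, one every 3 s
# Block-design paradigm phi(t): alternating off/on blocks of 10 scans.
phi = np.tile(np.r_[np.zeros(10), np.ones(10)], 3)

# A crude gamma-like HRF (an assumption made here for illustration).
t = np.arange(0, 30, TR)
h = (t / 6.0) ** 2 * np.exp(-t / 6.0)
expected = np.convolve(phi, h)[:T]       # expected CTR timecourse, h (x) phi

def pick_ctr(A, expected):
    """A: T x L matrix of component timecourses a_l; return the index of
    the component with the highest |r| against the expected timecourse."""
    r = [abs(np.corrcoef(A[:, l], expected)[0, 1]) for l in range(A.shape[1])]
    return int(np.argmax(r)), max(r)

# Toy check with synthetic components, one of them task-related.
rng = np.random.default_rng(6)
A = rng.standard_normal((T, 4))
A[:, 2] = expected + 0.3 * rng.standard_normal(T)   # plant a CTR component
l_star, r_star = pick_ctr(A, expected)
print(l_star)
```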
The analogous ICA decomposition, however, did not produce a single
spatial map that exhibited all the “features” (voxel clusters) shown in the
Figure 3.32: Spatial map from SPM (Statistical Parametric Mapping) for the real brain dataset (coronal plane). (From Friston et al. [71].)
Figure 3.33: Timecourse from SPM (dots). The stimulus (‘time-paradigm’, ϕ(t)) is the box-like, continuous curve (blue). (From ref. [71].)
Figure 3.34: Thresholded spatial map from SPM using threshold u = 3.97 (p–value 0.05). (From ref. [71].)
SPM spatial map. That is, although the ICA CTR map s1 (i.e. the “best”
component) is quite good in terms of correlation coefficient (r = 0.916) and
seems to capture most of the spatial pattern, comparing it with the excur-
sion set of Fig. 3.34 it is evident that it lacks certain details. In order to
capture the missing spatial information, we selected additional IC compo-
nents, provided that their corresponding timecourses correlated most with
the stimulus (above a certain standard threshold, r > 0.3). (It turned out
that only one additional component was eventually required. Its correlation
coefficient was r = 0.896.) Figures 3.35 and 3.36 show the resulting spatial
maps, s1, s2, and associated timecourses, a1, a2, from ICA. The component
maps were ranked according to their similarity to the SPM map, taken here
as the “ground truth”. The missing details of the spatial map s1 indeed
appear in the second spatial map, s2. In other words, and comparing the
ICA maps with the reference thresholded SPM map of Fig. 3.34, we see that
activation information is spread between two components in this case.
Figure 3.35: Spatial maps s1, s2 from ICA.
Figure 3.36: Timecourses a1, a2 from ICA, corresponding to the spatial maps s1, s2 of Fig. 3.35. Each panel shows the IC timecourse, the experimental design, and the expected CTR timecourse.
Sparse Component Analysis
We ran SCA-IT on the dataset in order to detect ‘Consistently Task Related’
(CTR) (“predictable”) components. Figure 3.37 shows the CTR spatial map,
s1, extracted by SCA. The corresponding timecourse of the component, a1,
shown in Fig. 3.38, exhibits very good correspondence with the expected timecourse.
The correlation coefficient was rSC1 = 0.940. This component can be
interpreted as one that is due to a single physiological process. It matches
the result from Friston et al. [71] very well. Note, however, that SCA is
completely unsupervised. That is, as in ICA, the prototypical timecourse is
estimated from the data itself. In the CTR map produced by SCA, however
(Fig. 3.37), the pattern of activation appears in a single component, as it
should. Recalling that the reason for the decomposition was to decouple
spatial variation from the temporal evolution of the brain activations in a
parsimonious manner in the first place, these results show that SCA performs
qualitatively better than ICA in extracting those “prototypical timecourses”
and their corresponding CTR maps from the data.
Following Daubechies et al. [46] again, in order to quantitatively com-
pare the spatial maps of the above decompositions, we computed the receiver
operating characteristic (ROC) curves between the (thresholded) SPM map,
taken as a benchmark (‘gold standard’), and the SCA and ICA maps, re-
spectively51. The aim is to quantify the similarity of the spatial patterns
produced by SCA/ICA and SPM. ROC curves capture the trade-off between
‘sensitivity’ and ‘specificity’ of an algorithm as a graph between the false pos-
itive rate, FP (or ‘false alarms’), taken as the abscissa, X, and true positive
rate, TP (or ‘hits’), taken as the ordinate, Y . If we threshold an extracted
spatial map at a certain threshold, γ, we get a binary image in which voxels
51 It is standard practice, in the use of component analysis for fMRI, to threshold the CTR component obtained. If a binary ground truth of activation is available, one may assess the quality of the results by means of ROC curves. Typically, however, such “ground truth” maps are not available [46].
Figure 3.37: Spatial map s1 from SCA. The component whose corresponding timecourse correlated most with the stimulus/expected timecourse was identified as the CTR component.
Figure 3.38: Timecourse a1 from SCA, corresponding to the spatial map s1 (thick curve). The time-paradigm, ϕ(t), is the blue curve, and the expected timecourse, h(t) ⊗ ϕ(t), is the green one.
above the threshold, sl,v ≥ γ, are assigned a label 1, meaning ‘activated’, and
those below it are assigned a label 0, signifying ‘non-activated’ voxels. As
we set the threshold lower and lower, more and more activated voxels will be
included in the ‘activated’ set, but so will false positives. The ROC curve is
constructed here by varying the threshold, γ, of the CTR component maps
between their minimum and maximum values. Then the pair (FP, TP ) be-
comes a γ-parameterized curve, (X(γ), Y (γ)), between the points (0, 0) and
(1, 1). A random (chance-level) result would produce a diagonal linear
segment between these two points, while a perfect result would be a Γ–
shaped curve through the points (0, 0)–(0, 1)–(1, 1). In reality, ROC curves
fall between these two extremes in the ROC space.
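The threshold sweep described above can be sketched in a few lines. The snippet below is an illustrative Python version (our implementation was written in Matlab/Octave; the function names here are ours, not from the thesis): the threshold γ is swept from the map's maximum down to its minimum, and at each value the binary labelling is compared against a binary 'gold standard' map.

```python
import numpy as np

def roc_curve(component_map, truth, n_thresholds=256):
    """Sweep the threshold gamma over the component map; at each gamma,
    voxels with value >= gamma are labelled 'activated' (1). Returns the
    gamma-parameterized (FP, TP) curve from (0, 0) to (1, 1)."""
    gammas = np.linspace(component_map.max(), component_map.min(), n_thresholds)
    truth = truth.astype(bool)
    fp = np.empty(n_thresholds)
    tp = np.empty(n_thresholds)
    for i, g in enumerate(gammas):
        labelled = component_map >= g
        tp[i] = (labelled & truth).sum() / truth.sum()       # 'hits'
        fp[i] = (labelled & ~truth).sum() / (~truth).sum()   # 'false alarms'
    return fp, tp

def auc(fp, tp):
    """'ROC power': area under the (FP, TP) curve, by the trapezoidal rule."""
    return float(np.sum((fp[1:] - fp[:-1]) * (tp[1:] + tp[:-1]) / 2.0))
```

A map identical to the binary benchmark gives the Γ-shaped curve with AUC = 1, while an uninformative map stays near the diagonal with AUC ≈ 0.5.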
In the reference paper of Friston et al. [71], only a 36×60 window (‘region
of interest’, ROI) was used. We performed the analysis into components on
the whole images, but we selected the same ROI after the decomposition for
comparison purposes. Because we did not have the origin of their window
available, we estimated it using a simple correlation matching scheme. The
result is shown in figure 3.39. The peak of the correlation indicated a shift
of origin to coordinates (xo, yo) = (24, 38). Visual inspection showed a near
perfect match.
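A correlation-matching scheme of this kind can be sketched as a brute-force search: slide the reference window over the image and keep the offset with the highest correlation coefficient. The following Python sketch is our own illustration (not the thesis code, and not optimized):

```python
import numpy as np

def match_window(image, window):
    """Slide `window` over `image` and return the origin (x0, y0), in
    (column, row) order, that maximizes the correlation coefficient
    between the window and the corresponding image patch."""
    wh, ww = window.shape
    ih, iw = image.shape
    w = window - window.mean()
    best, origin = -np.inf, (0, 0)
    for y in range(ih - wh + 1):
        for x in range(iw - ww + 1):
            patch = image[y:y + wh, x:x + ww]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (w * w).sum())
            if denom == 0:
                continue  # flat patch: correlation undefined, skip
            r = (p * w).sum() / denom
            if r > best:
                best, origin = r, (x, y)
    return origin, best
```

In practice one would compute this with an FFT-based normalized cross-correlation, but the exhaustive version above makes the “peak of the correlation image” explicit.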
The result of the ROC analysis, shown in figure 3.40, shows a considerable
improvement of the SCA map over the ICA maps. In particular, the ‘ROC
power’, as measured by the ‘Area Under the Curve’ (AUC), was A′SC = 0.861
for the SCA CTR component, and A′IC1 = 0.801 and A′IC2 = 0.721 for the
best and second-best ICA components, respectively. Finally, we follow
Skudlarski [157] and choose an acceptable false positive probability of 0.1,
shown as the vertical line in Fig. 3.40.
We found the selection of the wavelet basis not to be very critical for the
results, as long as we retained a certain level of smoothness. The Daubechies’
most-compactly-supported wavelet D8 was used in the example shown here.
Figure 3.39: Correlation matching between the window used by Friston et al. [71] and the full dataset. The peak of the correlation “image” clearly appears near the center of the rectangle, as a dark “dot”.
For the fMRI data-set discussed here, the run took approximately 10 min of
CPU time for 2000 iterations on an Intel ‘Centrino Duo’™ processor, using
a program written in the Matlab™ language (running the GNU Octave free
software implementation under the Linux OS)52.
Since the defining characteristic of our SCA method is its emphasis on
sparsity, and recalling the analysis in the paper of Daubechies et al. [46], we
conclude that this result is due to imposing a more efficient sparsity prior
on the spatial maps.
We finally note that, as with any mathematical model applied to ‘real-
world’ data, the semantic interpretation of the results of SCA decomposition
can only be performed according to knowledge in the particular application
domain.
52 This time is for decomposing the data into L = T = 60 components. While every effort was made to produce a fully vectorized program, no other optimization was attempted. We expect that using a compiled language would reduce the execution time considerably.
Figure 3.40: Receiver Operating Characteristic (ROC) curves between the SCA-IT derived CTR map and SPM (red curve), and between the ICA-derived maps and SPM (green curve: best ICA component, s1; blue curve: second-best ICA component, s2). The diagonal line represents the line of chance accuracy. The further above this line the ROC curve is, the better the algorithm under investigation performs. This can be quantified using the ‘area under the curve’, AUC. The vertical line separates the acceptable region of the curve (to the left of the vertical line) from the unacceptable one. In the neuroscience community, the acceptable probability of false positives is in general very low, resulting in a strict threshold. The upper bound of 0.1 for FP in this text comes from Skudlarski [157].
3.5 Setting the Threshold
An important modelling decision is the setting of the threshold, τ , in the
algorithm. This value could be set empirically as in subsection 3.4.1 to
obtain a balance between sharpness and noise [45].
We note here that if we pick the value of the threshold in the vicinity
of the “optimal” one then the algorithm converges to a good answer. If the
threshold is set too high, however, we may get overblurred results and if it
is set too low we get noisy ones. One can in theory re-run the experiments
with different threshold values and pick the value that optimizes the ROC
power, for example. This approach is not very practical, however: since
there is no analytic expression relating these two quantities, one has to
iterate through the whole process many times until the optimum value for τ
is found, which is computationally very demanding. More importantly, in
exploratory analyses one may not have the ground truth in the first place.
Another option is to estimate the threshold from the data as well. We use
the following heuristic. First observe that, due to the bilinear nature of the
optimization functional of Eq. (3.11) with respect to (A, S), we may exchange
scale factors between these two quantities without the energy, E, being
changed, provided that we rescale the other quantity accordingly. We
exploit this indeterminacy by making the mixing vectors, al, unit vectors,
el(θl), rescaling their 2–norms to unity, ‖al‖2 = 1. In effect, we let al evolve
on the surface of a unit hypersphere, SD−1, with only their direction angles,
θl, as free parameters. Li et al. use a similar constraint in [111]. We also set
λA equal to 1. (Since the purpose of the penalty on A (third term of
Eq. (3.11)) is to keep it from growing too big, and since we have already
rescaled al to unit norm, this does not really affect the solution.) The above
makes E[· · · ; λC, λA] a function of the Lagrange multiplier λC only.
Now, the obvious but important observation is that the actual threshold
value should depend on the relative magnitude of the wavelet coefficients
to be thresholded, with respect to the ones to be retained. Recalling the
empirical definition of sparsity given in Section 3.2.2 and Footnote 13, that
“most” coefficients should be “small” and only “a few of them” should be
“significantly different than zero” (the former being the ones that should be
filtered out and the latter the ones to be retained), one should give a concrete
formula implementing this concept. For an ℓ1 norm prior, an estimate of the
‘dispersion’, δcl, of the wavelet coefficients of the l–th source, cl,λ, around
their (zero) mean is
    δcl = (1/Λ) ∑_{λ=1}^{Λ} c²_{l,λ} ,    (3.23)
based on a first estimate of the values53 of cl,λ. In doing this, we implicitly
assumed that the cl,λ are equivalently drawn from a Laplace distribution with
scale parameter 1/√δcl
(Zibulevsky et al., [176]).
We also make an estimate of the observation noise level, lν, as the variance
of a white Gaussian process, using a technique inspired by the ‘universal
thresholding’ technique of Donoho and Johnstone [53]: we decompose the
observations, X, in wavelet space and assume that the finest-scale
coefficients, {x_{t,λJ}}, λJ = 1, . . . , ΛJ, are due to noise. A robust estimate
of the square root of the noise level is given by the median absolute
deviation (MAD) of the wavelet coefficients at the finest scale (Donoho, [54]),

    σν,MAD = K mad(x_{λJ}) ,    (3.24)

53 This estimate can be obtained from a previous decomposition, for example. For the class of signals we are interested in in this application, a first ICA decomposition may give a very good starting point. Note that we are not so much interested in a precise estimate of the actual values of the wavelet coefficients of the sources here, but rather in getting an estimate of the ‘summary statistic’, here the scale parameter, δcl.
where K = 1/Φ−1(3/4) ≈ 1.4826 and Φ−1(·) is the inverse of the cumulative
distribution function for the standard normal distribution. The median ab-
solute deviation (MAD) robust statistic, a measure of dispersion, is defined
as
mad(xi) = mediani(|xi −mediani(xi)|) . (3.25)
(Another option could be to use a technique proposed by Beckmann and
Smith for their probabilistic ICA algorithm [15], and consider the smallest
eigenvalues of the data matrix as corresponding to noise.) The above estimate
is used for the first ‘epoch’54 of training. A better estimate is obtained in
the second phase of the algorithm by employing the empirical statistic of
Eq. (3.8), as

    σ²ν = (1 / ((T − L)N)) Tr[(X − AS)ᵀ(X − AS)] ,    (3.26)
using the estimates of (A,S) from the previous phase, where the denominator
is the ‘effective number’ of parameters [147].
Using the above, we finally estimate the threshold, τl, using the equation55

    τl = 2 lν / √δcl .    (3.27)
A schematic representation of the above steps is shown in Fig. 3.41.
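The estimates of Eqs. (3.23), (3.24), and (3.27) can be collected into a short sketch. The Python below is our own illustration (the actual implementation was in Matlab/Octave); it assumes the reading of Eq. (3.27) in which lν is the noise *variance*, i.e. the square of the MAD-based estimate of the noise standard deviation, and in which the threshold scales inversely with √δcl:

```python
import numpy as np

K = 1.4826  # K = 1 / Phi^{-1}(3/4), Eq. (3.24)

def mad(x):
    """Median absolute deviation, Eq. (3.25): median_i(|x_i - median_i(x_i)|)."""
    return np.median(np.abs(x - np.median(x)))

def dispersion(c_l):
    """Dispersion delta_cl of the l-th source's wavelet coefficients
    around their (zero) mean, Eq. (3.23)."""
    return float(np.mean(np.asarray(c_l) ** 2))

def noise_level(finest_scale_coeffs):
    """Noise level l_nu (a variance): square of the robust MAD estimate
    of the noise standard deviation, Eq. (3.24)."""
    return (K * mad(finest_scale_coeffs)) ** 2

def threshold(c_l, finest_scale_coeffs):
    """tau_l = 2 * l_nu / sqrt(delta_cl), under our reading of Eq. (3.27)."""
    return 2.0 * noise_level(finest_scale_coeffs) / np.sqrt(dispersion(c_l))
```

In the two-epoch scheme of Fig. 3.41, `finest_scale_coeffs` would come from the wavelet transform of X in the first epoch, while in the second epoch the noise level would instead come from the residual-based estimate of Eq. (3.26).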
The above procedure works quite well in practice. An alternative proce-
54 An epoch is defined here as a batch of iterations during which the control parameters, such as the threshold, τ, are held constant and only the unknown signals and model parameters are updated.
55 This equation corresponds to viewing the terms of the optimization functional, Eq. (3.11), in terms of the equivalent stochastic quantities described above, making use of the χ² interpretation of Eq. (3.8) for the observations, with Σ = σ²X ID, and multiplying through by 2σ²X. Note that the value of the optimizer does not change if we do this, since argminx cE(x) = argminx E(x), ∀c ∈ ℝ⋆. The above results in λcl = 2σ²X wl, with wl := 1/√δcl.
Figure 3.41: Schematic representation of the Iterative Thresholding algorithm for blind signal separation as implemented in our system. The red box represents the IT algorithm proper (Alg. 2), parameterized by the threshold, τ. Arrows represent flow of information between the various variables in the model. During the initialization step, i = 0, the data, X, along with an initial value of S are used in order to obtain a first estimate of the threshold using the procedure described in the text (dashed arrow and right-hand side blue and green arrows, respectively). The algorithm is run for a first epoch using this estimate, after which we obtain a new estimate of (S, A). This is used to obtain a better estimate of the threshold, which we use for the final epoch run.
dure could be one that finds a “universal” (i.e. typical) scale parameter δcl
for this class of problems, as was done in [150], for example. In Chapter 4,
however, we propose a more principled, hierarchical Bayesian approach to
estimating the threshold. In particular, we propose to extend the determin-
istic IT method and use a variational approach for blind inverse problems in
a probabilistic, generative modelling framework. This will allow us to learn
under uncertainty and estimate important parameters of the model from the
data itself. We take the approach of mean-field variational bounding to the
likelihood and derive a variational, EM-type algorithm for MAP estimation.
Under this formulation, a “soft” version of iterative thresholding emerges as
a module of an extended set of update equations that include estimators for
the controlling ‘hyperparameters’ of the model.
Chapter 4
Learning to Solve Sparse Blind
Inverse Problems using
Bayesian Iterative Thresholding
4.1 Introduction
In the previous Chapter, we formulated the optimization functional for
SCA, Eq. (3.11), from an energy minimization viewpoint (Olshausen and
Field, [137]). This objective function explicitly optimizes for maximal spar-
sity, capturing the physical constraint of minimum entropy decomposition
(Foldiak, [67]; Zemel, [175]; Olshausen and Field, [137]; Harpur and Prager,
[80]; Field, [64]; Barlow, [10]; Barlow et al., [13]; Atick and Redlich, [4], [5]).
The variational functional, E , is parameterized by two Lagrange multipliers,
λC and λA, for the two penalty terms corresponding to the expansion coef-
ficients of the sources and to the mixing matrix, respectively. Setting these
correctly can be quite tricky, often leading one to resort to ad-hoc methods.
Another delicate issue, directly related to the above, is the choice of the
threshold value, τ , in the iterative thresholding algorithm, used during the
optimization of the objective.
In order to address these issues, one source of inspiration might be anal-
ogous ideas developed in the field of Bayesian statistics. An example of this
methodological approach is the interpretation of ‘early stopping’ in neural
networks training as Bayesian regularization by MacKay [117]. Olshausen
and Field [138], Lewicki and Sejnowski [109], Girolami [76], Li, Cichocki,
and Amari [111], and Teh et al. [163] also propose sparse and overcomplete
representations in a probabilistic framework stemming from the pioneering
work of Olshausen and Field [137]. The setting of blind inverse problems is a
natural candidate for the application of the Bayesian probabilistic methodol-
ogy, allowing us to incorporate ‘soft’ constraints in a natural manner, through
Bayesian priors (Tarantola, [162]; Idier, [92]; Calvetti and Somersalo, [25]).
In the Bayesian framework, one can flexibly encode weak assumptions about
the object of interest, such as smoothness, positivity, etc. (Sivia, [156]). Li
and Speed [110] note that the idea of a ‘well-defined statistical model’ is the
statistical counterpart of the notion of well-posedness from a probabilistic
perspective. A well-defined statistical model should have properties such as
identifiability of the unknowns, existence of an unbiased estimate, and stabil-
ity of the estimates in the statistical sense, e.g. variance, robustness, etc. By
putting probabilistic constraints on the unknowns, the inherent ill-posedness
of many inverse problems can be overcome. As an example of a parallel
between the classical and statistical view of regularization, the equivalent of
Tikhonov regularization is the method of ridge regression in statistics. In
general, any variational regularization scheme of the form of Eq. (3.1) is for-
mally equivalent to Bayesian inversion (Borchers, [20]), with the second term
corresponding to a prior distribution on the unknowns. The interpretation
of the solutions, however, is different. Instead of a single “best estimate”
of the classical solution (the minimizer, x⋆), the Bayesian approach infers a
probability distribution over all possible solutions, x, given the data, i.e.
a ‘posterior’, P (x|y). This is the Bayesian solution to the statistical inverse
problem [25]. Only then may one choose the most probable solution under
the model, for example, as the “representative” of the ensemble of all possible
solutions. Note that in this text we maintain the posterior distribution over
the latent variables throughout inference, however. The Bayesian approach
to inverse problems is elaborated in section 4.2.
Other related work
In the statistics community, several authors have proposed statistical ver-
sions of classical deterministic regularization methods. O’Sullivan [60] gives
a statistical perspective on ill-posed inverse problems and suggests the use of
the mean-square error as a tool for assessing the performance characteristics
of an inversion algorithm. He bases his method on extensions of the Backus-
Gilbert averaging kernel method, used in geophysics. He also proposes some
experimental design criteria based on these ideas. Fitzpatrick [65] gives a
Bayesian-analysis perspective on inverse problems and relates ‘Bayesian
maximum-likelihood’ estimation to Tikhonov regularisation. He applies
the expectation-maximisation algorithm to the problem of setting
regularisation levels, and develops a framework for infinite-dimensional Bayesian
analysis on Banach spaces. Li and Speed [110] give a statistical counterpart
of the notion of well-posedness in the idea of a well-defined statistical model
(which should possess identifiability of the unknowns, existence of an un-
biased estimate, or something close to this, and reasonable stability of the
estimates in the statistical sense, i.e. variance, robustness, etc.), and develop
a deconvolution method for positive spikes, the synchronisation-minimisation
algorithm, based on the Kullback-Leibler divergence. Calvetti and Somersalo
[25] especially emphasize the relationship between inverse problems and sta-
tistical inference under the Bayesian framework. This is actually a two-way
interaction, in that interesting inverse problems pose significant computa-
tional challenges, leading to novel Bayesian analysis, on the one hand, and
Bayesian analysis provide “priorconditoners” for stabilizing the ill-posedness
of inverse problems, on the other. Finally, Ghahramani studied the solution
of inverse kinematic problems using the EM algorithm [72].
Proposed Method
In this Chapter we revisit sparse blind source separation and cast it into
a stochastic framework, in which the unknown source signals are modelled
as realizations of stochastic processes. We view blind inverse problems
from a statistical data modeling perspective (Attias, [6]), and the search for
their solution as a statistical inference problem (Abramovich and Silverman,
[1]). We adopt the Bayesian framework (Sivia, [156]) as the foundation for
modelling and inference and for characterizing uncertainty in the parameters
of the model.
The wavelet-based SCA model presented in the previous Chapter can also
be interpreted within the probabilistic framework. Based on the functional E, we present here the derivation of the corresponding probabilistic model. This
will enable us to build a hierarchical model and employ Bayesian inference in
order to estimate the above mentioned and other model parameters from the
data itself, instead of setting them by hand1. It will also allow us to estimate
error bounds and assess the uncertainty in the model. While the model of
the previous Chapter provided only point estimates of the unknown signals,
the Bayesian model proposed in this Chapter incorporates uncertainty in the
data and the model into the estimation algorithm.
1Other approaches include the L–curve method, the discrepancy principle, and cross-validation.
We propose a generative probabilistic model for blind inverse problems
under sparsity constraints based on a variational lower-bound maximization
approach. While our method is more generally applicable, we discuss in
detail the particular inverse problems of
• Separating signals and datasets into their constituent components. This
is the multivariate inverse problem of blind source separation, also
known as the ‘cocktail party’ problem.
• Extraction of weak biosignals in noise, with applications to biosignals
such as fMRI and MEG. Our main interest/operational application of
this work is especially in the exploratory analysis of neuroscientific data
by decomposing functional MRI images into components.
The method, however, is not tied to source separation and can be applied to
a variety of linear blind inverse problems under sparsity constraints.
For the inversion/solution of the above problems we propose the method
of Bayesian Iterative Thresholding (BIT). This algorithm exploits the spar-
sity of latent signals to perform blind source separation and signal reconstruc-
tion. The key to our approach is the development of a generative probabilistic
model that captures these properties using Bayesian priors, and which will
be solved using Bayesian inversion ideas. The algorithm will be derived from
lower-bounding the data likelihood, and will result in a variational EM-type
algorithm that alternates between the estimation of the sources and the mix-
ing matrix.
4.2 Bayesian Inversion
Computation of inversion in both the traditional and Bayesian approaches2
usually leads to solving a system of equations. For example, in the deconvo-
lution problem, discretizing the Fredholm integral equation of the first kind
(which is the mathematical model describing the physical process) leads to a
system of linear equations, which we want to invert in order to estimate the
object of interest. In this text we consider the linear measurement model
y = Ax + ε , (4.1)
where the ‘forward’ linear operator A maps the unknowns3, x, to the obser-
vations, y. The observation operator links the latent and observation spaces,
X and Y , respectively. In the Bayesian approach to inverse problems, we
model the signals x and y as random variables. Note that in the two
interpretations of the model, deterministic and probabilistic, the interpretation
of the term ε is not the same conceptually: in the Bayesian approach ε is
considered additive random noise, while in the classical method ε is an “er-
ror” term that represents some kind of distance between the data, y, and
the image under A of the “true”, fixed value of the unknown x, y = Ax. In
the Bayesian point of view, randomness reflects our incomplete information
about the system and the measurements.
The Bayesian way of encoding our knowledge or expectations about the
random quantities x and y is by assigning probability distributions to them.
In particular, we specify the prior model, p(x|θ;Θ), and the forward/noise
2 As a point of nomenclature, we note that some authors (see e.g. Carreira-Perpiñán [29]) note that the terms ‘inverse problem’ and ‘mapping inversion’ are not exactly equivalent in classical regularization. Under the Bayesian inference framework, however, the difference becomes less important, since both the latent variables and the parameters of the mapping take priors and are treated on an equal footing as part of the estimation.
3 The unknowns are usually referred to as “the model” in inverse problems theory and denoted by m.
model p(y|x, θ;Θ), where θ are the model parameters (including the linear
operator A) and Θ are the so-called hyperparameters, higher-level parame-
ters which control the distributions of other parameters. The prior model
captures our degree of belief about a particular value for the unknowns being
“correct” based only on our prior knowledge about the system. The forward
model represents the likelihood, that is the probability of observing y given
a realization of the random unknowns and parameters, p(y|x, θ;Θ). It cap-
tures our belief regarding the degree of accuracy with which the model can
explain the data (“If we knew the unknowns and parameters exactly, what
would the distribution of the data be?”). Given the measurement model, it
is a measure of the “data misfit”. Under Eq. (4.1), it is functionally equal to
pε(Ax− y).
The likelihood describes the measurement part of the model. The prior
contains all information and prior knowledge we have about the unknowns.
Bayesian statistics provide a simple yet extremely powerful way to combine
these two pieces of information together, the Bayes’ rule:
    p(x|y, θ; Θ) = p(x|θ; Θ) p(y|x, θ; Θ) / p(y|θ; Θ) .    (4.2)
In words, the posterior probability of the unknowns, after having measured
the data, is computed by “modulating” our prior model by the likelihood, and
scaled by the denominator. Bayes’ rule is essentially a recipe for updating our
knowledge about a system by observing some measurables, linked together
by a mathematical model. It can also be seen as the mathematical process
of “reversing” the conditional probability of the data given the parameters,
to give the conditional probability of the parameters given the data. The
quantity in the denominator, the evidence for the data, apart from being a
normalization factor ensuring that the left-hand side is a proper probability
density, reflects the fact that, by computing it, we have in effect “integrated
Figure 4.1: The Bayesian solution to an inverse problem is a posterior distri-bution. Left: the classical solution corresponds to Pr(x|d) = δ(x−x⋆); Right:Bayesian solution, Pr(x|d), where d is the data. The Bayesian solution takesinto account uncertainties in the model and the measurements.
out” the randomness in the unknowns, x:
    p(y|θ; Θ) = ∫ dx p(y|x, θ; Θ) p(x|θ; Θ) .    (4.3)
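As a small numerical illustration of Eqs. (4.2)–(4.3) (our own toy example, not part of the model treated in this thesis), consider the scalar linear-Gaussian case, where the posterior is available in closed form and can be checked against a direct grid evaluation of Bayes’ rule:

```python
import numpy as np

def posterior_closed_form(y, a=2.0, s0=1.0, sn=0.5):
    """Posterior mean and variance for x ~ N(0, s0^2), y = a*x + eps,
    eps ~ N(0, sn^2): a conjugate (Gaussian) special case of Eq. (4.2)."""
    var = 1.0 / (1.0 / s0**2 + a**2 / sn**2)
    mean = var * a * y / sn**2
    return mean, var

def posterior_on_grid(y, a=2.0, s0=1.0, sn=0.5, n=20001, lim=8.0):
    """Bayes' rule evaluated numerically: prior times likelihood, divided
    by the evidence, which 'integrates out' x as in Eq. (4.3)."""
    x = np.linspace(-lim, lim, n)
    dx = x[1] - x[0]
    prior = np.exp(-0.5 * (x / s0) ** 2)
    likelihood = np.exp(-0.5 * ((y - a * x) / sn) ** 2)
    evidence = np.sum(prior * likelihood) * dx   # Eq. (4.3), up to shared constants
    posterior = prior * likelihood / evidence    # Eq. (4.2)
    mean = np.sum(x * posterior) * dx
    return posterior, mean
```

Note that the Gaussian normalizing constants cancel between numerator and denominator, so they are omitted; the evidence computed on the grid does the normalization.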
We now summarize the main mathematical objects of classical and Bayesian
inversion in table 4.1.

    Approach             Given                                      Find
    Classical inversion  y, ID(x; y, θ) and IP(x, θ)                x⋆
    Bayesian inversion   y, P(y|x, θ, Θ), P(x|θ, Θ) and P(θ|Θ)      P(x|y)

Table 4.1: Summary of the main mathematical objects of classical and Bayesian inversion.

In this table, ID(x; y, θ) and IP(x, θ) in the classical approach are the
distortion measure and penalty, respectively, which may
also contain parameters, θ. To reiterate, the Bayesian solution to the sta-
tistical inverse problem is the posterior density. A schematic interpretation
of the posterior as a distribution with width that is a function of the noise
precision and the prior width is shown in Fig. 4.1. The exact functional form
of the posterior for the problem treated here will be derived later in this text.
4.3 Sparse Component Analysis Generative
Model
4.3.1 Graphical Models
We now introduce a graphical modelling formalism for generative models [7].
This formalism allows the modular specification of probabilistic models in
an intuitive, visual way. The probabilistic relationships among the various
elements of the model can be represented as a directed acyclic graph (DAG).
In particular, this graph represents the structural relationships among the
observed data, y (gray node), latent variables, x, and model parameters, θ,
as well as the controlling hyperparameters4 of their distributions, Θ.
Graphs such as the above can be endowed with probabilistic semantics
such that they reflect the probabilistic relationships in a generative model in a
one-to-one fashion. This is done as follows: each directed arc denotes a
probabilistic dependence relation between the nodes at its tail and head, named
‘parent’ and ‘child’, respectively. For example x −→ y means “y depends
on x”. This dependence relation is encoded in the conditional probability
p(y|x, · · · ), where ‘· · · ’ here denotes all other parent nodes of y. Combining
all these local relations, for every variable involved, results in a complete
graphical representation of a probabilistic model. We note here that there are
three fundamental structures/building blocks in such a DAG, which follow
from, and are consistent with, the above locality property: convergent, di-
vergent, and chain5. Of these three, we will mostly need the convergent and
the chain structures for the model discussed in this Chapter.
The DAG representation for a general latent variable model is shown in
4 From the point of view of hierarchical Bayesian modelling, the difference between a node in θ and one in Θ is that the former is assigned a prior probability density. Ultimately, it is the modeller’s choice which variables belong to which set.
5See ISPRS Notes and Appendix 2.1.
Figure 4.2: A directed acyclic graph (DAG) representation for generative latent variable models, showing the structural relationships among the observed variables, y, latent variables, x, model parameters, θ, and hyperparameters, Θ.
Fig. 4.2. We will denote the ‘inner graph’, comprising the nodes x, y, θ,
by G, and the augmented graph G ∪ Θ, containing the ‘root’6, Θ, as well,
by Ḡ. Such probabilistic graphical models can be learnt via the variational
4.3.2 Construction of a Hierarchical Graphical Model
for Blind Inverse Problems under Sparsity Con-
straints
In this section we show how the regularization approach described in the
previous Chapter can be reformulated within a Bayesian probabilistic frame-
work. Conceptually, there are four steps in the derivation of the Bayesian
iterative thresholding approach:
1. Interpret the energy functional E as the cost function of a probabilistic
model,
2. Introduce a latent random variable, S, for the unknown signals,
3. Lower bound the likelihood of data and model parameters,
6A root node is a node without any parents.
4. Derive a learning algorithm that maximizes this bound, iteratively in-
ferring the unknown signals and learning the model parameters.
Based on the energy functional, E , we introduce our Bayesian generative
model for blind inverse problems under sparsity constraints as a latent vari-
able model here. This requires the interpretation of constraints as prior
probability distributions. The corresponding learning algorithm will be de-
rived from a lower-bound maximization approach7, by bounding the data
likelihood from below. The estimation equations for each of the variables in
the model and the computation of the lower bound itself will be presented
in the next section.
The key to our approach is the specification of a hierarchical latent vari-
able model for blind inverse problems, such as blind source separation. We
start from the generic form of the standard linear data decomposition of a
dataset, {xd}, d = 1, . . . , D, into a set of latent signals, {sl}, l = 1, . . . , L,
which is written in matrix form as

    X ≈ AS ,
while the graph corresponding to this problem, in its generative form, is
shown in Fig. 4.3a and is denoted by G ≡ G0. Focusing only on the ‘complete
data’, (X, S), the initial graph is S −→ X. In order to make this model well-
posed, appropriate constraints should now be imposed. The initial component
analysis graphical model will undergo a series of structural transformations
(Buntine, [22]) that will produce the final hierarchical SCA model. The
changes to the standard component analysis model are best described using
the graphical formalism introduced in subsection 4.3.1. These are shown in
Fig. 4.3b–e and will be discussed next.
7 Olshausen also uses a lower bound formulation in [136], by introducing a variational posterior, Q(ai), over the coefficients of the sparse representation, ai (i.e. the corresponding “sources”). However, he does not consider the ai as latent variables and proposes a gradient optimization approach instead. Moreover, his model is not hierarchical, but
Figure 4.3: Structural transformations of the standard component analysis graphical model, G0, shown in (a). (We only focus on the essential subgraph that is relevant to the discussion here and, in a sense, forms the “spine” of the model.) If a relation between two variables is deterministic, we group the couple within a dotted box and denote the node at the tip of the corresponding edge, i.e. the ‘child’ node, with a double circle. That node can be deleted from the graph, giving the equivalent graphs in the second row. This transformation is denoted by a dotted red arrow. Adding nodes to a graph is denoted by a continuous-line red arrow: operation (1) adds a wavelet node, C; operation (3) adds a source noise covariance node, v.
In terms of generative modelling, the connection between smoothness
spaces and wavelets, introduced in subsection 3.2.2, suggests that one can
use a ‘parametric free-form’ solution (Sivia, [156]) based on wavelets in order
to synthesize the unknown signals8. In particular, the foundation of our
approach is the expansion of the latent signals in a wavelet basis, Φ, as
\[ S \doteq \bar{S} \stackrel{\text{def}}{=} \big( \Phi C^{\mathsf T} \big)^{\mathsf T} . \tag{4.4} \]
This step introduces the structural constraint of equation (3.10) in the graph
G. As in the deterministic method, this captures the constraints of spar-
sity and (generalized) smoothness. The wavelet coefficient matrix, C, is a
parameter in θ.
In terms of the graphical model of Fig. 4.3, the above substitution is
equivalent to replacing node S by the pair of nodes (C, S̄) via operation (1),
where the two nodes are deterministically related by equation (4.4). After
this expansion, the graph becomes⁹ θ −→ X, i.e. C −→ X, and is shown in
Fig. 4.3b or, equivalently, in Fig. 4.3c.
At this stage, the optimization functional related to the graph of Fig. 4.3b,
when combined with the corresponding penalty term of Eq. (3.9) on C and
the growth-limiting term on A, corresponds exactly to the energy function
E introduced in equation (3.11).
Bayesian Interpretation
Under a probabilistic framework, the functional E can be equivalently in-
terpreted as a maximum a-posteriori (MAP) cost function. Following a line
of reasoning similar to that of Harpur and Prager [80] and Olshausen [136],
two-layer, and he uses a maximum likelihood framework.
⁸ This is also reminiscent of the ‘functional data analysis’ idea [144].
⁹ Again note that θ = {C, A}, but we mainly emphasize C here.
from a generative point of view one can interpret the error term, EX, in the
above energy formulation as an implicit observation noise.
Data model and likelihood. In particular, the data is modelled as a
linear mixture of underlying sources with additive random sensor noise, E_X,
via the observation equation
\[ X = A S + E_X . \tag{4.5} \]
The sensor noise is assumed to be a matrix-variate normal (MVN) with
zero mean and row covariance matrix ΣX: EX ∼ N (0, (ΣX, IN)). Based
on the above, the likelihood for parameters and sources under the model
(i.e. the probability of the data conditioned on the sources, given the model
parameters) is
\[ P(X \mid S, A, \Sigma_X) = \frac{1}{\det(2\pi \Sigma_X)^{N/2}} \exp\!\Big\{ -\tfrac{1}{2} \operatorname{tr}\!\big[ (X - AS)^{\mathsf T} \Sigma_X^{-1} (X - AS) \big] \Big\} . \tag{4.6} \]
The ‘fidelity’ term of the energy-based method, ‖X − AS‖², then becomes
the negative log-likelihood function of the observations given S and A, up to
a prefactor 1/(2σ_X²), where Σ_X = σ_X² I_D is the noise covariance. The
correspondence is
\[ P(X \mid S, A, \Sigma_X) = \frac{1}{Z_X}\, e^{-E_X} , \]
where
\[ E_X = \frac{1}{2\sigma_X^2} \| X - AS \|^2 \qquad \text{and} \qquad Z_X = \big( 2\pi\sigma_X^2 \big)^{ND/2} . \]
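As a quick numerical check of this correspondence (a sketch with toy sizes D, L, N and an isotropic noise level of our choosing; not part of the thesis), the column-wise Gaussian negative log-likelihood of Eq. (4.6) can be compared against E_X + log Z_X:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 3, 2, 50                      # sensors, sources, samples (toy sizes)
A = rng.standard_normal((D, L))
S = rng.standard_normal((L, N))
sigma2 = 0.1                            # Sigma_X = sigma2 * I_D (isotropic)
X = A @ S + np.sqrt(sigma2) * rng.standard_normal((D, N))

# Negative log-likelihood from the general form of Eq. (4.6), column by column
Sigma = sigma2 * np.eye(D)
Sigma_inv = np.linalg.inv(Sigma)
R = X - A @ S                           # residual matrix
neg_ll = 0.0
for n in range(N):
    r = R[:, n]
    neg_ll += 0.5 * r @ Sigma_inv @ r + 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))

# Energy-based correspondence: -log P = E_X + log Z_X
E_X = np.sum(R**2) / (2 * sigma2)
log_Z_X = (N * D / 2) * np.log(2 * np.pi * sigma2)
assert np.isclose(neg_ll, E_X + log_Z_X)
```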
Priors. Under the Bayesian framework, both parameters C and A will be
assigned a prior distribution. In anticipation of the generative formulation,
the penalty terms are written as log–priors in the MAP objective. Using the
expansion of equation (4.4), the optimization problem can then be written
as [176]
\[ \max_{(C,A)} \; \mathcal{J}(C, A) , \tag{4.7} \]
with
\[ \mathcal{J} \stackrel{\text{def}}{=} \log P(C, A \mid X) = -\frac{1}{2\sigma_X^2} \Big\| X - A \big( \Phi C^{\mathsf T} \big)^{\mathsf T} \Big\|^2 + \sum_{l,\lambda} \log\!\Big( e^{-\beta_l S_D(c_{l,\lambda})} \Big) + \log\!\Big( e^{-\frac{1}{2} a \|A\|^2} \Big) + \text{const.} , \]
where S_D(c_{l,λ}) is the sparsity function on c_{l,λ} from equation (3.9), with
c_{l,λ} = (Φ⁻¹s_l)_λ, and the third term is a quadratic prior on the mixing matrix
as per equation (3.11). In the second and third terms, we made explicit the
as per equation (3.11). In the second and third terms, we made explicit the
dependence on the width (‘scale’) hyperparameters, βl and a, respectively.
The last term is a constant coming from log–normalizing factors that do not
depend on either C or A, and can be dropped from the equation. We finally
note that the ‘hyperparameters’ βl and a of the source and mixing models are
implicitly incorporated in the Lagrange multipliers λC and λA if we define
the correspondences¹⁰ 2σ_X²β_l → λ_C and σ_X²a → λ_A. As discussed in
subsection 3.2.2, these hyperparameters have a profound impact on the value of
the optimization functional. Therefore, here we propose a more principled
method in order to estimate them directly from the data.
The last form of our optimization functional formally coincides with the
sparse optimization problem of Zibulevsky and Pearlmutter [176], Eq. (3.9),
p. 869. Zibulevsky and Pearlmutter formulated the same probabilistic model
using an analogous reasoning. However, the optimization there was per-
formed directly on the J (C,A)–surface, under the constraint ‖al‖ ≤ 1,
l = 1, . . . , L, using gradient descent. Our solution will be different, lead-
ing to a Bayesian extension of the iterative thresholding algorithm for linear
¹⁰ The quantity σ_X/√β is often called the ‘Tikhonov factor’.
inverse problems of Daubechies et al. [45], as will be explained next.
Learning. In MAP estimation one wants to maximize the joint probability
of data and parameters¹¹,
\[ f(\theta) \stackrel{\text{def}}{=} p(X, \theta) . \tag{4.8} \]
One could now optimize this objective directly with respect to the vector
of parameters, θ, via a gradient-based algorithm, for example, to get the
optimum value
\[ \theta_{\text{MAP}} = \arg\max_{\theta} \big\{ \log p(X \mid \theta) + \log p(\theta) \big\} . \]
Since C is a parameter, one can obtain an estimate of the wavelet coefficients of
the sources this way. The sources can then be estimated via Eq. (4.4). This
is the approach followed by Zibulevsky12 [176]. Here, however, we adopt a
Bayesian approach by marginalizing out the unknown source signals. That is,
we consider the unknown signals to be random latent variables. This requires
the specification of a new, generative model for SCA, discussed next.
Hierarchical SCA latent variable model
A ‘hidden’ variable, h (where h is the source node, S, in this case), can
be introduced in the graph G, as shown in operation (3) of Fig. 4.3. The
deterministic node, S̄, introduced previously, becomes a parent of S and,
as will be shown later¹³, in the Bayesian framework, represents a kind of
mean “message” sent to S by S̄. The node S̄ can be deleted from the final
graph, Fig. 4.3e, but can be retained in the equations, if needed. (The latent
¹¹ The quantity f is equivalent to L for the data likelihood. It is going to be replaced by the joint probability over data, latent variables, and parameters, {X, h, θ}, next.
¹² Zibulevsky et al. [177] also propose the use of a uniform prior on the mixing matrix, A, so the corresponding log-prior can be dropped.
¹³ See equations (4.24) and (4.25).
variable S also depends on a covariance node, v, which is also introduced in
G as a parent of S. The reasoning behind its inclusion will be discussed in
subsection 4.3.3.)
The random variable h is assigned a probability density, p(h). The joint
probability of data and parameters from equation (4.8) can now be rewritten
as
\[ f = \int_{h} p(\underbrace{X, h, \theta}_{G})\, dh , \]
integrating out the latent variables (Minka, [127]). It is worth noting that
here we have a density in the form of a mixture of models, p(X, θ|h), weighted
by the probability of h. This way, the uncertainty in h is “folded into the
estimation process”. This is a natural formulation for component analysis
models, where the data are “explained” by a set of hidden variables that
form an “abstraction” of the data according to some desired criterion, such
as minimum variance, dependence, or sparsity; cf. the graph of Fig. 4.3a.
This gives the likelihood for the observations, L(θ), in the form
\[ p(X \mid \theta) = \int dS\; p(X \mid S, \theta)\, p(S \mid \theta) , \]
where the likelihood for parameters and sources is given in Eq. (4.6) for the
SCA model. The joint distribution of data and model parameters is then
\[ P(X, \theta) = \int dS\; P(X, S, \theta) = \int dS\; P(S \mid \theta)\, P(X \mid S, \theta)\, P(\theta) , \]
where, again, we have integrated over all possible values of the latent variable
S. The factorization of the joint probability of {X, S, θ} inside the last
integral follows from the Markov properties of the graphical model, G, shown
in Fig. 4.2.
The final DAG for the latent-variable sparse component analysis model is
shown in Fig. 4.3d, or, equivalently, in Fig. 4.3e. This figure shows wavelet-
based SCA as a hierarchical graphical model. The “backbone” of this model
is the top-down Markov chain C −→ S −→ X, capturing the generative pro-
cess from the wavelet coefficients to the sources to the observations. In the
Bayesian graphical modelling framework, the whole subgraph that ends in a
node is actually the prior for that node. For the SCA model this means that
we regard the unknown source signals, S, as being generated by a wavelet
synthesis procedure, whose building blocks are elements ϕλ ∈ Φ, and this
procedure is part of its prior.
We can now summarize the generative process for the data as follows.
A set of basis functions, Φ = {φ_λ}_{λ=1}^{Λ}, is used to synthesize L unobserved
source signals, {s_l}_{l=1}^{L}, with corresponding wavelet coefficients {c_l}_{l=1}^{L}, where
c_l = (c_{l,1}, . . . , c_{l,Λ}), as
\[ S \approx \big( \Phi C^{\mathsf T} \big)^{\mathsf T} . \]
These sources in turn are mixed through a mixing matrix A = [A_{dl}] to
generate the observation signals {x_d}_{d=1}^{D}, with x_d = (x_{d,1}, . . . , x_{d,N}):
\[ X \approx A S . \]
In signal space, the mappings among the various signals are¹⁴
\[ \Phi \xrightarrow{\;C\;} S \xrightarrow{\;A\;} X . \]
Note that, due to the presence of noise, these are stochastic dependencies (At-
tias, [6], [7]); see next. The collection θ = {C, A} comprises the parameter
set of our model and will be learned from the data. We call the corresponding
model dictionary-based sparse component analysis (Db-SCA).
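A minimal sketch of this generative process (toy sizes, and an orthonormal random matrix standing in for the wavelet dictionary Φ; all names and values here are illustrative, not the thesis's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Lam, L, D = 16, 16, 2, 3   # signal length, atoms, sources, sensors (toy)

# Orthonormal stand-in for the wavelet synthesis matrix Phi (columns = atoms)
Phi, _ = np.linalg.qr(rng.standard_normal((N, Lam)))

# Sparse coefficient matrix C (L x Lam): only a few non-zero entries
C = np.zeros((L, Lam))
C[0, 1], C[1, 4] = 3.0, -2.0

v, sigma2 = 0.01, 0.05        # source ('system') and sensor noise variances

S = (Phi @ C.T).T + np.sqrt(v) * rng.standard_normal((L, N))   # sources
A = rng.standard_normal((D, L))                                # mixing matrix
X = A @ S + np.sqrt(sigma2) * rng.standard_normal((D, N))      # observations
```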
¹⁴ All signals are in ℝᴺ, where N is the number of data points, but the wavelet bases φ_λ have compact support.
The DAG, however, only shows the structural relationships among vari-
ables in the model. The model specification will be completed and made
more precise in subsection 4.3.3, by the introduction of specific probability
distributions for the various random variables in the model. Probabilistic
(“soft”) constraints are encoded in the form of prior distributions on the
various random variables in the model, capturing our domain knowledge or
modelling decisions. A central one for this model is that we aim at modelling
and analyzing signals that are sparse in some appropriate basis, or frame.
This will be captured by using an appropriate form for the prior on the
expansion coefficients of the latent signals, p(cl,λ|αl, βl), in subsection 4.3.3.
4.3.3 Latent Signal Model: Wavelet-based Functional
Prior
A signal model in the probabilistic framework is composed of a prior density,
combined with any functional relationships among the variables. A key point
of our approach is that the sources are regarded as latent random variables,
depending on the wavelet bases in a non-deterministic way.
In the probabilistic model we will let the “states”, S, (adopting Field’s ter-
minology) be noisy by introducing a state noise variable, ES, on the sources
and model the sources as
\[ S = \big( \Phi C^{\mathsf T} \big)^{\mathsf T} + E_S , \tag{4.9} \]
where Φ is an N×Λ matrix that stores the basis functions, φ_λ, in its columns.
The relation between C and S, which was deterministic in Eq. (4.4) and in
Zibulevsky et al. [176], becomes stochastic here. This formulation serves
several purposes:
• It models ‘system’ noise that may be present in inherently noisy source
signals, such as biosignals and brain processes15. In particular it may
be valuable in the separation/extraction of ‘activations’ investigated by
MEG, EEG or fMRI, for example.
• It models approximation errors and residuals from the wavelet synthesis
model; this is especially useful for the case of transforms that do not
offer perfect reconstruction or in truncated wavelet transforms.
• The use of the functional prior with Gaussian source noise simplifies
inference as well. In particular, the use of a rich dictionary of functions,
such as a wavelet dictionary, enables us to transfer the complexity of
modelling the sources in the upper layer of the network of Fig. 3.10.
Moreover, it enables us to compute the data likelihood analytically:
the node S decouples X from C, as will be explained next. The sources
conditioned on their parents are Gaussian, therefore it is easy to com-
pute the posterior sufficient statistics of the sources, 〈S〉Q and⟨SST
⟩
Q,
needed for learning the model parameters (see Eqns (4.24) and (4.26)).
Thus, it enables us to use the ‘complete data’ concept in the VEM
algorithm, making inference tractable.
The practical result of including a state noise model in the prior is that the
algorithm infers a set of denoised (filtered) sources, by attributing a portion
of system noise to the corresponding variable.
By judiciously choosing the properties of the wavelet family/basis (those
that best match our prior notion of importance), certain features of the
signals can be emphasized. We note that there is no
¹⁵ There are many sources of ‘noise’ in such signals, such as the inherent stochasticity of neural sources and the fact that what we actually measure are some macroscopic physical quantities and not the actual neural activity. This, combined with the fact that neural signals are very weak compared with other sources, makes signal separation in neural studies quite a hard task.
universally “best” dictionary as such: wavelet families can be tailored to suit
specific needs and trade-offs, such as compact support versus smoothness16.
Nevertheless, in many cases, families have been found to give very good
performance for a variety of problems. In any case, our framework is general
and can accommodate any family.
The latent signal model for the l–th source of the functionally-constrained
SCA model can be summarized as:
\[
l: \quad
\begin{cases}
\; s_l = \Phi c_l + \varepsilon_{s_l}, \qquad \Phi = \big[ \varphi_1, \ldots, \varphi_\lambda, \ldots, \varphi_\Lambda \big], \\
\; P(c_l \mid \alpha_l, \beta_l), \\
\; P(\varepsilon_{s_l} \mid v_l), \\
\; (\alpha_l, \beta_l), \; v_l,
\end{cases}
\tag{4.10}
\]
where the corresponding rows in equation (4.10) are: the functional model
and the wavelet synthesis matrix, the sparse prior on the expansion coef-
ficients, the ‘state’ noise model, and the hyperparameters controlling the
sparsity prior and state noise.
The source noise, ES, is assumed to be Gaussian with diagonal covariance
matrix V = diag(v_l). The system noise model is therefore
\[ l: \quad P(\varepsilon_{s_l} \mid v_l) = \mathcal{N}(0,\, v_l I_N) . \tag{4.11} \]
Finally, the density P (cl|αl, βl) captures the sparsity constraint, and will
be elaborated next.
¹⁶ Mallat gives a vivid analogy between the use of wavelet bases/dictionaries and the way humans filter out irrelevant information as “noise” in noisy environments [121]. In a sense, this is exactly what source separation is about.
Modelling Sparsity: Heavy-tailed Distributions
The use of wavelet expansions results in parsimonious representations of a
broad class of signals. Sparsity is reflected in the characteristic shape of
the empirical histograms of wavelet coefficients, which exhibit distributions
highly peaked at zero. Indeed, many types of real-world signals exhibit this
behavior—see subsection 3.2.2 for an analysis of which types of signals are
expected to have a very sparse representation in wavelet bases. This is also
the key for the successful use of wavelets in signal compression. One can ar-
gue that the underlying physical reason is indeed ‘shortest description’ [175],
[96]. In addition to theoretical analyses (see e.g. Donoho [50]), numerous
experimental studies have provided additional evidence for sparsity (Simoncelli,
[155]; Moulin and Liu, [129]; Jalobeanu, [96]).
In the SCA model, the sparsity constraint for the sources is encoded in the
model of their wavelet coefficients, p(cl,λ|θcl), in combination with our choice
of wavelet basis, Φ. In the probabilistic framework, the weighted ℓp prior
used in the energy functional, E , corresponds to the generalized exponential
distribution,
\[ P(c_{l,\lambda} \mid \alpha_l, \beta_l) = \frac{\alpha_l\, \beta_l^{1/\alpha_l}}{2\, \Gamma(1/\alpha_l)} \exp\!\big( -\beta_l\, |c_{l,\lambda}|^{\alpha_l} \big) , \qquad \lambda \in I_{c_l} . \tag{4.12} \]
The use of a supergaussian prior, which puts the bulk of the probability
mass around zero, is a means to control the complexity of the representation.
By also letting the controlling hyperparameters, (αl,βl), be estimated from
the data, the generalized exponential model has proven a versatile tool for
modelling a wide class of signals.
Figure 4.5 demonstrates the excellent fit of the generalized exponential
density to the empirical distribution17 of a wavelet coefficient sequence of a
¹⁷ The bin-width for the histogram can be estimated using e.g. the Freedman–
Figure 4.4: An image exhibiting smooth areas and structural features. Satellite image of Amiens, France, downscaled to 256 × 256 pixels. © IGN.
natural image, a satellite image of Amiens, France (downscaled to 1024×1024
pixels), shown in Fig. 4.4.
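As an aside, the generalized exponential density of Eq. (4.12) and the Freedman–Diaconis bin-width rule of footnote 17 are both straightforward to implement; a sketch (the function names are ours, and the hyperparameter values are those reported for Fig. 4.5):

```python
import numpy as np
from math import gamma

def ge_pdf(c, alpha, beta):
    """Generalized exponential density of Eq. (4.12)."""
    return (alpha * beta**(1.0 / alpha) / (2.0 * gamma(1.0 / alpha))
            * np.exp(-beta * np.abs(c)**alpha))

def fd_bin_width(x):
    """Freedman-Diaconis rule: bin size = 2 * iqr(x) * N^(-1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * len(x)**(-1.0 / 3.0)

# Sanity check: with the hyperparameters of Fig. 4.5 the density integrates to one
alpha, beta = 0.753, 13.870
dc = 1e-4
c = np.arange(-10.0, 10.0, dc) + dc / 2          # midpoint grid
mass = ge_pdf(c, alpha, beta).sum() * dc
assert abs(mass - 1.0) < 1e-2
```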
4.4 Inference and Learning by Variational Lower
Bound Maximization
In the iterative thresholding algorithm of Daubechies et al. [45], a sequence
of surrogate functionals was constructed such that the problem of optimizing
the original functional, I[s], is replaced by the easier problem of iteratively
optimizing the sequence {I[s; zⁱ]}ᵢ of simpler functionals, where zⁱ is the
minimizer of the surrogate functional at the previous iteration,
zⁱ ← s⋆ = argmin_s I[s; zⁱ⁻¹].
Our method derives an iterative thresholding algorithm for blind inverse
problems in the probabilistic framework. This was driven by the need to
take uncertainty into account and to estimate critical quantities and param-
eters of the model, such as the value of the threshold and the observation
Diaconis formula [68], bin size = 2 · iqr(x) · N^{−1/3}, where iqr(·) is the interquartile range of the data sample, x = {x_n}, and N is the number of data points in x. Note, however, that the histogram is used only for display purposes.
[Figure 4.5 appears here: empirical histogram of c_k overlaid with the fitted density GE(c; µ = 0.0, α⋆ = 0.75319, β⋆ = 13.870).]
Figure 4.5: Heavy-tailed distributions for wavelet coefficients. Generalized exponential densities with learned hyperparameters can have an excellent fit to experimental data for a wide range of problems. This graph shows one such fit (with α = 0.753, β = 13.870) to the sparse wavelet decomposition, {c_λ}, of a natural image, at a particular scale (the satellite image of Amiens of Fig. 4.4, at scale j = 7, using the ‘Symmlet-8’ basis).
noise variance, from the data. An additional goal was to adapt to the spar-
sity characteristics of the unknown signals, by learning the strength of the
penalties (shape and width hyperparameters of the Bayesian priors). Now,
in the probabilistic framework, an analogous variational approach can be
used for intractable or computationally demanding models. The main idea
is to approximate/replace the data likelihood by a simpler form that is more
tractable. As shown in section 4.4.1, this new functional is a lower bound
to the likelihood, and plays a role analogous to the surrogate functionals
of DDD. As discussed in subsection 4.2, the Bayesian answer to an inverse
problem question is a posterior probability density. This object is not always
easily computable, though. This section derives a tractable approximation
to the posterior that, at the same time, minimizes the distance between the
data likelihood and its lower bound. We envisage applying our model to
large datasets, such as biosignals. The problem becomes variational due to
the probabilistic assumptions about the structure of interactions of the data
points in the variational posterior. This will be discussed in more detail in
subsection 4.4.1.
4.4.1 Variational Lower Bounding
Consider a model with hidden variables x, observed variables y, and model
parameters θ, parameterized by a set of hyperparameters Θ, whose general
structure is as in Fig. 4.2. Our aim is to maximize the function f(θ) =
p(y, θ), i.e. the joint distribution of data and parameters.
Now, for any valid probability distribution, Q(x), on the latent variables,
x, the functional
\[ \mathcal{F}\big[ \theta, Q(x) \big] = \int dx\; Q(x) \log p(x, y, \theta) - \int dx\; Q(x) \log Q(x) \tag{4.13} \]
forms a lower bound to f(θ). To see this, we take the joint distribution again
and write it as
\[ \log p(y, \theta) = \log\!\left( \int dx\; p(x, y, \theta) \right) = \log\!\left( \int dx\; Q(x)\, \frac{p(x, y, \theta)}{Q(x)} \right) . \]
To obtain the lower bound, we apply Jensen’s inequality18 and move the
logarithm inside the integral:
\[ \log p(y, \theta) \;\ge\; \int dx\; Q(x) \log\!\left( \frac{p(x, y, \theta)}{Q(x)} \right) = \mathcal{F}(\theta, Q) . \]
The difference between the lower bound, F , and the joint probability, f , is
the Kullback-Leibler divergence between the trial distribution Q(x) and the
posterior P (x|y, θ), KL(Q||P ) ≥ 0, which is a measure of the “distance”
¹⁸ Jensen’s inequality is φ(E[X]) ≤ E[φ(X)], where φ is a convex function and X is an integrable random variable.
between those two distributions:
\[ \mathrm{KL}\big[ Q(x) \,\|\, P(x \mid y, \theta) \big] = \int dx\; Q(x) \log\!\left( \frac{Q(x)}{P(x \mid y, \theta)} \right) . \]
Indeed, by writing F(θ, Q) as
\[ \begin{aligned} \mathcal{F}(\theta, Q) &= \int dx\; Q(x) \log\!\left( \frac{p(x, y, \theta)}{Q(x)} \right) = \int dx\; Q(x) \log\!\left( \frac{p(x \mid y, \theta)\, p(y, \theta)}{Q(x)} \right) \\ &= -\int dx\; Q(x) \log\!\left( \frac{Q(x)}{p(x \mid y, \theta)} \right) + \log p(y, \theta) \\ &= -\mathrm{KL}(Q \,\|\, P) + \log p(y, \theta) , \end{aligned} \]
we finally get
\[ \log p(y, \theta) = \mathrm{KL}\big[ Q(x) \,\|\, P(x \mid y, \theta) \big] + \mathcal{F}(\theta, Q) \;\ge\; \mathcal{F}(\theta, Q) . \]
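This decomposition can be verified numerically on a toy model with a three-state latent variable (an illustration of the identity only, unrelated to the SCA model; the joint probabilities are made up):

```python
import numpy as np

# Toy model: discrete latent x in {0,1,2}; p_joint[i] = p(x=i, y_obs, theta)
p_joint = np.array([0.2, 0.05, 0.15])
log_p = np.log(p_joint.sum())              # log p(y, theta)

Q = np.array([0.5, 0.2, 0.3])              # any valid trial distribution Q(x)
posterior = p_joint / p_joint.sum()        # P(x | y, theta)

F = np.sum(Q * np.log(p_joint)) - np.sum(Q * np.log(Q))   # lower bound, Eq. (4.13)
KL = np.sum(Q * np.log(Q / posterior))                    # KL(Q || P)

assert np.isclose(log_p, F + KL)           # the decomposition holds exactly
assert F <= log_p                          # F is a lower bound (KL >= 0)

# The bound is tight when Q equals the true posterior, Eq. (4.14):
F_opt = np.sum(posterior * np.log(p_joint / posterior))
assert np.isclose(F_opt, log_p)
```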
We want to find the distribution Q(x) that makes this bound as tight
as possible. By functionally differentiating F[θ, Q(x)] with respect to Q(x),
under the normalization constraint ∫ dx Q(x) = 1,
\[ \frac{\partial}{\partial Q(x)} \left\{ \mathcal{F}\big[\theta, Q(x)\big] + \lambda \left( \int dx\; Q(x) - 1 \right) \right\} = 0 , \]
we find that the optimal variational distribution is obtained if
\[ Q(x) = P(x \mid y, \theta) , \tag{4.14} \]
i.e. if it is set to the true posterior distribution over the latent variables. The
Kullback-Leibler divergence KL(Q||P ) is also zero when Q = P and positive
otherwise. Therefore the functional F[θ, Q(x)] is a strict lower bound to
log p(y, θ). Thus, maximizing F[θ, Q(x)] with respect to the variational
distribution Q(x) is equivalent to minimizing the Kullback–Leibler divergence
KL[Q(x)‖P(x|y, θ)] between Q and the true posterior.
Note that we only derived the general form of the lower bound, F , here,
in order to make the connection with the variational approach of DDD (and
introduce the latent variable S). The derivation of the NFE for the SCA
model will be done in section 4.4.3.
The variational algorithm for MAP estimation therefore uses the lower
bound F , also called the (negative) ‘variational free energy’ (NFE), as opti-
mization functional,
\[ \mathcal{F}\big[ Q_x(x), y, \theta, \Theta \big] = \Big\langle \log P( \underbrace{x, y, \theta}_{G} \mid \Theta) \Big\rangle_{Q} + H\big[ Q_x(x) \big] , \tag{4.15} \]
where the density P is parameterized (conditioned) in terms of the hyperpa-
rameters, Θ. In the above equation, the first term is the average ‘variational
energy’ and the second term is the entropy of the variational posterior, Q.
The average, 〈·〉Q, is over the latent variables, x, with respect to their pos-
terior, Q. In the left-hand side, we have made explicit the dependence on all
relevant variables.
The first term, in turn, decomposes into two terms: the (negative) ‘energy’
of the global configuration of the system of observed and latent variables,
x,y, comprising the ‘complete data’, which is parameterized by θ and Θ
and defined to be
\[ E_{\theta,\Theta}(x, y) \stackrel{\text{def}}{=} -\log P(x, y \mid \theta, \Theta) , \]
and the log-prior of the parameters, log P(θ|Θ).
Note that unlike the optimization function of the EM algorithm for maximum
likelihood estimation (the so-called ‘Q-function’), the variational MAP
F-function depends on all the elements, {x, y, θ}, of the graph (model) G
(see Fig. 4.2), allowing the incorporation of priors on the parameters. This
is essential for our model, because we want to impose the sparsity constraint
on the representation of the latent signals in a signal dictionary.
The negative free energy, F , serves two purposes: (a) It allows us to
derive the estimation equations for the unknown quantities in the model,
and (b) It allows us to monitor the convergence of the algorithm.
Due to the Markovian properties of the graphical model G, which trans-
late to additivity in log-space, the NFE is decomposed into terms involving
individual contributions from each element of G. The expression of F for
SCA is computed in Sect. 4.4.3.
Inference and Learning
Variational lower-bounding of p(y, θ) by F[θ, Q(x)] leads to an estimation
algorithm that iteratively finds a good bound, i.e. minimizes the
Kullback–Leibler divergence KL[Q(x)‖P(x|y, θ)], and subsequently optimizes
this bound with respect to the model parameters.
Our algorithm is structured around two nested phases: an inner phase
involving the estimation of the latent variables and model parameters and an
outer phase of the estimation of hyperparameters. These are repeated for a
number of ‘epochs’19 until convergence. After convergence, they are followed
by a source reconstruction stage.
The computations of the inner phase can be summed up as: “Integrate
over the latent variables and optimize over the parameters”. This is a varia-
tional version of the Expectation-Maximization algorithm (Neal and Hinton,
¹⁹ An epoch is defined as a batch of iterations during which the hyperparameters are held constant and only the parameters and latent variables are updated. After each epoch, a hyperparameter update is performed and the internal loop is run again until (local) convergence; see Alg. 3.
[132]; Minka, [127]). We estimate the model parameters, θ, (learning) in the
variational M-step as
\[ \theta = \arg\max_{\theta} \int dx\; Q_x(x) \big[ \log P_{\Theta}(x, y, \theta) \big] . \tag{4.16} \]
The variational posterior over latent variables (inference) is obtained in the
variational E–step by the functional optimization:
\[ \frac{\partial}{\partial Q_x(x)} \mathcal{F}\big[ Q_x(x); y, \theta, \Theta \big] = 0 \;\;\Rightarrow\;\; Q_x(x) = P\big( x \mid y, \theta, \Theta \big) , \tag{4.17} \]
using the results obtained at the previous iteration for the rest of the model.
In practice, this step is equivalent to computing the sufficient statistics of
the distribution Qx(x).
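The "integrate over the latent variables, optimize over the parameters" alternation can be illustrated on the simplest possible latent-variable problem, a Gaussian mean with missing observations (a toy stand-in for the E/M alternation, not the SCA updates themselves):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy EM: estimate the mean mu of N(mu, 1) when some observations are latent.
y = rng.normal(5.0, 1.0, size=100)
observed = np.ones(100, dtype=bool)
observed[::4] = False                       # every 4th value treated as missing

mu = 0.0
for _ in range(100):
    # E-step: the posterior mean of each missing value is the current mu
    filled = np.where(observed, y, mu)
    # M-step: maximize the expected complete-data log-likelihood over mu
    mu_new = filled.mean()
    converged = abs(mu_new - mu) < 1e-12
    mu = mu_new
    if converged:
        break

# The EM fixed point is the mean of the observed entries
assert np.isclose(mu, y[observed].mean())
```

Each iteration increases the lower bound on the observed-data likelihood, exactly as the variational F does for the SCA model.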
The assignment in equation (4.17) leads to the variational distribution
that makes the lower bound tight. However, using the exact posterior as Q
does not buy us much if it leads to intractable computations. The bound
in equation (4.13) is valid for any variational distribution Qx(x). To be of
practical use, we must choose a distribution that will allow us to evaluate the
lower bound (4.13) and the estimation equations (4.16) and (4.17). There is
a trade-off to be made, therefore, between tractability and quality of approx-
imation. Subsection 4.4.1 discusses the issue of inference in large datasets.
For the SCA model in particular, the node X comprises the observations
and the node S the latent variables. The set of model parameters, θ, is
{C, A}, and the set of hyperparameters, Θ, is {(α, β), v, a, σ_X²}. The union
of these node-sets gives the graphical model G. Each of these variables will
be estimated in subsection 4.4.2. The free energy of the system then becomes
\[ \mathcal{F}\big[ Q_x(x), y, \theta, \Theta \big] = \mathcal{F}\big[ Q(S), X;\; C, A;\; (\alpha, \beta), v, a, \sigma_X^2 \big] . \tag{4.18} \]
Its actual expression will be derived in subsection 4.4.3.
Large datasets: Structured variational mean-field approximation
The iterative thresholding framework, as described in Daubechies et al. [45],
is completely general within the setting of sparse linear inverse problems: it
only requires that objects s ∈ S are mapped to objects x ∈ X via a bounded
operator, A, under the sparsity constraint. These assumptions reflect a wide
class of real-world signals.
BSS is inherently a multidimensional problem. Data are usually regarded
as D observations from D corresponding sensors, {x_d}_{d=1}^{D}. However, there is
no restriction in thinking of them as one observation, X = (x_1, . . . , x_d, . . . , x_D),
from a single “sensor”. Analogously, one can consider the L ‘spatial maps’
(sources), {s_l(v_n)}_{l=1}^{L}, where v_n denotes a voxel here, as a collection of L
separate functions in ℝᴺ (in the discrete case), or as one function S(v_n) in
ℝ^{L×N}. Depending on our point of view, therefore, the problem can either be
seen as a single, 1×1 mapping S ↦ X, or as L×D smaller ones. Recalling the
observation equation in matrix form, X = AS + E_X, the first view implies
that we can search for the spatial components S = S(v_n) as a single object,
the minimiser, S⋆, of the functional J.
Based on the probabilistic semantics of the SCA graphical model (the ‘ex-
plaining away’ phenomenon: see figure 4.3a), the source posterior is coupled
over data points, vn. (See Fig. on p. 68 of Saul’s paper and pp. 47–49 of
Beal.) This creates the problem, however, that we must compute statistics of
(D ×N)–dimensional random variables (the sources) and estimate and ma-
nipulate (DN ×DN)–dimensional matrices (the observation operator). For
many real-world problems, such as the analysis of functional MRI (where
D ∼ O(10²) and N ∼ O(10⁵)), this is still computationally very demanding.
A central assumption that we will make here, therefore, is that the posterior
over S factorizes over data points n (but not over latent dimensions²⁰, l):
\[ Q(S) = \prod_{n} Q(s_n) . \tag{4.19} \]
This is akin to a mean-field approximation in statistical physics, where the
probability over ‘sites’ in a system of “particles” is approximated by a factor-
ized distribution by ignoring fluctuations in local fields and replacing them
with their mean value. The equivalent “sites” here are the source data points,
sn. This assumption will allow us to compute covariance matrices of size
D×D, instead of DN×DN . This formulation leads to a variational learning
algorithm, as discussed in subsection 4.4.1. The estimation equations for all
variables in the model are presented next.
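The computational saving of the factorization can be made concrete by counting covariance entries (using sizes of fMRI order, as in the text; the factor-of-N saving is exact):

```python
# Covariance storage for the coupled posterior over all voxels versus the
# factorized posterior of Eq. (4.19); L latent dimensions, N data points.
L, N = 100, 100_000                      # fMRI-order sizes (illustrative)
full_entries = (L * N) ** 2              # one (L*N) x (L*N) covariance matrix
mf_entries = N * L * L                   # N blocks, each of size L x L
assert full_entries // mf_entries == N   # mean field saves a factor of N

# In float64 bytes: ~800 TB for the coupled posterior versus ~8 GB factorized
full_bytes, mf_bytes = 8 * full_entries, 8 * mf_entries
assert full_bytes == 8 * 10**14 and mf_bytes == 8 * 10**9
```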
4.4.2 Estimation Equations
In this section we show that the optimum reconstruction algorithm takes
the form of generalized iterative thresholding. Inference and learning in
the SCA model involves four estimation problems: estimation of the sources
themselves and estimation of the source model, mixing model, and noise
model parameters. The variables in the SCA model may be also partitioned
into groups related to the above estimation problems:
• The ‘source’ model, G_S = {S, v} ∪ G_W, where G_W = {Φ, C, (α, β)} is
the wavelet submodel,

• The mixing model, G_M = {A, a},

• The sensor noise model, G_N = {Σ_X}, and

• The data, G_D = {X}.

²⁰ A fully factorized posterior, also called a ‘naïve mean-field’ approximation, although computationally easier, misses the correlations in the source posterior, often giving subpar results.
These subgraphs are such that GS∪GM∪GN∪GD is again the whole graphical
model, G.
The full set of update equations for the BIT-SCA model is shown next.
In the following, all averages, 〈·〉, are with respect to the variational posterior
distribution over S, Q(S | X, θ, Θ⋆), since this is the only latent variable in
the model (all other quantities are estimated at their most probable values).
We take an ‘empirical Bayes’ approach and estimate the hyperparameters
of the priors from the data.
We will present the estimation equations in groups corresponding to the
source, mixing, and noise submodels, G_S, G_M, and G_N, respectively.
Bayesian Iterative Thresholding
Remark 15 Bayesian iterative thresholding forms the core of our blind in-
version algorithm. It imposes the sparsity constraint and incorporates the
sparsity prior in the computation of the posterior quantities.
It operates within the source submodel, GS.
Inference: Sources. The update equations for the sources compute the
posterior sufficient statistics for the node S. For the general case of a
multivariate jointly Gaussian pair of random variables, (x, y), under the
graphical model x → y, with y observed and x latent, the posterior mean
vector is
\[ \mu_{x|y} = \Sigma_{xx} \Big[ \Sigma_{xx}^{-1}\, \mu_x + A^{\mathsf T} \big( A\, \Sigma_{xx} A^{\mathsf T} + \Sigma_{y|x} \big)^{-1} \big( y - A \mu_x \big) \Big] , \tag{4.20} \]
which can be re-written as
\[ \mu_{x|y} = \Sigma_{xx} \Big[ \big( \Sigma_{xx}^{-1} - A^{\mathsf T} \Sigma^{-1} A \big) \mu_x + A^{\mathsf T} \Sigma^{-1} y \Big] , \tag{4.21} \]
in order to better highlight the contributions from the parent and children
nodes of node x, and the posterior covariance matrix is
\[ \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xx} \big( A^{\mathsf T} \Sigma_{yy}^{-1} A \big) \Sigma_{xx} . \tag{4.22} \]
In these equations, Σyy denotes the total noise in the data, which is the sum
of the sensor noise and the ‘system’ noise (i.e. source noise), propagated
through the observation operator, A:
\[ \Sigma \equiv \Sigma_{yy} = \underbrace{\Sigma_{y|x}}_{\text{sensor noise}} + \underbrace{A\, \Sigma_{xx} A^{\mathsf T}}_{\text{system noise}} , \tag{4.23} \]
in order to emphasize the contributions from the parent and children nodes.
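Equations (4.20)–(4.23) are the standard linear-Gaussian conditioning formulas; a sketch verifying them against the equivalent information (precision) form on random toy matrices (all sizes and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
dx, dy = 3, 4
mu_x = rng.standard_normal(dx)
Sxx = 0.5 * np.eye(dx)                       # prior covariance Sigma_xx
A = rng.standard_normal((dy, dx))
Snn = 0.2 * np.eye(dy)                       # observation noise Sigma_{y|x}
y = rng.standard_normal(dy)

# Moment-form posterior of x given y = A x + e, as in Eqs (4.20), (4.22), (4.23)
S_tot = A @ Sxx @ A.T + Snn                  # total covariance, Eq. (4.23)
mu_post = Sxx @ (np.linalg.solve(Sxx, mu_x)
                 + A.T @ np.linalg.solve(S_tot, y - A @ mu_x))
cov_post = Sxx - Sxx @ (A.T @ np.linalg.solve(S_tot, A)) @ Sxx

# Cross-check against the information (precision) form of the same posterior
P = np.linalg.inv(Sxx) + A.T @ np.linalg.inv(Snn) @ A
mu_info = np.linalg.solve(P, np.linalg.inv(Sxx) @ mu_x
                          + A.T @ np.linalg.inv(Snn) @ y)
assert np.allclose(mu_post, mu_info)
assert np.allclose(cov_post, np.linalg.inv(P))
```

The agreement of the two forms is the matrix-inversion (Woodbury) identity at work.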
Applying these to the SCA model, the source means are computed as
\[ \widehat{S} \stackrel{\text{def}}{=} \langle S \rangle = V \Big[ V^{-1} \bar{S} + A^{\mathsf T} \big( A V A^{\mathsf T} + \Sigma_X \big)^{-1} \big( X - A \bar{S} \big) \Big] , \tag{4.24} \]
where the l-th row, s̄_l, of S̄ is given by
\[ l: \quad \bar{s}_l = \Phi c_l \tag{4.25} \]
for l = 1, . . . , L, and can be interpreted as the first-order (mean) ‘message’²¹
from the parents {φ_λ}, weighted by the coefficients c_{l,λ}. The quantity
X − A S̄ is the residual, r, in observation space. The variance of the source
noise, V = diag(v_l), is computed using Eq. (4.29).
Equation (4.21) is to be contrasted with the equivalent deterministic
quantity, (I − AᵀA)s⁽ⁱ⁾ + Aᵀx, which is the argument of the thresholding
operator, S_τ(·), in the estimation equation (3.17). We can see that in order
²¹ For an interpretation of EM-type algorithms as message passing, see Eckford [56] and Dauwels et al. [47].
to get the original IT, the covariance matrices V and Σ must be implicitly
set to be the unit matrices22 IL and ID, respectively23. We can also compare
Eq. (4.24) with the equivalent equation in the classical gradient descent solu-
tion, f (i+1) = η(i)HTg +(I− η(i)HTH
)f (i): the ad-hoc “learning rate”, η(i),
which is, in general, tricky to set, can be explained probabilistically as the
ratio vl/σ2. This is automatically set in this model.
Finally, the second-order posterior sufficient statistic is

\langle SS^\top\rangle = \langle S\rangle\langle S\rangle^\top + N\,C_{SS} ,   (4.26)

where the covariance matrix, C_{SS}, is

C_{SS} = V - V\left(A^\top\Sigma_X^{-1}A\right)V .   (4.27)
Remark 16 Note that the posterior covariance matrix, CSS, is given by
the prior covariance matrix, V, after we subtract the second term of the
equation (4.27): this means that observing the data, X, makes the poste-
rior distribution Q(S) narrower, since we obtain more information about the
system.
Source noise. We estimate the precision hyperparameter (inverse variance)
of the l-th source noise, 1/v_l, using the following statistics of the Gamma
distribution^{24}, Ga(b_{v_l}, c_{v_l}):

c^\star_{v_l} = \tfrac{1}{2}N ,\qquad
b^\star_{v_l} = \left(\tfrac{1}{2}\left[\mathrm{tr}\left(C_{s_l s_l}\right) + \left(\langle s_l\rangle - c_l\Phi\right)\left(\langle s_l\rangle - c_l\Phi\right)^\top\right]\right)^{-1} ,   (4.28)

where N is the cardinality of the set \{s_{l,1},\dots,s_{l,N}\}, which forms
the children of the node v_l, and the quantity in square brackets in the
second equation is the second-order posterior statistic of the residual
information in the node s_l. The source noise variances, v = (v_1,\dots,v_L),
are then computed as

v^\star_l = \left(b^\star_{v_l} c^\star_{v_l}\right)^{-1} ,\qquad l = 1,\dots,L .   (4.29)

^{22} Note, however, that the quantity estimated in Eq. (4.24) is the posterior mean of the random variable S. On the contrary, the sources in the original IT are deterministic quantities, and the source noise "variances" are v_{s_l} \to 0.
^{23} \Sigma = \cdots, E = \cdots.
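In scalar form, the update pair (4.28)–(4.29) collapses to a one-liner: v^\star_l is the average unexplained energy per sample. A hedged sketch (the function name and scalar interface are ours, not the thesis code):

```python
def update_source_noise(trace_C, resid_sq, N):
    """Source-noise update of Eqs (4.28)-(4.29).
    trace_C  : tr(C_{s_l s_l}), total posterior variance of source l
    resid_sq : ||<s_l> - c_l Phi||^2, squared residual of the wavelet fit
    N        : number of samples of source l."""
    c_star = 0.5 * N                             # Gamma shape statistic
    b_star = 1.0 / (0.5 * (trace_C + resid_sq))  # Gamma scale statistic
    return 1.0 / (b_star * c_star)               # v*_l, Eq. (4.29)
```

Note that the result equals (tr C + ||residual||^2)/N, i.e. the mean residual energy, augmented by the posterior variance.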
Wavelet coefficients: Bayesian shrinkage. We now show how the wavelet
coefficients are updated by fusing information from the sparse prior and the
data, i.e. the parent and children nodes of C, respectively. This leads to a
Bayesian derivation of a 'wavelet shrinkage' rule. The resulting update rule
contains soft thresholding as a special case.

The update equation for C is obtained by maximizing F with respect to C.
We have:

C = \arg\max_C \sum_l -\frac{1}{2v_l}\left[-2\langle s_l\rangle\left(c_l\Phi\right)^\top + c_l c_l^\top\right] + \sum_l \log p(c_l|\alpha_l,\beta_l) ,   (4.30)

and

\bar{c}_l = \Phi^{-1} s_l ,\qquad l = 1,\dots,L .   (4.31)
^{24} The Gamma distribution is suitable for modelling scale parameters (positive quantities) and is conjugate to a Gaussian likelihood with unknown mean. For a Gamma-distributed random variable x \sim Ga(b,c) = \frac{1}{\Gamma(c)b^c}\,x^{c-1}e^{-x/b}, its sufficient statistics are \langle x\rangle and \langle\log x\rangle. Here we consider it at the uninformative limit, \bar{c}\to 1, 1/\bar{b}\to 0, where the bars denote information coming from the prior of x. Therefore its parameters are updated using information from the data only.
An important special case is obtained if we use the Laplace prior,

\log p_l(c_l|\beta_l) \propto -\beta_l\|c_l\|_1 = -\beta_l\sum_\lambda |c_{l,\lambda}| ,   (4.32)

corresponding to the ℓ1 prior of Daubechies et al. (where p = 1 is the
smallest allowable value of the exponent of the constraint in the algorithm
of DDD). The update equation for the wavelet coefficients then becomes the
soft-thresholding function S_\tau : \bar{c}\mapsto c in this case (see Fig. 4.6):

c_{l,\lambda} = \mathrm{sgn}(\bar{c}_{l,\lambda})\left(|\bar{c}_{l,\lambda}| - \tau_l\right)\,\mathbb{I}_{|\bar{c}_{l,\lambda}|>\tau^\star_l} ,\qquad \lambda = 1,\dots,\Lambda ,   (4.33)

applied component-wise. In the above equation, \mathbb{I}_{|x|>x^\star} is the
indicator function^{25}, defined as \mathbb{I}(x) = 1 if |x| > x^\star and
\mathbb{I}(x) = 0 otherwise (see Fig. 3.8 for a representative plot). This
operation sets all values of \bar{c}_\lambda in [-\tau^\star, \tau^\star] to 0,
where the optimal value for the threshold, \tau^\star, is computed below.
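For concreteness, the rule of Eq. (4.33) can be written in a few lines of code (an illustrative sketch; the helper name is ours):

```python
import math

def soft_threshold(c_bar, tau):
    """Soft-thresholding rule of Eq. (4.33): coefficients inside
    [-tau, tau] are set exactly to zero; the rest are shrunk by tau
    towards zero, preserving their sign (the consistency property
    discussed in the footnote)."""
    if abs(c_bar) <= tau:
        return 0.0
    return math.copysign(abs(c_bar) - tau, c_bar)
```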
However, a more flexible sparse prior will be used in this model: a
generalization of the above, one that adapts its shape according to the
sparsity of the inferred signal (see Fig. 4.5). In particular, we use a
generalized exponential on c_l, defined as

p_l(c_{l,\lambda}) = \frac{1}{Z(\alpha_l,\beta_l)}\exp\left(-\beta_l|c_{l,\lambda}|^{\alpha_l}\right) ,\qquad \forall c_{l,\lambda}\in c_l ,   (4.34)

where Z(\alpha_l,\beta_l) = \frac{2\Gamma(1/\alpha_l)}{\alpha_l\beta_l^{1/\alpha_l}} is a normalizing constant. This
quantity will play a role later, during the estimation of the shape and scale

^{25} Another reason the indicator function is used in the thresholding function (see Fig. 3.8) is consistency: if an input coefficient \bar{c}^0_\lambda \in [-\tau,\tau] were simply mapped to c^0_\lambda = \mathrm{sgn}(\bar{c}^0_\lambda)(|\bar{c}^0_\lambda| - \tau), then, due to the shift by \tau, negative coefficients would become positive and vice versa. The only way to maintain both conditions of Eq. (4.33) for -\tau \le \bar{c}^0_\lambda \le \tau is c^0_\lambda = 0.
Figure 4.6: Thresholding (shrinkage) functions \bar{c}\mapsto c (output, thresholded coefficient, vs input wavelet coefficient). Blue curve: hard threshold with parameter \tau, h_\tau(\bar{c}); red curve: soft threshold, s_\tau(\bar{c}); green curve: a smooth approximation to the soft threshold, s_{\tau,\delta}(\bar{c}) = \bar{c} - v\beta\,\bar{c}/\sqrt{\bar{c}^2+\delta}, with parameter \tau = \tau(\beta,v), where \bar{c}/\sqrt{\bar{c}^2+\delta}\to\mathrm{sgn}(\bar{c}) as \delta\to 0.
parameters, \alpha_l, \beta_l, from the data. The update equation for the wavelet
coefficients under a generalized exponential prior then becomes

c_{l,\lambda} = \bar{c}_{l,\lambda} - v_l\left(\alpha_l\beta_l|\bar{c}_{l,\lambda}|^{\alpha_l-1}\mathrm{sgn}(\bar{c}_{l,\lambda})\right)\mathbb{I}_{|\bar{c}_{l,\lambda}|>\tau^\star_l} .   (4.35)

Noting that the factor -\alpha_l\beta_l|\bar{c}_{l,\lambda}|^{\alpha_l-1}\mathrm{sgn}(\bar{c}_{l,\lambda}) in the above shrinkage rule is
the 'score function', \psi(c), defined as the logarithmic derivative of
the prior distribution on c_{l,\lambda}, \psi(c) = \frac{d\log p(c)}{dc}, we can rewrite it as

c_{l,\lambda} = \bar{c}_{l,\lambda} + v_l\,\psi_l(\bar{c}_{l,\lambda})\,\mathbb{I}_{|\bar{c}_{l,\lambda}|>\tau^\star_l} ,\qquad \text{where } \bar{c}_{l,\lambda} = \left(\Phi^\top\langle s_l\rangle^\top\right)_\lambda .
We can then give the following interpretation:

Remark 17 The new value of the wavelet coefficient, c_{l,\lambda}, is computed by
adjusting the information coming from the child of the node c_{l,\lambda}, i.e. the
wavelet analysis of the source s_l, by the sparse prior, weighted by the
variance of the system (source) noise. For high-precision sources we have
v_l \to 0, and the input wavelet coefficients pass through the
'non-linearity', \psi(\cdot), almost unchanged; see also the estimation
equations for the threshold, \tau_l, in subsection 4.4.2, used in the above
shrinkage rule, and Fig. 4.6: the shape of the shrinkage curve tends to the
line c = \bar{c} as \tau_l \to 0. By contrast, in high-noise regimes the
second term has a significant impact; the value of the threshold will also be
larger, filtering out all but the largest wavelet coefficients as noise.
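A sketch of the generalized shrinkage rule of Eq. (4.35), written via the score function (our illustration, not the thesis code; for \alpha = 1 it reduces to soft thresholding with \tau = v\beta, matching Eq. (4.46)):

```python
import math

def ge_shrink(c_bar, alpha, beta, v, tau):
    """Shrinkage under the generalized-exponential prior, Eq. (4.35):
    c = c_bar + v * psi(c_bar) outside the dead zone [-tau, tau],
    with score psi(c) = d log p(c)/dc = -alpha*beta*|c|**(alpha-1)*sgn(c)."""
    if abs(c_bar) <= tau:
        return 0.0
    psi = -alpha * beta * abs(c_bar) ** (alpha - 1.0) * math.copysign(1.0, c_bar)
    return c_bar + v * psi
```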
The hyperparameters, (αl, βl), will be estimated in subsection 4.4.2, ‘Adap-
tive Sparsity’, and the value of the threshold, τ ⋆l , will be computed in sub-
section 4.4.2, ‘Thresholds’, for both cases26.
Learning
In blind inverse problems, we must also estimate the observation operator, A.
In addition, in order to perform automatic denoising, the parameter of the
observation noise, ΣX, must be estimated. The rest of the equations are the
precision of the mixing matrix, a, and the updates for the hyperparameters
of the wavelet prior, (α,β). The latter are discussed in subsection 4.4.2.
Under the Bayesian framework, constraints (priors) on the parameters
can naturally be imposed. These enter the corresponding update equations
so as to “steer” the estimated values of the parameters towards preferred
regions of the parameter space.
^{26} The case of the Laplace prior is simply a special case with the shape parameter, \alpha, set to 1.
Mixing model. The update equation for the parameter, A, of the mixing
model, G_A, is given by the implicit equation

A = \left(X\langle S\rangle^\top - \Sigma_X\,a\,A\right)\left\langle SS^\top\right\rangle^{-1} ,   (4.36)

which is obtained by differentiating F with respect to A.

For isotropic observation noise, \Sigma_X = \sigma^2_X I_D, the update
equation can be solved explicitly, to give:

A = \frac{1}{\sigma^2_X}\left(X\langle S\rangle^\top\right)\left(a I_L + \frac{1}{\sigma^2_X}\left\langle SS^\top\right\rangle\right)^{-1} .   (4.37)

In the above equation, the term a I_L is the information coming from the
parent, a, of A. The hyperparameter a corresponds to the Lagrange multiplier
\lambda_A, modulating the corresponding penalty term of the energy
functional, E, if \lambda_A = \sigma^2_X a. Its estimate is given next.
The mixing-matrix precision hyperparameter, a, again assuming a
Gamma-distributed variable, a \sim Ga(a^\star_a, 1/b^\star_a), is estimated as

a = a^\star_a / b^\star_a ,   (4.38)

where

a^\star_a = \tfrac{1}{2}DL \qquad\text{and}\qquad b^\star_a = \tfrac{1}{2}\sum_{d=1}^{D}\sum_{l=1}^{L} (A_{dl})^2 .   (4.39)
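The pair of updates (4.37)–(4.39) can be sketched as follows (an illustrative NumPy fragment under the isotropic-noise assumption; the names are ours, not the thesis implementation):

```python
import numpy as np

def update_mixing(X, S_mean, SSt, sigma2_x, a):
    """Mixing-matrix update for isotropic noise, Eq. (4.37):
    A = (1/sigma2) X <S>^T (a I + (1/sigma2) <S S^T>)^{-1}."""
    L = SSt.shape[0]
    return (X @ S_mean.T / sigma2_x) @ np.linalg.inv(a * np.eye(L) + SSt / sigma2_x)

def update_mixing_precision(A):
    """Gamma update for the mixing-matrix precision, Eqs (4.38)-(4.39)."""
    D, L = A.shape
    a_star = 0.5 * D * L
    b_star = 0.5 * np.sum(A ** 2)
    return a_star / b_star
```

With a = 0 the update collapses to the least-squares solution X\langle S\rangle^\top\langle SS^\top\rangle^{-1}, so in the noiseless case the true mixing matrix is recovered exactly.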
Sensor noise model. The variable \Sigma_X of the sensor noise model,
G_\Sigma, i.e. the sensor noise covariance, is updated next. Optimization of
the negative free energy with respect to \Sigma_X is equivalent to estimating
the observation noise covariance at its maximum-likelihood value^{27}:

\Sigma_X = \frac{1}{N}\left\langle(X - AS)(X - AS)^\top\right\rangle ,   (4.40)

where N is the number of data points^{28}. Another option is to use the
effective sample size, N_{\mathrm{eff}} = N - p, where p is the number of
parameters estimated from the data.

For isotropic observation noise, \Sigma_X = \sigma^2_X I_D, we obtain

\sigma^2_X = \frac{1}{D}\mathrm{Tr}(\Sigma_X) ,   (4.41)
where D is the number of sensors.
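A sketch of the corresponding update (our illustration; it expands the expectation in Eq. (4.40) using the posterior statistics \langle S\rangle and \langle SS^\top\rangle):

```python
import numpy as np

def update_sensor_noise(X, A, S_mean, SSt, N):
    """ML update of the sensor-noise covariance, Eqs (4.40)-(4.41):
    <(X-AS)(X-AS)^T> = XX^T - X<S>^T A^T - A<S>X^T + A<SS^T>A^T."""
    E = X @ X.T - X @ S_mean.T @ A.T - A @ S_mean @ X.T + A @ SSt @ A.T
    Sigma_X = E / N
    sigma2_x = np.trace(Sigma_X) / X.shape[0]   # isotropic case, Eq. (4.41)
    return Sigma_X, sigma2_x
```

When the data are explained exactly (X = AS and no posterior uncertainty), the estimated covariance vanishes, as it should.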
Adaptive sparsity: estimating the shape and width of the generalized
exponentials

In the original IT algorithm of DDD, the exponent, p, of the ℓp penalty on
the wavelet coefficients, and the corresponding Lagrange multiplier,
\lambda_C, were chosen a priori. The hyperparameters, (\alpha_l, \beta_l), of
the wavelet coefficient model can be learned from the data as well, by
maximizing the variational objective, F, with respect to these
hyperparameters.

For the Laplace prior distribution (\alpha^\star_l \equiv 1) on the l-th
source coefficients, \{c_{l,\lambda}\}_\lambda, which corresponds to an ℓ1
norm, an estimate of \beta_l based on the current values of the wavelet
coefficients is

\beta^\star_l = \frac{\sqrt{2}}{\sqrt{\frac{1}{\Lambda}\sum_\lambda (c_{l,\lambda})^2}} ,\qquad\text{where } \Lambda = |\{c_{l,\lambda}\}_\lambda| ,\ \text{for all } l .   (4.42)

^{27} Strictly speaking, \Sigma_X is a hyperparameter positioned at one of the roots of the graph, according to our model specification. To be a parameter proper, it should also be assigned a prior, such as an inverse Gamma distribution, \Sigma_X \sim IGa(b_{\Sigma_X}, c_{\Sigma_X}). Following Ghahramani and Beal [73], we consider it at the limit b_{\Sigma_X}\to\infty and c_{\Sigma_X}\to 0 here. In practice we recover some of the "subjective information" provided by the prior by setting a minimum allowed variance for the noise, \sigma^2_{X,\min}, estimated using a modified 'universal' threshold (see Donoho [54]). This works quite well in practice. The model could be extended to cover the informative-prior case.
^{28} This is the total number of observations per sensor. For example, in the case of images, N = N_x \cdot N_y \cdots.
Although choosing the exponent p (equivalent to our \alpha) greater than or
equal to 1, as in Daubechies et al. [45] and in SCA-IT, leads to a convex
problem, we found that it is not flexible enough. Moulin and Liu studied
'Bayesian robustness' (i.e. the effect of misspecifying the prior) with
respect to sparsity and wavelets [129], using the prior
\pi(\theta) = a\exp(-|b\theta|^\nu) and concluding that "the shape parameter,
\nu, is typically between 0.5 and 1". Our experiments support their finding.
Leporini and Pesquet (2001) also consider the corresponding exponent, r > 0.
The choice \alpha < 1 has also been dealt with in Kreutz-Delgado et al.
[100]. We do not put any restriction on the value of \alpha, other than
\alpha > 0, and let it be inferred by the algorithm^{29}.
For the generalized exponential prior, which corresponds to an ℓp norm, we
follow Everson and Roberts [58] (the 'flexible nonlinearity' approach) and
estimate the hyperparameters \alpha and \beta at their most probable values.
The relevant part of the NFE to be optimized is

F[\cdots;\alpha,\beta] = \sum_l\left[-\beta_l\sum_\lambda|c_{l,\lambda}|^{\alpha_l} + \Lambda\underbrace{\left(\log\alpha_l + \frac{1}{\alpha_l}\log\beta_l - \log\Gamma(1/\alpha_l) - \log 2\right)}_{-\log Z_l}\right] .

The two-dimensional parameter space (\alpha,\beta) (for each latent
dimension, l) is then optimized as follows. Setting the derivative with
respect to \beta_l to zero gives a one-parameter family of curves

\beta_l = \beta_l(\alpha_l) = \frac{1}{\alpha_l\,\frac{1}{\Lambda}\sum_\lambda|c_{l,\lambda}|^{\alpha_l}} .   (4.43)

^{29} In practice, for convergence/convexity reasons, we employ a two-phase scheme: we let \alpha be 1 during the first stages of learning, and let it be freely adapted and refined at the later stages, when the algorithm has reached a stable regime.
We can then solve the one-dimensional optimization problem dF/d\alpha = 0 by
taking the total derivative

\frac{dF}{d\alpha} = \frac{\partial F}{\partial\alpha} + \frac{\partial F}{\partial\beta}\frac{d\beta}{d\alpha} = 0 ,

which reduces to the first term alone if we search along the curve
\frac{\partial F}{\partial\beta} = 0. Then,

\frac{\partial F}{\partial\alpha_l} = \frac{1}{\alpha_l} - \frac{1}{\alpha_l^2}\log\beta_l + \frac{1}{\alpha_l^2}\Psi(1/\alpha_l) - \beta_l\frac{1}{\Lambda}\sum_\lambda|c_{l,\lambda}|^{\alpha_l}\log|c_{l,\lambda}| = 0 ,   (4.44)

where \Psi(\cdot) is the digamma function. The above one-dimensional problem
can be solved using any standard nonlinear optimizer. This procedure gives
the optimal values \{(\alpha^\star_l, \beta^\star_l)\}_{l=1}^{L}.
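The profile-and-line-search procedure above can be sketched as follows. This is an illustrative stand-in only: it replaces the nonlinear root solve of Eq. (4.44) with a simple grid search over \alpha, with \beta eliminated via the curve of Eq. (4.43); all names are ours:

```python
import math

def profiled_objective(alpha, coeffs):
    """Per-coefficient NFE contribution with beta set on the curve of
    Eq. (4.43), so that dF/dbeta = 0 holds by construction; this equals
    the average log-likelihood of the generalized exponential."""
    m = sum(abs(c) ** alpha for c in coeffs) / len(coeffs)
    beta = 1.0 / (alpha * m)                       # Eq. (4.43)
    # on the profile curve, -beta * mean(|c|^alpha) = -1/alpha
    return (-1.0 / alpha + math.log(alpha) + math.log(beta) / alpha
            - math.lgamma(1.0 / alpha) - math.log(2.0))

def fit_ge_prior(coeffs, lo=0.2, hi=3.0, steps=2000):
    """Maximize the profiled objective over alpha by grid search, then
    recover the matching scale beta from Eq. (4.43)."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    alpha = max(grid, key=lambda a: profiled_objective(a, coeffs))
    m = sum(abs(c) ** alpha for c in coeffs) / len(coeffs)
    return alpha, 1.0 / (alpha * m)
```

On (near-ideal) Laplace-distributed coefficients the fit recovers a shape close to \alpha = 1, as expected.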
We will also need to compute the variance of the wavelet coefficients,
\sigma^2_{c_{l,\lambda}}. From the generalized exponential density,
c_{l,\lambda} \sim GE(\alpha_l,\beta_l) = \frac{1}{Z}e^{-|\beta_l c_{l,\lambda}|^{\alpha_l}},
with \frac{1}{Z} = \frac{\beta_l\alpha_l}{2\Gamma(1/\alpha_l)}, \forall\lambda\in\{1,\dots,\Lambda\}, we get

\sigma^2_{c_{l,\lambda}} = \left(\frac{B_l}{\beta_l}\right)^{2} ,\qquad\text{where } B_l = \left(\frac{\Gamma(3/\alpha_l)}{\Gamma(1/\alpha_l)}\right)^{1/2} .   (4.45)
Thresholds

A key feature of the model is that it automatically sets the thresholds,
\tau_l, so as to balance signal reconstruction accuracy and sparseness. The
final reconstruction is an optimal, under the model, trade-off between
noiseless/smooth and over-blurred results.
For the Laplace prior, the optimal value of the threshold, \tau^\star_l, in
the soft-thresholding rule of equation (4.33) is computed as

\tau^\star_l = v^\star_l\,\beta^\star_l ,\qquad l = 1,\dots,L ,   (4.46)

where the optimal values of the hyperparameters \beta_l and v_l (starred) are
also estimated by the model. These are computed in subsections 4.4.2 and
4.4.2, respectively.

For the generalized exponential prior, the threshold value of the shrinkage
function S_\tau : \bar{c}\mapsto c of equation (4.35) does not have a simple
expression. In this case, the 'cutoff values'
\tau^\star_l = \tau_l(\alpha^\star_l, \beta^\star_l, v^\star_l), defined as
the values of \bar{c} for which c is zero, must be computed as the roots of
the one-dimensional curve x - v\alpha\beta\,\mathrm{sgn}(x)|x|^{\alpha-1} = 0,
\beta \ge 0, where x \equiv \bar{c}. Since this is symmetric around 0, it
suffices to solve for x \ge 0. Then \mathrm{sgn}(x) = 1 and
|x|^{\alpha-1} = x^{\alpha-1}, so the final equation becomes
x - v\alpha\beta x^{\alpha-1} = 0. This is a non-linear equation in x, and we
solve it using a numerical method^{30}.
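A sketch of such a solve by bisection (our illustration, standing in for the fsolve call mentioned in the footnote; it assumes 0 < \alpha < 2 and v\alpha\beta > 0, in which case the positive root is unique and also available in closed form as (v\alpha\beta)^{1/(2-\alpha)}, used below as a check):

```python
def threshold_cutoff(alpha, beta, v, x_hi=1e6, iters=200):
    """Bisection for the cutoff tau* solving x - v*alpha*beta*x**(alpha-1) = 0
    on x > 0.  Assumes the bracket [1e-12, x_hi] straddles the root, i.e.
    f(1e-12) < 0 < f(x_hi), which holds for typical (alpha, beta, v)."""
    f = lambda x: x - v * alpha * beta * x ** (alpha - 1.0)
    lo, hi = 1e-12, x_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```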
4.4.3 Computing the SCA free energy

From the definition of the negative free energy, F, Eq. (4.15), and the
directed graphical model structure (Figs 4.2, 4.3), the negative free energy
decomposes into terms related to the observations, latent variables and model
parameters, respectively, plus an additional term that is the entropy of the
latent variables with respect to their variational posterior, Q(S). We have:

F = F_X + F_S + F_C + F_A + F_H   (4.47)

^{30} For example using the function fsolve of GNU Octave.
where the individual contributions are computed as
F_X = \sum_n\left[-\tfrac{1}{2}D\log(2\pi) - \tfrac{1}{2}D\log(\sigma^2_X) - \tfrac{1}{2}\mathrm{Tr}\left[\tfrac{1}{\sigma^2_X}\left(x_n x_n^\top - 2\,x_n\langle s_n^\top\rangle A^\top + A\langle s_n s_n^\top\rangle A^\top\right)\right]\right] ,   (4.48)

F_S = \sum_{l=1}^{L}\left[-\tfrac{1}{2}N\log(2\pi) - \tfrac{1}{2}N\log(v^\star_l) - \tfrac{1}{2v^\star_l}\left(\left(N\sigma^2_{S_{l,l}} + \mu_{s_l}\mu_{s_l}^\top\right) - 2\langle s_l\rangle\bar{s}_l^\top + \bar{s}_l\bar{s}_l^\top\right)\right] ,   (4.49)

F_C = \sum_{l=1}^{L}\left[\Lambda\left(-\log 2 - \log\Gamma(1/\alpha^\star_l) + \log\alpha^\star_l + \tfrac{1}{\alpha^\star_l}\log\beta^\star_l\right) - \beta^\star_l\sum_\lambda|c_{l,\lambda}|^{\alpha^\star_l}\right] ,   (4.50)

F_A = -\tfrac{1}{2}DL\log(2\pi) + \tfrac{1}{2}DL\log(a^\star) - \tfrac{1}{2}a^\star\sum_d\sum_l A_{dl}^2 ,   (4.51)

F_H = \tfrac{1}{2}N\left(L\log(2\pi) + L + \log|\det\Sigma_{SS}|\right) .   (4.52)
4.4.4 Algorithm

The pseudo-code for the variational mean-field Bayesian iterative
thresholding algorithm is shown below (Algorithm 3):

Algorithm 3 BIT (MF)
1:  Whiten the observations, X
2:  Let \bar{\delta F} = 10^{-7}, \delta F_{min} = 10^{-7}, t_{max,inner} = 1000, t_{max} = 10000, epoch ← 1
3:  Initialize: let S ← S^{(0)}, θ ← θ^{(0)}, Θ ← Θ^{(0)}
4:  while δF > δF_{min} and t_{sum} < t_{max} do
5:    t ← 1
6:    while δF > \bar{\delta F} and t < t_{max,inner} and t_{sum} < t_{max} do
7:      Calculate \bar{S} from Eq. 4.25
8:      Infer source sufficient statistics from Eqns 4.24 and 4.26
9:      Re-estimate the mixing matrix, A, from Eq. 4.37
10:     Calculate \bar{C} from Eq. 5.40
11:     Learn wavelet coefficients, C, from Eq. 4.30
12:     Calculate the negative free energy, F, from Eqns 4.47 to 4.52
13:     Let δF = |(F − F_{old})/F_{old}|
14:     t ← t + 1
15:   end while
16:   Let t_{sum} = t_{sum} + t
17:   Let epoch ← epoch + 1
18:   Anneal: \bar{\delta F} = 0.1 \bar{\delta F}
19:   Estimate wavelet prior hyperparameters, (α, β), from Eqns 4.43 to 5.42
20:   Estimate the sensor noise variance, Σ_X, from Eq. 4.41
21:   Estimate the system noise precision, v^\star, from Eq. 4.29
22:   Compute threshold, τ, by numerically solving x − vαβx^{α−1} = 0 and τ^\star = τ^\star(α^\star, β^\star, v^\star)
23:   Estimate the precision of the mixing matrix, a^\star, from Eqns 4.38 and 4.39
24: end while
4.5 Discussion
4.5.1 Comparison of ordinary and Bayesian iterative
thresholding
First, we note that the original version of IT, Eq. (3.17), does not regard
the sources as latent variables, but instead operates directly on the unknown
signal, s, and its wavelet transform, s. The output of the algorithm is a point
estimate of the unknown signals. Moreover, the smoothing on s is performed
(by the shrinkage rule) using prior information from the ℓp penalty only. In
the Bayesian version, some of the “roughness” in s is attributed to system
noise, and the rest to the complexity of the wavelet representation (via the
wavelet coefficients) itself.
Equation (3.17) was derived by exchanging the original optimization
functional, I, for a sequence of surrogate variational functionals, \bar{I}.
In our version, the data likelihood is lower-bounded by the negative free
energy. Due to the strict lower bound and the convergence properties of the
variational EM algorithm, the BIT algorithm is guaranteed to converge. The
original IT also has guaranteed convergence properties; our algorithm,
moreover, is guaranteed to increase the NFE at each iteration.
The estimation equations (4.24)–(4.35) and (4.40), (4.41) generalize the
equations of the deterministic iterative thresholding algorithm by including
the precisions of the sources and sensors, 1/v_l and 1/\sigma^2_X,
respectively. The deterministic version of this algorithm does not take
uncertainty into account, and does not compute expectations of variables over
their corresponding posteriors: the equivalent of Eqns (4.24) and (4.26) in
standard IT is a 'point estimate' of the unknown signal, s, at the current
iteration step^{31}. The update equations (4.24) and (4.26), on the other
hand, provide a "softer" estimate of s, by including an estimate of the
uncertainty in the sensors and sources.
Remark 18 An interpretation of Eq. 4.24 (see also Fig. 3.10) is the follow-
ing. Each node sl combines two sources of information: one from its parent
nodes, ϕλ, representing our prior model/assumptions for the unknown sig-
nals, weighted by cl,λ and captured in the ‘message’ s, and one from its
^{31} Essentially, the original IT algorithm iterates the current value of the unknown signal via the fixed-point equation s^{(n+1)} = F(s^{(n)}), where the argument to the thresholded Landweber operator, F, is the previous value of the sought-for (unknown) signal; see Eqns 3.17 and 3.19. It does not "fuse" information from other parts of the model.
children nodes, x_d, representing the measured data, weighted by A_{d,l} and
captured in the residuals, r. It then weighs those messages according to
their precisions, 1/v_l and 1/\sigma^2_X, respectively, in order to fuse
these two sources of information together and update its "state", captured in
its sufficient statistics.
Second, in this text we are interested in blind inverse problems; therefore,
we compute an estimate of the observation operator, A, itself. The Bayesian
approach allows us to treat the mapping from unknowns to observations (i.e.
the mixing matrix) as an uncertain quantity as well, and put a prior on it. It
also allows us to compute its width hyperparameter directly from the data.
We also estimate the parameters of the observation noise model, ΣX.
Third, the original variational functional E is parameterised by two La-
grange multipliers, λC and λA, for the two penalty terms corresponding to
C and A, respectively: setting these correctly is crucial for the successful
estimation of the unknowns. Importantly, we also estimate the hyperparam-
eters of the priors, (αl, βl), and the hyperparameter v of the ‘system noise’
model, which determines the strength (variance) of the noise on the sources.
The scale parameters βl and a determine the amount of penalization of the
two priors, and they are the probabilistic quantities corresponding to the La-
grange multipliers of the original functional, I. These in turn determine the
thresholds, \tau_l = \tau_l(\alpha_l, \beta_l, v_l), which are of the utmost
importance for the precise reconstruction of the unknown signals.
4.6 Experimental Results
4.6.1 BIT-SCA: Simulation Results
This subsection illustrates blind source separation using the BIT-SCA
algorithm on the ThreeSources dataset of [147], [148], for a typical run. Two
fragments of music (512 samples) sampled at f_s = 11.3 kHz were mixed with a
fragment of an EEG signal and corrupted with random Gaussian noise with power
10% of the signal power, to give three observed signals. This is intended as
a comprehensive investigation of the behaviour of the algorithm with respect
to noise, sparsity, etc.
The algorithm was initialized using a sphering transformation. In par-
ticular, the wavelet coefficients of the sphered observations were computed
and the initial value of the threshold, τ , was set using the ‘universal thresh-
olding’ method of Donoho [54]. The initial value of the sensor noise variance
was computed from the median absolute deviation of the coefficients at the
smallest scale (usually regarded as noise in the classical approach), and the
source noise variance was set to a fraction (e.g. 1/10) of the sensor noise
variance. The Daubechies D8 wavelet was used for this test; we found that it
provides a good balance between local support and smoothness.
We applied an adaptive convergence-monitoring scheme. In particular, an
"annealing" scheme was used that controlled the convergence tolerance of the
algorithm, defined as the difference in the value of the negative free
energy, F, between two iterations: at initialization the tolerance was set to
a small value (10^{-4}, say) by default, and it became stricter by a power of
10 at each subsequent epoch. This allowed freer exploration during the first
stages of a run and more restricted exploration later, as we converged to the
solution. Neither the number nor the length of the epochs was set in advance;
instead, they were adaptively reset at every run by this scheme. During the
first epoch, we fix
the hyperparameter \alpha_l to 1. The algorithm converged after 1277
iterations, in less than 10 s on a 2 GHz machine^{32}.
The reconstruction of the sources under 10% additive random noise is
shown in figure 4.7. The results show that BIT-SCA recovered the original
sources with high accuracy. Figure 4.8 shows the scatterplot of the original
vs the reconstructed sources, \{(s_{l,n}, \hat{s}_{l',n})\}_{n=1}^{N}, where l, l' = 1,\dots,L.
Numerical measures of source reconstruction are shown next. The correlation
coefficients between the true and reconstructed sources are r_{11} =
0.964401, r_{22} = 0.965299, r_{33} = 0.962916 (cf. Fig. 4.8). The relative
modelling error, \varepsilon_\eta = \|s_i - \hat{s}_i\|/\|s_i\|, was: s_1:
0.267, s_2: 0.265, s_3: 0.273. The source reconstruction error,
\varepsilon_{rec}, was: s_1: 0.071084, s_2: 0.069898, s_3: 0.074145. Finally,
the crosstalk between the sources, \varepsilon_{xtalk}, was 0.028913. The
sparsity of the sources is s_1: 53.71%, s_2: 58.01%, s_3: 67.97% non-zero
coefficients, respectively.
The wavelet-based prior leads to a very sparse reconstruction of the
sources, as is evident from the histograms of the estimated coefficients,
\{c_{l,\lambda}\}_\lambda. Figure 4.9 shows a stem-plot of the coefficients,
and figure 4.10 their histograms. In particular, the hyperparameters of the
wavelet model were estimated as (\alpha_1, \beta_1) = (0.28745, 5.6829),
(\alpha_2, \beta_2) = (0.28457, 5.7235), and (\alpha_3, \beta_3) = (0.38163, 4.3601).
Figure 4.11 shows the evolution of the elements, \hat{I}_{ii'}, i, i' = 1,\dots,L, of
the matrix

\hat{I} \triangleq \left(\left(\hat{A}^\top\hat{A}\right)^{-1}\hat{A}^\top\right)A

during the run; perfect estimation gives a unit matrix. The matrix after

^{32} Using GNU Octave ver. 3.2.3 and the ATLAS libraries running under Linux.
Figure 4.7: 'ThreeSources' dataset, 10% noise. Source reconstruction using the BIT-SCA algorithm (panels: Sources 1–3, true vs estimated). In order to ease comparison, the reconstructed sources are overlaid with the original ones.
Figure 4.8: 'ThreeSources' dataset, 10% noise. Scatterplot of original vs reconstructed sources via BIT-SCA.
Figure 4.9: 'ThreeSources' dataset, 10% noise. Wavelet coefficients of the reconstructed sources (Sources 1–3).
Figure 4.10: 'ThreeSources' dataset, 10% noise. Empirical histograms of the reconstructed sources estimated by BIT-SCA (Sources 1–3).
Figure 4.11: 'ThreeSources' dataset, 10% noise. Evolution of the elements, \hat{I}_{ii'}, of the matrix \hat{I} = (\hat{A}^\top\hat{A})^{-1}\hat{A}^\top A_{true} during a typical BIT-SCA run.
convergence is

\hat{I} = \begin{pmatrix} 0.979 & 0.011 & -0.011 \\ 0.018 & 0.996 & 0.007 \\ -0.034 & -0.012 & 0.970 \end{pmatrix} .
The final angle difference between the original and estimated mixing vectors,
\theta_{ll} = \angle(a_l, \hat{a}_l), is \theta_{11} = 2.674, \theta_{22} =
0.894, \theta_{33} = 0.629. Figure 4.12 shows the evolution of the Amari
distance, D_A [81], during the run, measuring the distance between the true
and estimated matrix, up to permutations and scalings:

D_A(A,B) = \frac{1}{2N}\left(\sum_{i=1}^{N}\sum_{j=1}^{N}\left[\frac{|(AB^{-1})_{ij}|}{\max_{j'}|(AB^{-1})_{ij'}|} + \frac{|(AB^{-1})_{ij}|}{\max_{i'}|(AB^{-1})_{i'j}|}\right] - 2N\right) .
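The definition above can be coded directly (an illustrative helper, ours; it assumes B is invertible):

```python
import numpy as np

def amari_distance(A, B):
    """Amari distance between mixing matrices A and B: zero iff A B^{-1}
    is a scaled permutation, i.e. B equals A up to row permutation/scaling."""
    P = np.abs(A @ np.linalg.inv(B))
    N = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum()   # row-wise normalization
    cols = (P / P.max(axis=0, keepdims=True)).sum()   # column-wise normalization
    return (rows + cols - 2 * N) / (2 * N)
```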
Finally, figure 4.13 shows the evolution of the negative free energy, F,
Figure 4.12: Learning curve. 'ThreeSources' dataset, 10% noise. Evolution of the Amari distance, D_A, between the true and estimated mixing matrix during a typical BIT-SCA run; final value D_A = 0.01563.
during the run. Note that the jumps at iteration numbers 406 and 814 are due
to the re-estimation of the hyperparameters at the end of each epoch, and
that F is always monotonically increasing, with F_{final} = -0.19199 \times 10^4.
Figure 4.13: 'ThreeSources' dataset: BIT-SCA: evolution of the negative free energy, F, over 1277 iterations (3 epochs); F_{final} = -0.19199 \times 10^4.
4.6.2 Extraction of weak biosignals in noise
We consider again the application of BIT-SCA to the functional MRI dataset
of Friston, Jezzard, and Mathews [71], described in Sect. 3.4.3. We ran the
BIT-SCA algorithm on the dataset in order to extract the Consistently
Task-Related (CTR) components. The component whose corresponding timecourse
correlated most with the stimulus/expected timecourse was identified as the
CTR component, provided that its correlation coefficient, r, was higher than
0.3. The resulting spatial map and corresponding timecourse are shown in
Figs 4.14 and 4.15. The correlation coefficient was r = 0.941 for this
experiment.
To quantitatively assess the quality of the separation we plotted the ROC
curve using the result from the GLM as our ground truth. This is shown in
Fig. 4.16. The AUC was A = 0.893 for this experiment. We can see that the
result is better, to some degree, than the energy-based IT algorithm (where
Figure 4.14: Spatial map s_1 from BIT-SCA. The component whose corresponding timecourse correlated most with the stimulus/expected timecourse was identified as the CTR component.
Figure 4.15: Timecourse a_1 from BIT-SCA, corresponding to the spatial map s_1 (thick curve). The time-paradigm, \varphi(t), is the blue curve, and the expected timecourse, h(t)\otimes\varphi(t), is the green one.
Figure 4.16: Receiver Operating Characteristic (ROC) curve between the BIT-SCA-derived CTR map and the corresponding map from SPM.
the corresponding results were r = 0.940 and AUC A = 0.861). That is, the
correlation coefficient for the time-course was only slightly better, while the
AUC for the spatial map was improved. This can be attributed to the use of
the more adaptive model on the wavelet coefficients of the spatial maps and
the inclusion of the state noise model.
Chapter 5
Variational Bayesian Learning
for Sparse Matrix Factorization
In this Chapter1 we present a fully Bayesian bilinear model for factorizing
data matrices under sparsity constraints with special emphasis on spatio-
temporal data such as fMRI. We exploit the sparse representation of the
spatial maps in a wavelet basis and learn a set of temporal “regressors” that
best represent, i.e. are maximally relevant to, the source timecourses.
5.1 Sparse Representations and Sparse Ma-
trix Factorization
A fundamental operation in data analysis often involves the representation of
observations in a ‘latent’ signal space that best reveals the internal structure
1A shorter version of this Chapter was published as E. Roussos, S. Roberts, and I.Daubechies. Variational Bayesian Learning of Sparse Representations and Its Applicationin Functional Neuroimaging. In Springer Lecture Notes on Articial Intelligence (LNAI)Survey of the State of The Art Series, 2011.
180
of the data. Inversion of the standard linear model of component analysis,
viewed as a data representation model, forms a mapping from the space of
observations to a space of latent variables of interest. In particular, in its use
as a multi-dimensional linear model for source separation, the observation
space, spanned by the canonical basis2 of RD (for a D–dimensional obser-
vation space), is transformed into a space spanned by the columns of the
‘mixing’ matrix. This new basis should, hopefully, reveal an intrinsic quality
of the data. In the linear model for brain activation [172], for example,
the spatio-temporal data, X = X(t, v), where v is a voxel in a brain volume,
\Omega \subset \mathbb{R}^3, and t is a time-point (t = 1,\dots,T), is
modelled as a linear superposition of different 'activity patterns':

X(t, v) \approx \sum_{l=1}^{L} S_l(v)\,A_l(t) ,   (5.1)
where A_l(t) and S_l(v) respectively represent the dynamics and spatial
variation of the lth activity. Our goal is the decomposition of the data set
into spatio-temporal components, i.e. pairs \{(a_l, s_l)\}_{l=1}^{L}, such
that the "regressors" \{a_l\}_{l=1}^{L} capture the 'time courses' and the
coefficients \{s_l\}_{l=1}^{L} capture the 'spatial maps' of the patterns of
activation.
as the general linear model (GLM), in the ‘model-free’ case [15], addressed
here, both factors must be learned from data, without a-priori knowledge of
their exact spatial or temporal structure. The problem described above is
ill-posed, however, without additional constraints. These constraints should
capture some domain-specific knowledge (and be expressible in mathemat-
ical form). In the context of neuroimaging data analysis, Woolrich et
al. [171] developed a Bayesian spatio-temporal model for functional MRI
employing adaptive haemodynamic response function (HRF) inference and
^{2} The canonical basis, B_I, of \mathbb{R}^D is comprised of the vectors \{e_d\}_{d=1}^{D}, where e_d is the unit vector (0,\dots,0,1,0,\dots,0), with 1 in the dth position and 0 elsewhere.
noise modeling in the temporal and spatial domain, accommodating noise
non-stationarities in different brain areas.
The main tool for exploratory decompositions of neuroimaging data into
components currently in use is independent component analysis (ICA). As its
name suggests, ICA forces statistical independence in order to derive maxi-
mally independent components. The bilinear decomposition described above
is related to the blind source separation (BSS) problem [125], [15]. Indeed,
blind separation of signals into components is inherently a bilinear decom-
position. Nevertheless, many component analysis algorithms for solving the
BSS problem, such as classical “independent” (least-dependent) component
analysis, make various simplifying assumptions in order to solve the mul-
tiplicative ill-posedness of bilinear decomposition [87], [148]. Infomax ICA
and FastICA, the two most popular ICA algorithms for fMRI data analysis,
restrict the problem to the noiseless, square case and estimate an unmixing
matrix using an optimization criterion that maximizes statistical indepen-
dence among the components, or, in practice, appropriate proxy functionals
[16], [89]. This matrix is then fixed and used to estimate the ‘sources’ by
operating directly on the observation signals. Probabilistic ICA (PICA) [15],
a state-of-the-art model developed for the analysis of fMRI data, first reduces
the dimensionality of the problem by estimating a lower-dimensional signal
subspace using Bayesian model order selection for PCA [126], regarding the
additional dimensions as noise, and then performs ICA on it.
However, despite its success, there are both conceptual and empirical is-
sues with respect to using the independence assumption as a prior for brain
data analysis [123], [46]. In particular, as discussed previously in this Thesis,
there is no physical or physiological reason for the components to correspond
to different activity patterns with independent distributions. Recent exper-
imental and theoretical work in imaging neuroscience [46], [30] reveals that
activations inferred from functional MRI data are sparse, in some appropri-
ate sense. In fact, in Daubechies et al. [46] the key factor for the success of
ICA-type decompositions, as applied to fMRI, was identified as the sparsity
of the components, rather than their mutual independence. We shall exploit
this sparseness structure for bilinear decomposition.
In matrix form, the problem of sparsely representing the data set X becomes a problem of sparse matrix factorization of its corresponding data matrix, X (Srebro and Jaakkola [159], Dueck et al. [55])3. "Unfolding" the volume of each scan, {X(t,v)}_{v∈V}, at time-point t, as a long vector4 x_t ∈ R^N, where N = |V| is the total number of voxels, and storing it into the corresponding row of X, for all t = 1, . . . , T, the aim is to factorize the resulting T × N data matrix,

X_{T\times N} =
\begin{pmatrix}
x_{1,1} & \cdots & x_{1,N} \\
\vdots  & \ddots & \vdots  \\
x_{T,1} & \cdots & x_{T,N}
\end{pmatrix},
into a pair of matrices, (A, S), plus noise (see also Fig. 5.1):

X_{T\times N} \approx A_{T\times L} \cdot S_{L\times N} . \qquad (5.2)
The set {a_l}_{l=1}^{L} of prototypical timeseries contains the basis and spans the column space of the data matrix. The nth column of S, s_n, contains the coefficients {s_{l,n}}_{l=1}^{L}, which combine these vectors in order to 'explain' the nth column of X. Dually, the lth row of S, s_l^T, contains the lth spatial
3 Srebro and Jaakkola and Dueck et al. consider the sparse representation case in which we seek a minimal number of representation coefficients. This corresponds to an ℓ0 penalty. We do not restrict ourselves to this case in this Chapter, but rather learn the penalty adaptively from the data.
4 The unfolding operation, V ↦ I_x, where V is a volume and I_x denotes the set of linear indices of the vector x, is a bijection. We can always recover the volume-valued data using the inverse operation.
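To make the unfolding operation concrete, the following minimal numpy sketch flattens a hypothetical stack of T volumes into the T × N data matrix and recovers it with the inverse operation; the grid sizes are illustrative, not from the thesis.

```python
import numpy as np

# Sketch of the unfolding operation described in footnote 4: each scan,
# a 3-D volume on an (nx, ny, nz) grid, becomes one row of the T x N
# data matrix X. All sizes here are hypothetical.
T, nx, ny, nz = 5, 4, 4, 3
volumes = np.random.randn(T, nx, ny, nz)

N = nx * ny * nz                      # total number of voxels, N = |V|
X = volumes.reshape(T, N)             # row t is the unfolded scan at time t

# The unfolding is a bijection: the inverse map recovers the volumes.
recovered = X.reshape(T, nx, ny, nz)
assert np.array_equal(recovered, volumes)
```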
Figure 5.1: Matrix factorization view of component analysis, as a bilinear decomposition model for spatio-temporal data. The T × N observation matrix, X, is decomposed into a pair of matrices, (A, S), plus noise, such that the matrix A contains the dynamics of the components, {A_l(t)}_{l=1}^{L}, in its columns and the matrix S the spatial variation, {S_l(v)}_{l=1}^{L}, in its rows. Both A and S are to be estimated by the model. Our aim is to perform a sparse matrix factorization of X, such that both the basis set, {a_l}_{l=1}^{L}, and the resultant maps, s_l, for all l, are sparse in some appropriate sense.
map. Our objective is to invert the effect of the two factors, coupled by
Eq. (5.2). In addition, if the data lies in a lower-dimensional subspace of
the observation space, that is to say, the number of latent dimensions is less
than the number of observation dimensions, L < min{T, N}, as is very often
the case in real-world datasets, then the method should also seek a low-rank
matrix factorization.
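As a quick numerical illustration of the low-rank situation (a sketch with synthetic sizes, not thesis data): data generated by a bilinear model with L latent dimensions has only L non-negligible singular values, so L < min{T, N} is visible in the spectrum.

```python
import numpy as np

# Synthetic bilinear data X = A S with L latent dimensions, L < min{T, N}:
# only L singular values of X are non-negligible, so the factorization can
# (and should) be low-rank. Sizes are illustrative.
rng = np.random.default_rng(0)
T, N, L = 30, 200, 4
A = rng.normal(size=(T, L))          # time-courses
S = rng.normal(size=(L, N))          # spatial maps
X = A @ S                            # noiseless bilinear data

sv = np.linalg.svd(X, compute_uv=False)
rank = int(np.sum(sv > 1e-8 * sv[0]))
assert rank == L                     # only L directions carry energy
```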
The sparse decomposition problem, modelled as a blind source separa-
tion problem, can in some cases be solved by algorithms such as independent
component analysis, viewed as a technique for seeking sparse components,
assuming appropriate distributions for the sources. Indeed, as shown by
Daubechies et al. [46], ICA algorithms for fMRI work well if the compo-
nents have heavy-tailed distributions. However, ICA approaches are not
optimized for sparse decompositions. As also demonstrated in [46], analyses
based on the concept of mathematical independence are not well adapted
for fMRI data. Surprisingly, ICA failed to separate components in experi-
ments in which they were designed to be independent. Indeed, the key factor
for the success of ICA-type decompositions, as applied to fMRI, was identified as the sparsity of the components, rather than their mutual independence.
More importantly, as previously said, from a neuroscientific point of view
there is no physical or physiological reason for the spatial samples to cor-
respond to different activity patterns Sl(v) with independent distributions.
The complexity of the observed brain processes is highly unlikely to be due
to underlying causes that are independent, in the mathematical definition of
the term. There is no physical reason why these latent processes should not
interact in complex and unforeseen ways.
We propose to construct a matrix factorization model that explicitly op-
timizes for sparsity, rather than independence. Several neuroscientific as
well as mathematical results support the choice of the sparsity constraint.
As pointed out by Li, Cichocki and Amari [112], [111], sparse represen-
tation can be very effective when used in blind source separation. Sparse
representation is also an area of considerable recent research in the machine
learning and statistics communities, with algorithms such as basis pursuit
[32], LASSO [164], and elastic net [178].
In the field of vision, Field studied the relationship between the statistics
of natural scenes and coding of information in the visual system [63]. He
argues that the brain has evolved strategies for efficient coding by exploiting
the sparse structure available in the input. In particular, it optimizes the rep-
resentation of its visual environment by performing a sparse representation
of natural scenes in overcomplete, wavelet-like “bases” [137]. We continue on
this theme of sparse representation. Our problem is different, however: we
want to extract whole images (see also Donoho, [50]) from spatio-temporal
data, and not local filters, and the bases are time-courses (and spatial-maps)
in our case. Carroll et al. [30] use the elastic net algorithm for predic-
tive modelling of fMRI data with the aim to obtain interpretable activations.
The built-in sparsity regularization of the elastic net facilitates robust, i.e.
repeatable, results that are also localized in nature.
Sparsity can be modelled and imposed by choosing appropriate pri-
ors. Many authors (see for example [111], [154]) choose the Laplace or
bi-exponential prior, P(s) ∝ e^{−β|s|}, corresponding to an ℓ1 penalty on the
values of the latent variables (sources), based on convexity arguments. A
classical mathematical formulation of the sparse decomposition problem is
then to set up an optimization functional such as [112]

I = \|X - AS\|^2 + \lambda_S \sum_{l=1}^{L} \sum_{n=1}^{N} |s_{l,n}| + \lambda_A \|A\|^2 ,
containing an ℓ1 penalty/prior, enforcing sparse representations, where λS,
λA are regularization parameters. Several important results have been ob-
tained for sparse linear models. Donoho and Elad (2003) [51] give an elegant
proof of the equivalence of ℓ1- and ℓ0-norm minimization and the conditions under which it holds, using a deterministic approach. Under this framework, how-
ever, the signals have to be highly sparse (e.g. having less than five nonzero
entries) for arbitrary, non-orthogonal bases [111]. This can be too strict an
assumption for real-world signals. Li et al. (2004) extend their results from a
mathematical statistics perspective. However, an ℓ1–style regularization may
not always be flexible enough as a penalty in practice, as it is not adaptive
to the sparsity of the particular latent signals at hand. Adaptive estimation
may lead to very sparse representations. We discuss several alternatives in
section 5.2.2.
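For concreteness, one common way to descend a functional of this form is to alternate a soft-thresholding (ISTA-type) step in S, the proximal operator of the ℓ1 penalty, with a ridge-regularized least-squares update of A. The sketch below is illustrative only (synthetic data, hand-picked λ values and iteration count), not the algorithm developed in this chapter:

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding: the proximal operator of the l1 penalty."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# Illustrative alternating descent on
#   I = ||X - A S||^2 + lam_S * sum |s_ln| + lam_A * ||A||^2
# with synthetic data; lambda values and iteration count are hand-picked.
rng = np.random.default_rng(1)
T, N, L = 20, 100, 3
A_true = rng.normal(size=(T, L))
S_true = rng.normal(size=(L, N)) * (rng.random((L, N)) < 0.1)  # sparse maps
X = A_true @ S_true + 0.01 * rng.normal(size=(T, N))

lam_S, lam_A = 0.1, 0.1
A = rng.normal(size=(T, L))
S = np.zeros((L, N))
for _ in range(200):
    # ISTA step in S: gradient step on the quadratic term, then shrink.
    step = 1.0 / np.linalg.norm(A.T @ A, 2)
    S = soft(S - step * (A.T @ (A @ S - X)), step * lam_S)
    # Ridge-regularized least squares in A (from the lam_A ||A||^2 term).
    A = X @ S.T @ np.linalg.inv(S @ S.T + lam_A * np.eye(L))

residual = np.sum((X - A @ S) ** 2)
```

Note that λ_S and λ_A are fixed by hand here; the Bayesian treatment developed below instead learns the corresponding prior parameters from the data.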
When both the basis and the hidden signals are unknown, the problem becomes harder. Several dictionary-learning algorithms for sparse representation have been proposed around the 'FOCal Underdetermined System Solver' (FOCUSS) algorithm, leading to sparsity by optimizing the 'diversity' measure, which is the number of non-zero elements in a sequence. In particular,
Kreutz-Delgado et al. [100] improve on this idea and provide maximum likeli-
hood and maximum a-posteriori algorithms using the notion of Schur concav-
ity, i.e. concavity combined with permutation invariance. However, generally only a locally optimal solution can be obtained, as pointed out by Li et al. [111].
In Li et al. [112] learning of the basis, B = {a_l}, was performed as an ex-
ternal step, via the k–means algorithm. A solution was then obtained using
linear programming. Uncertainties in the model parameters and model selec-
tion issues were not given full consideration. More realistic models should
include a way for handling noise and uncertainty, however, and seamlessly
fuse information from other parts of the model. In this chapter we approach
the problem from a Bayesian perspective and propose a fully Bayesian hier-
archical model for sparse matrix factorization.
Bayesian inference provides a powerful methodology for machine learn-
ing by providing a principled method for using our domain knowledge and
allowing the incorporation of uncertainty in the data and the model into
the estimation process, thus preventing ‘overfitting’. The outputs of fully
Bayesian inference are posterior probability distributions over all variables
in the model.
A key property characterizing sparse mixtures is the concentration of
data points around the directions of the mixing vectors [176]. This phe-
nomenon is depicted in Fig. 5.2. It is this property that allows some algo-
rithms to exploit clustering in order to estimate those ‘unmixing’ directions
from the data (see e.g. Li et al. [111]). However, this property of the data
may not be immediately evident, as the sparsity of the observations often
only emerges in a different basis representation. In this text we consider
an extension to the view of analysis into components as data representa-
tion presented above by forming a representation to an intermediate, feature
space (Zibulevsky et al., [177]; Li et al., [111]; Turkheimer et al., [168]). This
is in essence a structural constraint, consistent with the Bayesian philoso-
Figure 5.2: Geometric interpretation of sparse representation. State-spaces and projections of two-dimensional datasets are shown. For visualization purposes, both observation and latent dimensionalities are equal to 2 in this figure. Left: non-sparse data. Right: sparse data. Sparse data mapped in the latent space produce a heavy-tailed distribution for both axes. (Figure from D. J. Field, Phil. Trans. R. Soc. Lond. A, 1999.)
phy of data analysis, in the sense that the features/building blocks (also
called ‘primitives’, in pattern recognition terminology) match our notion of
the desired properties of the sought-for signals. The model we consider here, therefore, is a sparse linear model in which the observations themselves are assumed to have a sparse representation in an appropriate signal dictionary. We derive
a “hybrid”, feature-space sparse component analysis model, transforming the
signals into a domain where the modeling assumption of sparsity of the ex-
pansion coefficients with respect to the dictionary is natural [176], [91]. The
above idea gives us the advantage to derive a component analysis algorithm
that enforces leptokurtic5 (also known as super-Gaussian) distributions on
5 The kurtosis of a probability density is a shape descriptor of the density and, in particular, a measure of its "peakedness". It can be defined in terms of its moments, as μ_4/(σ^2)^2, where μ_4 = E[(x − μ)^4], σ^2 = E[(x − μ)^2], and μ = E[x]. From this, the excess
the sources. Utilizing the central limit theorem, we can be sure that since
the observations (i.e. intermediate variables here) are described by highly-
peaked distributions centered at zero, the “true” sources will be described
by even more highly-peaked ones. In addition to the sparsity constraint
on the spatial maps, we will seek a parsimonious representation in the sense
that we will require the set of ‘active’ (relevant) regressors describing each
dataset to be minimal.
We derive a sparse decomposition algorithm viewing bilinear decomposi-
tion as a Bayesian generative model. In contrast to most component analysis
models for fMRI, where an unmixing matrix is learned and from there the
sources are deterministically estimated, our model solves a bilinear problem,
simultaneously performing inference and learning of the basis set and the
latent variables. By adopting a fully Bayesian approach, this component
analysis model is a probabilistic model proper: it describes the statistics of
the observation and source signals, instead of some instances of those sig-
nals. We employ hierarchical source and mixing models, which result in
automatic regularization. This allows us to perform model selection in order
to infer the complexity of the decomposition, as well as automatic denoising.
The model also contains an explicit noise model, making the relation of the
sources to the observations stochastic. The benefit of this is that obser-
vation noise is prevented from “leaking” into the estimated components, by
effectively utilizing an implicit filter automatically learned by the algorithm.
We follow a graphical modelling formalism viewing sparse matrix fac-
torization as a probabilistic generative model. Graphical models are repre-
sentations of probability distributions utilizing graph theory, with semantics
reflecting the conditional probabilities of sets of random variables.
In many modern real-world applications of machine learning, such as
kurtosis, μ_4/(σ^2)^2 − 3, can be computed. The excess kurtosis of the Gaussian distribution is zero. A distribution with positive excess kurtosis is called leptokurtic.
biosignal analysis or geoscientific data processing, the observation matrix often comprises very large, high-dimensional datasets, e.g. observation
dimensions on the order of a few hundreds and hundreds of thousands of
samples. This makes data decomposition/separation very computationally
demanding. A tradeoff should be made, therefore, between mathematical
elegance and computational efficiency. Here, we propose a practical model
under a fully Bayesian framework that captures the requirement of sparse
decomposition while providing analytic expressions for the estimation equa-
tions. Since exact inference and learning in such a model is intractable,
we follow a variational Bayesian approach in the conjugate-exponential fam-
ily of distributions, for efficient unsupervised learning in multi-dimensional
settings, such as fMRI6. The sparse bilinear decomposition algorithm also
provides an alternative to the LASSO [164] and to linear programming or gra-
dient optimisation techniques for blind linear inverse problems with sparsity
constraints.
This text is structured as follows. We first introduce SMF as a general
linear decompositional model for spatio-temporal data under view (1). Then
we present the hybrid wavelet component analysis model emphasizing view
(2) and various ways of modelling sparsity in a graphical modelling frame-
work, as well as the details of inference in this model. Finally, we present
some representative results followed by conclusions and discussion.
5.2 Bayesian Sparse Decomposition Model
We start by forming a representation to an “intermediate” space spanned
by a wavelet family of localized time-frequency atoms. The use of wavelets
in neuroimaging analysis has become quite widespread in recent years, due
6 See, however, the article of Woolrich et al. [171] for an alternative fully Bayesian algorithm, using MCMC sampling.
to the well-known property of wavelet transforms to form compressed, mul-
tiresolution representations of a very broad class of signals. Sparsity with
respect to a wavelet dictionary means that most coefficients will be “small”
and only a few of them will be significantly different from zero. Furthermore,
due to the excellent approximation properties of wavelets, in the standard
‘signal plus noise’ model, decomposing data in an appropriate dictionary will
typically result in large coefficients modelling the signal and small ones cor-
responding to noise. In addition, the wavelet transform largely decorrelates
the data. The above properties should be captured by the model.
Following Turkheimer et al. [168], we perform our wavelet analysis in the
spatial domain; for examples of the use of wavelets in the temporal dimension
see, for example, Bullmore et al. [21]. Roussos et al. [152], [153] and Flandin
and Penny [66] have also used spatial wavelets for decomposing fMRI data
using variational Bayesian wavelet-domain ICA and GLMs, respectively. The
representation of a signal, f , is formed by taking linear projections of the data
on the atoms, ψλ, of the dictionary, Φ, which, for the sake of simplicity, is
assumed for now to be orthonormal and non-redundant (consequently, Φ^{−1} = Φ^T). These projections are the inner products, that is, the correlations, of
the corresponding signals (functions) with each ψλ ∈ Φ, ∀λ. Then, the
wavelet expansion of the signal is f = Φw, where wλ = 〈f ,ψλ〉 is the λth
wavelet coefficient and 〈·, ·〉 here denotes the inner product. Wavelets form
unconditional bases7 for a class of smoothness function spaces called Besov
spaces. By transforming the signals in wavelet space, and regarding small
wavelet coefficients as “noise”, we essentially project signals in smoothness
space (Choi and Baraniuk, [34]). Using this representation for all signals
7 A basis {g_k} is unconditional if the series ∑_k w_k g_k converges unconditionally (for finite w_k). The unconditionality property allows us to ignore the order of summation.
in the model, we get the noisy observation equation in the wavelet domain
X = AC + E , (5.3)
where the matrix X = [x_t^T]_{t=1}^{T} denotes the transformed observations, C = [c_l^T]_{l=1}^{L} the (unknown) coefficients of the wavelet expansion of the latent signals [s_l^T]_{l=1}^{L}, and E ∼ N(0_{T×N}, (R^{−1}I_T, I_N)) is a Gaussian noise process. Now separation
will be performed completely in wavelet space. After inference, the compo-
nents will be transformed back in physical space, using the inverse transform.
Furthermore, we can use any of the standard wavelet denoising algorithms
in order to clean the signals. We note, however, that as we have an explicit
noise process in the component analysis model, small wavelet coefficients are
explained by the noise process rather than included in any reconstructed
source. We thus achieve a simultaneous source reconstruction and denoising.
5.2.1 Graphical modelling framework
The probabilistic dependence relationships in our model can be represented
in a directed acyclic graph8 (DAG) known as a Bayesian network. In this
graph, random variables are represented as nodes and structural relationships
between variables as directed edges connecting the corresponding nodes. In-
stantiated nodes appear shaded. Repetition over individual indices is denoted
by plates, shown as brown boxes surrounding the corresponding random vari-
ables. The graphical representation offers modularity in modelling and effi-
cient learning, as we can exploit the local, Markovian structure of the model,
captured in the network, as will be shown next.
DAGs are natural models for conditional independence. In particular,
8 A directed acyclic graph is a graph with no cycles. A cycle is a path, i.e. a sequence of nodes, in which the end points are allowed to be the same.
let the random variables in a probabilistic model belong to a set V. Our
goal is to compute the joint probability p(V). Each arc, which is an ordered
pair (v, u) ∈ E ⊆ V × V, represents a dependence relation between v and u,
denoted by an arrow: u −→ v. A graph, G, is then the pair (V, E). According
to our probabilistic specification, a variable v ∈ V may depend on a subset
of variables in V, called its ‘parents’ and denoted by pa(v) ⊆ V. The parent
set is the smallest set for which p(v|V \ v) = p(v|pa(v)). We endow the
DAG with probabilistic semantics by identifying the dependence relation
with the conditional probability p(v|pa(v)). Each individual parent-child
relation, defined on the ordered pairs (u, v) with u ∈ pa(v) is then an edge in
a Bayesian network. A probability density, p, on V is said to be compatible
with a Bayesian network if it satisfies all independence relationships among
nodes implied by the DAG. The important point here is that there is a natural
topological structure in V, induced by the dependence relation ‘v depends on
pa(v)’. This topological structure allows us to efficiently compute the joint
distribution, p(V), as a structured probability distribution on the DAG, G. In particular, the following directed Markov property holds:

p(V) = \prod_{v \in V} p(v \mid \mathrm{pa}(v)) , \qquad (5.4)
which allows us to factorize p(V) using only local relations: each random
variable “is influenced only by its parents”, or more formally, each node is
conditionally independent of its non-descendants given its parents. Note
that this factorization is exact and follows from the graphical model specifi-
cation; no simplifications were made here.
The generic graphical model for feature-space component analysis is shown in Fig. 5.3. The latent variables (the unknown coefficients of the "sources") are denoted by {c_{l,λ}}_{l=1}^{L} and the "observations" (i.e. the data transformed in
Figure 5.3: The probabilistic graphical model for feature-space component analysis. The bi-partite graph, {c_{l,λ}}_{l=1}^{L} → {x_{t,λ}}_{t=1}^{T}, where λ indexes the features, will play an essential role in this model. The brown rectangle represents a 'plate': it captures repetition over λ. The nodes {a_t}_{t=1}^{T}, where each a_t is a vector of the form a_t = (a_{1,t}, . . . , a_{l,t}, . . . , a_{L,t}), represent time-"slices" of the time-courses across all latent dimensions. Finally, R is the inverse noise level on the (transformed) observations.
feature space) by {x_{t,λ}}_{t=1}^{T}, where λ is a particular data point, λ = 1, . . . , Λ. The observations are modelled as a linear mixture of the sources, plus noise, via an unknown linear operator A (whose rows are a_1^T, . . . , a_t^T, . . . , a_T^T), such that x_{t,λ} ≈ ∑_{l=1}^{L} A_{t,l} c_{l,λ}.
A point worth emphasizing, which applies to all component analysis problems, is that the assumption of independence over the latent variables, l, common to many ICA algorithms, refers to the a-priori source distribution, i.e. the model density p(s), regardless of whether it is explicitly stated or not. The sources given the data, p(s|x), that is, after we have learned the model, will be coupled a-posteriori. This is the 'explaining-away' phenomenon, and it is probably best seen by observing the directed graph (Bayesian network) that corresponds to a component analysis model: see Fig. 5.3. The dependency properties (semantics) of this Bayesian network, i.e. the 'fan-in' structure, imply that, while one may choose a factorized, over l, prior model for the latent variables, p(s) ≜ ∏_l p_l(s_l), implying a-priori statistical independence, these become coupled, i.e. dependent, after we obtain the data: p(s|x) ≠ ∏_l p_l(s_l|x).
Since probabilities must sum to one, the sources {s_l}_{l=1}^{L} "compete" in or-
der to explain the data. While for some applications a factorized posterior
might be sufficient9, for other applications, such as extracting brain processes
from neuroimaging data, the independence assumption might not be valid.
Nevertheless, many ICA algorithms make this additional factorization (naive
mean-field) assumption anyway [102], [36], [128]. However, Ilin and Valpola
[93] discuss the consequences of this assumption. Therefore, in our model we
will employ a non-factorized posterior.
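The a-posteriori coupling is easy to exhibit in a toy Gaussian case (illustrative numbers only): two sources with independent N(0, 1) priors observed through x = s₁ + s₂ + noise acquire a negative posterior correlation, since either source can 'explain away' the observation.

```python
import numpy as np

# Two sources with independent standard-normal priors, s ~ N(0, I),
# observed through a single mixture x = s1 + s2 + noise (toy numbers).
noise_var = 0.1
H = np.array([[1.0, 1.0]])                 # mixing row: x = H s + e

# Gaussian posterior: precision = prior precision + H^T H / noise_var.
post_prec = np.eye(2) + (H.T @ H) / noise_var
post_cov = np.linalg.inv(post_prec)

# A-priori independent (diagonal covariance); a-posteriori anti-correlated:
# the sources "compete" to explain the single observation.
assert post_cov[0, 1] < 0
```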
The complete graphical model for VB-SMF is shown in Fig. 5.4. To fully specify the model, the probabilistic specification (priors) for all random variables in the model needs to be given. The learning algorithm then infers the wavelet coefficients, c_{l,λ}, ∀l, λ, and learns the time-courses, A_{t,l}, ∀t, l,
These four components of the model are shown as dotted boxes in Fig. 5.4.
5.2.2 Modelling sparsity: Bayesian priors and wavelets
Adaptive sparse prior model for the wavelet coefficients. The char-
acteristic shape of the typical empirical histogram of the wavelet coefficients
of the spatial maps, {c_{l,λ}}, is highly peaked at zero and heavy-tailed. We
want to capture this sparsity pattern of the coefficients in probabilistic terms.
A classical prior for wavelets is the Laplacian distribution, corresponding to
an ℓ1 norm, or the generalized exponential (also called generalized Gaussian)
distribution, corresponding to an ℓp norm [91]. Another sparse prior that
has been used is the Cauchy-Lorentz distribution10 [136]. In BIT-SCA, we
9 For example, one might claim that in the cocktail party problem the voices of the people are generated by passing a white noise process, generated from the lungs, through the vocal cords, independently modulating each individual's output signal.
10 The Cauchy distribution has the probability density function f(x; μ, γ) = 1/{πγ[1 + ((x − μ)/γ)²]}, where μ is the location parameter, specifying the location of the peak of the distribution, and γ is the scale parameter specifying the half-width at half-
Figure 5.4: Variational Bayesian Sparse Matrix Factorization graphical model. In this graph, nodes represent random variables and edges probabilistic dependence relationships between variables. Instantiated nodes appear shaded and nodes corresponding to hyperparameters are represented by smaller, filled circles. We can think of uninstantiated nodes as "uncertainty generators". Repetition over individual indices is denoted by plates, shown as brown boxes surrounding the corresponding variables. Each plate is labeled with its corresponding cardinality. If two or more plates intersect, the resulting tuple of indices is in the Cartesian product of their respective index sets. Finally, each module in the model is shown as a dotted box.
Figure 5.5: Sparse mixture of Gaussians (SMoG) prior for the coefficients c_{l,λ}. Blue/green curves: Gaussian components; thick black curve: mixture density, p(c_{l,λ}).
used the generalized exponential model for wavelet coefficients. While this is a flexible prior, especially under our model (where we estimate the hyperparameters from the data), it does not, unfortunately, lead to analytic expressions. Therefore, we cannot use it under the full VB framework.
Our aim is to model a wide variety of sparseness constraints in a tractable
(analytic) way and at the same time derive an efficient implementation of our
method. In order to achieve this, we use distributions from the conjugate-
exponential family of distributions. The conjugacy property means that the
chosen priors are conjugate to the likelihood terms, leading to posteriors that
have the same functional form as the corresponding priors. One option is to
use a ‘sparse Bayesian’ prior [166], which corresponds to a Student-t marginal
for each p(cl). However, this leads to an algorithm that requires O(LN) scale
parameters for the wavelet coefficients. In our implementation we enforce
sparsity by restricting the general mixture of Gaussians (MoG) model to be
a two-component, zero-mean mixture over each set {c_{l,λ}}_{λ=1}^{Λ}, for each l (Fig. 5.6). A two-state, zero-mean mixture model is an effective distribution for modelling sparse signals, both empirically and theoretically [34], [41].
maximum (HWHM).
Figure 5.6: A directed graphical model representation of a sparse mixture of Gaussian (SMoG) densities. In this model we put a separate MoG prior on the expansion coefficients of the lth spatial map, {c_{l,λ}}_λ. The plate over λ represents the collection {(c_{l,λ}, ξ_{l,λ})}_λ of pairs of state and wavelet-coefficient values for the coefficients of the lth source, for each "data point" λ. The plates over m represent the parameters of the set of M Gaussian components, θ_{c_l} = {μ_{l,m}, β_{l,m}}_m ∪ {π_{l,m}}_m, which collectively model the non-Gaussian density of the coefficients of the source l using Eq. (5.5). The means, μ_{l,m}, are non-zero for the scaling coefficients only; for the wavelet coefficients, we set μ_{l,m} ≐ 0.
The wavelet coefficients have respective state variables {ξ_{l,λ}}_{λ=1}^{Λ}. The parameters of the component Gaussians are the means μ_{l,m} and the precisions β_{l,m}, i.e. p(c_{l,λ}|ξ_{l,λ}, μ_{l,m}, β_{l,m}) = N(c_{l,λ}; μ_{l,m}, β_{l,m}^{−1}), and the associated mixing proportions of the MoG are π_{l,m}, where m denotes the hidden state of the mixture that models the lth source. These form a concatenated parameter vector θ_{c_l}, indexed by m = 1, . . . , M (with M ≐ 2 here). The mixture density of the expansion coefficients of the lth spatial map is then given by
p(c_{l,\lambda} \mid \theta_{c_l}) = \sum_{m=1}^{M} p(\xi_{l,\lambda} = m \mid \pi_l)\, p(c_{l,\lambda} \mid \xi_{l,\lambda}, \mu_l, \beta_l) . \qquad (5.5)
Since we are interested in sparsity, we set μ_{l,m} ≐ 0, ∀l, m, for the wavelet coefficients (but not for the scaling ones11), a-priori. The prior probability of the state variable ξ_{l,λ} being in the mth state is p(ξ_{l,λ} = m | π_l) = π_{l,m}. The
joint prior on the wavelet coefficients given ξ is therefore:
p(C \mid \xi, \mu, \beta) = \prod_{l=1}^{L} \left[ \prod_{\lambda=1}^{\Lambda} p(c_{l,\lambda} \mid \xi_{l,\lambda}, \mu_{l,m}, \beta_{l,m}) \right]
= \prod_{l=1}^{L} \left[ \prod_{\lambda=1}^{\Lambda} \sum_{m=1}^{M} \pi_{l,m}\, \mathcal{N}(c_{l,\lambda}; \mu_{l,m}, \beta_{l,m}) \right] \qquad (5.6)

and the prior on the hidden state variables, ξ = {ξ_l}_{l=1}^{L}, where ξ_l = (ξ_{l,1}, . . . , ξ_{l,Λ}), is

p(\xi = m \mid \pi) = \prod_{l,\lambda} p(\xi_{l,\lambda} = m \mid \pi_l) = \prod_{l,\lambda,m} \pi_{l,m} . \qquad (5.7)
The assignment of hyperpriors is very important for capturing the sparsity
constraint. The prior hyperparameters of the two components have zero mean, and hyperpriors over the precisions such that one component has a
low precision, the other a high precision. These correspond to the two states
of the wavelet coefficients, ‘large’ (carrying signal information) and ‘small’
(corresponding to “noise”). Figure 5.5 depicts this scheme. We assign a
Gaussian hyperprior on the position parameters μ_l,

p(\mu_l) = \prod_{m=1}^{M} \mathcal{N}\!\left(\mu_{l,m};\, m_{\mu_{l,0}},\, v_{\mu_{l,0}}^{-1}\right), \qquad (5.8)

a Gamma on the scale parameters β_l,

p(\beta_l) = \prod_{m=1}^{M} \mathrm{Ga}\!\left(\beta_{l,m};\, b_{\beta_{l,0}},\, c_{\beta_{l,0}}\right) , \qquad (5.9)
11 The scaling coefficients can be thought of as the result of a low-pass type filtering and represent a smoothed version of the original signal. These are not necessarily sparse. Therefore, we allow a general MoG prior for them. Another option is to put a flat prior on them.
where we use the parameterization x \sim \mathrm{Ga}(x; b, c) = \frac{1}{\Gamma(c)\, b^{c}}\, x^{c-1} e^{-x/b}, and a Dirichlet on the mixing proportions π_l,

p(\pi_l) = \prod_{m=1}^{M} \mathrm{Di}(\pi_{l,m}; \alpha_{\pi_{l,0}}) . \qquad (5.10)
Note that the sparse MoG (SMoG) model parameters are not fixed in
advance, but rather they are automatically learned from the data, adapting
to the statistics of the particular spatial maps. In particular, the parameters
β are not fixed but are allowed to range within the “fuzzy” ranges (‘small’,
‘large’) based on the definition of their hyperparameters12 (bβ, cβ). Using
this flexible model we can adapt to various degrees of sparsity and different
shapes of prior.
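The two-state prior of Eqs. (5.5)–(5.10) can be sketched numerically. The density routine below follows the mixture density of Eq. (5.5); the particular mixing proportions and precisions are illustrative values only, not quantities estimated in this work:

```python
import numpy as np

def smog_density(c, pi, beta, mu=None):
    """Mixture-of-Gaussians density p(c) = sum_m pi_m N(c; mu_m, beta_m),
    with beta_m a precision (inverse variance), as in Eq. (5.5)."""
    pi, beta = np.asarray(pi, float), np.asarray(beta, float)
    mu = np.zeros_like(pi) if mu is None else np.asarray(mu, float)
    c = np.atleast_1d(np.asarray(c, float))[:, None]           # (n, 1)
    comps = np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (c - mu) ** 2)
    return (pi * comps).sum(axis=1)

# Illustrative two-state sparse prior: a 'small' (high-precision) and a
# 'large' (low-precision) zero-mean component; the numbers are made up.
pi = [0.9, 0.1]
beta = [100.0, 1.0]
dens = smog_density([0.0, 3.0], pi, beta)
```

The high-precision component concentrates most of the mass near zero, producing the sharp peak and heavy tails characteristic of a sparse prior.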
5.2.3 Hierarchical mixing model and Automatic Relevance Determination

For the entries of the factor A we choose a hierarchical prior that will allow us to select the most relevant column vectors to explain a particular data set, X. In particular, the prior over the timecourses, {a_l}_{l=1}^{L}, is a zero-mean Gaussian with a different precision hyperparameter α_l over the lth column vector, a_l: p(a_l | α_l) = N(a_l; 0_{T×1}, α_l^{−1} I_T), for l = 1, …, L. Therefore, the joint prior over the bases is

    p({a_l}_{l=1}^{L} | α) = ∏_{l=1}^{L} p(a_l | α_l) = ∏_{l=1}^{L} N(A_{:,l}; 0_{T×1}, α_l^{−1} I_T) ,   (5.11)

where A_{:,l} denotes the lth column vector of entries for the time-points t = 1, …, T, and the components of the precision α = (α_1, …, α_l, …, α_L), corresponding to the column vectors [a_l]_{l=1}^{L} of the mixing matrix (and, consequently, to the spatial maps {s_l}_{l=1}^{L}), are the same for all entries: α_{t,l} = α_l, ∀t. The prior over each α_l is in turn a Gamma distribution, p(α_l) = Ga(α_l; b_{α_l}, c_{α_l}). This hierarchical prior leads to a sparse marginal distribution for {a_l}_{l=1}^{L} (a Student-t, which can be shown if one integrates out the precision hyperparameter, α_l). By monitoring the evolution of the α_l, the relevance of each time-course may be determined; this is referred to as Automatic Relevance Determination (ARD) [115]. This allows us to infer the complexity of the decomposition and obtain a sparse matrix factorization in terms of the time-courses as well, by suppressing irrelevant sources.

¹²Also note that for b_{β,0} → ∞ and c_{β,0} → 0, the Gamma prior becomes the uninformative prior p(β) ∝ 1/β.
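The heavy-tailed (Student-t) marginal implied by the Gamma-precision hierarchy can be illustrated by a quick Monte-Carlo check (this is an illustration, not part of the inference itself; the hyperparameter values are arbitrary):

```python
import numpy as np

# Sample a_l | alpha_l ~ N(0, 1/alpha_l) with alpha_l ~ Ga(b, c) integrated
# out by sampling; the marginal is then Student-t with 2c degrees of freedom.
rng = np.random.default_rng(0)
b0, c0 = 2.0, 5.0                                     # illustrative scale/shape
alpha = rng.gamma(shape=c0, scale=b0, size=200_000)   # precisions
a = rng.normal(0.0, 1.0 / np.sqrt(alpha))             # mixing coefficients

def excess_kurtosis(x):
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0

k = excess_kurtosis(a)   # > 0: heavier tails than a Gaussian
```

A positive excess kurtosis confirms that the marginal is super-Gaussian, which is what allows ARD to drive irrelevant columns towards zero while leaving a few large ones.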
5.2.4 Noise level model

Finally, the prior distribution over the precision, R = R I_T, of the noise process, E (i.e. the inverse of the noise level), is a gamma density with parameters (b_R, c_R):

    p(R | b_R, c_R) = Ga(R; b_R, c_R) = (1 / (Γ(c_R) (b_R)^{c_R})) R^{c_R−1} e^{−R/b_R} .   (5.12)
5.3 Approximate Inference by Free Energy Minimization

5.3.1 Variational Bayesian Inference

In the section above, we stated a generative model for sparse bilinear decomposition. Let us now collect all unknowns in the set U = {π, µ, β, ξ, C, α, A, R}. Exact Bayesian inference in such a model is extremely computationally intensive and often intractable, since we need, in principle, to perform integration in a very high dimensional space in order to obtain the joint posterior of U given the data, p(U | X). Instead, we will use the variational Bayesian (VB) framework [7] for efficient approximate inference in high-dimensional settings, such as fMRI. The idea in VB is to approximate the complicated exact posterior with a simpler approximate one, Q(U), that is closest to p(U | X) in an appropriate sense, in particular in terms of the Kullback-Leibler (KL) divergence. The optimization functional in this case is the (negative) variational free energy of the system:

    F(Q, X) = ⟨log p(X, U)⟩ + H[Q(U)] ,   (5.13)

where the average, ⟨·⟩, in the first term (the negative variational 'energy') is over the variational posterior, Q(U), and the second term is the entropy of Q(U). The negative free energy forms a lower bound to the Bayesian log evidence, i.e. the marginal likelihood of the observations. This is the probability of the observations being instantiated at a particular value, after all latent variables and parameters in the model have been integrated out. Maximizing the bound minimizes the "distance" between the variational and the true posterior. Using this bound we derive an efficient, variational Bayesian algorithm for learning and inference in the model that provides posterior probability distributions over all random variables of the system, i.e. the source densities, mixing matrix, source model parameters, and noise level.

We choose to restrict the variational posterior Q(U) to belong to a class of distributions that are factorized over subsets of the ensemble of variables:

    Q(U) = ( ∏_{λ=1}^{Λ} Q(c_λ) Q(ξ_λ) ) ( Q(π) Q(µ) Q(β) ) ( Q(A) Q(α) ) Q(R) .   (5.14)
The above factorization incorporates the standard variational approximation between the latent variables and parameters,

    p({c_λ, ξ_λ}_{λ=1}^{Λ}, θ | X) ≈ Q(θ) ∏_{λ=1}^{Λ} Q(c_λ, ξ_λ) .   (5.15)

That is, the distribution of latent variables factorizes across the λ–plate. However here, unlike e.g. [152] or [149], we will employ variational posteriors that are coupled across latent dimensions for the wavelet coefficients of the spatial maps and the time-courses¹³.

Performing functional optimization with respect to the distributions of the unknown variables, ∂F(Q, X)/∂Q(u) = 0, for all elements u of U, we obtain the optimal form for the posterior:

    u ∈ U :  Q(u) ∝ exp( ⟨log p(X, U)⟩_{Q(U∖u)} ) .   (5.16)

This results in a system of coupled equations, which are solved in an iterative manner. Theoretical results [7] show that the algorithm converges to a (local) optimum. Since we have chosen to work in the conjugate-exponential family, the posterior distributions have the same functional form as the priors, and the update equations are essentially "moves", in parameter space, of the parameters of the priors due to observing the data. It turns out that the structure of the equations is such that, due to the Markovian structure of the DAG, only information from the local "neighborhood" of each node is used. Exploiting the graphical structure of the model, the resulting update equations depend only on posterior averages over nodes belonging to the Markov blanket of each node. This is the set comprising the parents, the children, and the co-parents of the particular node at hand.

¹³ICA decompositions can also be sparse, if sparsity-enforcing source models are used. However, the SCA decompositions presented here are non-independent. (We have already shown the difference between a-priori and a-posteriori coupling in the bi-partite graphical model for CA.)
We next show the update equations for the wavelet coefficients of the spatial
maps and the time-courses.
5.3.2 Posteriors for the variational Bayesian sparse matrix factorization model

Inferring the wavelet coefficients of the spatial maps, C

Equation (5.3) combined with Eq. (5.15) leads to a system of equations, one for each column, λ, of length L, linking the data and source vectors:

    x_λ = A c_λ + ε_λ = Σ_l c_{l,λ} a_l + ε_λ ,  λ = 1, …, Λ = |Φ| ,   (5.17)

where c_λ is the vector of the expansion coefficients, c_λ = (c_{1,λ}, …, c_{L,λ}). This vector is represented as the second layer of nodes in the graphical model of Fig. 5.3. In the (deterministic) solution of the sparse model of Li et al. [111] via linear programming, this has the rather convenient consequence that we only need to solve N smaller linear programming problems in order to estimate the sparse coefficients. An analogous result is obtained in the variational Bayesian model as well. In particular, the variational posterior has a Gaussian functional form, N(C; µ_{L×Λ}, β_{Λ×L×L}), with mean and precision parameters for the λth wavelet coefficient vector, c_λ, given by:

    µ_λ = (β_λ)^{−1} [ µ̃_λ + ⟨A^T⟩ ⟨R⟩ x_λ ]   (5.18)

and

    β_λ = diag(β̃_λ) + ⟨A^T R A⟩ .   (5.19)
The quantities µ̃_λ and β̃_λ are 'messages' sent to the node c_λ by its parents and are computed by

    µ̃_{l,λ} = Σ_{m=1}^{M} γ_{lλm} ⟨β_{lm}⟩ ⟨µ_{lm}⟩ ,   β̃_{l,λ} = Σ_{m=1}^{M} γ_{lλm} ⟨β_{lm}⟩ .   (5.20)

The weighting coefficient γ_{lλm}, called 'responsibility', encodes the probability of the mth Gaussian kernel generating the λth wavelet coefficient of the lth spatial map. It is defined as the posterior probability of the state variable ξ_{l,λ} being in the mth state:

    γ_{lλm} ≡ Q(ξ_{lλ} = m) ,  m = 1, …, M .   (5.21)

This will be given in Eq. (5.25).
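A minimal sketch of the coefficient update (5.18)–(5.20), assuming the posterior expectations ⟨A⟩, ⟨AᵀRA⟩, ⟨R⟩, ⟨β⟩, ⟨µ⟩ and the responsibilities have already been computed by the other factors; the array shapes and names are ours, not the thesis implementation:

```python
import numpy as np

def update_coefficients(x_lam, EA, EAtRA, ER, gamma_lam, Ebeta, Emu):
    """Variational Gaussian posterior for one coefficient vector c_lambda,
    Eqs. (5.18)-(5.20).
      x_lam     : (T,)   data column in the wavelet domain
      EA        : (T, L) <A>
      EAtRA     : (L, L) <A^T R A>, treated as given here
      ER        : scalar <R> (isotropic noise precision)
      gamma_lam : (L, M) responsibilities for this lambda
      Ebeta, Emu: (L, M) <beta_lm>, <mu_lm>
    Returns the posterior mean (L,) and precision matrix (L, L)."""
    beta_msg = (gamma_lam * Ebeta).sum(axis=1)            # (L,)  Eq. (5.20)
    mu_msg = (gamma_lam * Ebeta * Emu).sum(axis=1)        # (L,)  precision-weighted mean
    prec = np.diag(beta_msg) + EAtRA                       # Eq. (5.19)
    mean = np.linalg.solve(prec, mu_msg + ER * (EA.T @ x_lam))  # Eq. (5.18)
    return mean, prec
```

Because the likelihood couples the L coefficients of each column through ⟨AᵀRA⟩, the posterior precision is a full L×L matrix rather than a diagonal one.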
The rest of the update equations for the posteriors of the sparse MoG model, {ξ, µ, β, π}, take a standard form and are shown next, in Eqns (5.25) to (5.42).
Learning the time-courses and their precisions, A, α

The variational posterior over the matrix of time-courses, A_{T×L}, is a product of Gaussians with mean and precision parameters for the tth row of A given by

    a_t = [ 0_{1×L} + ⟨R⟩ ( Σ_{λ=1}^{Λ} ⟨x_{tλ} c_λ^T⟩ ) ] (Γ_{a_t})^{−1} ,   (5.22)

    Γ_{a_t} = diag(⟨α⟩) + ⟨R⟩ ( Σ_{λ=1}^{Λ} ⟨c_λ c_λ^T⟩ ) .   (5.23)

The variational posterior of the precisions α = (α_l) is given by a product of Gamma distributions, α_l ∼ Ga(α_l; b_{α_l}, c_{α_l}), with variational parameters

    b_{α_l} = ( 1/b_{α_0} + (1/2) Σ_{t=1}^{T} ⟨A_{tl}²⟩ )^{−1} ,   c_{α_l} = c_{α_0} + (1/2) T ,   (5.24)

for the lth column of A. The posterior means of the precisions α_l, ⟨α_l⟩ = b_{α_l} c_{α_l}, are a measure of the relevance of each timecourse, a_l.
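The time-course and ARD updates (5.22)–(5.24) can be sketched as follows, again assuming the required posterior expectations are supplied from the other factors (shapes and names are ours):

```python
import numpy as np

def update_timecourses(X, ER, Ec, Ecc, Ealpha):
    """Row-wise Gaussian posterior for A, Eqs. (5.22)-(5.23).
      X      : (T, Lam) data in the wavelet domain
      ER     : scalar <R>
      Ec     : (L, Lam) posterior means <c_lambda>
      Ecc    : (L, L)   sum_lambda <c_lambda c_lambda^T>
      Ealpha : (L,)     <alpha_l>
    The precision Gamma_a is the same for every row t."""
    Gamma_a = np.diag(Ealpha) + ER * Ecc                 # Eq. (5.23)
    EA = np.linalg.solve(Gamma_a, ER * (Ec @ X.T)).T     # (T, L) means, Eq. (5.22)
    return EA, Gamma_a

def update_alpha(EA2_colsum, b0, c0, T):
    """ARD precision update, Eq. (5.24), from sum_t <A_tl^2>."""
    b = 1.0 / (1.0 / b0 + 0.5 * EA2_colsum)
    c = c0 + 0.5 * T
    return b, c, b * c                                   # <alpha_l> = b_l c_l
```

In the noiseless, well-conditioned limit (broad ARD prior) the row update reduces to the least-squares regression of the data onto the coefficient means.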
Inferring the states, ξ, of the wavelet coefficients and learning the parameters of the wavelet model, θ_C

The responsibilities, γ_{lλm}, i.e. the posterior probabilities that the wavelet coefficient c_{l,λ} was produced by the mth Gaussian component (captured by the random event ξ_{l,λ} = m, where ξ_{l,λ} is the state of the wavelet coefficient), γ_{lλm} = Q(ξ_{l,λ} = m), are given by

    γ_{lλm} = γ̃_{lλm} / Z_{θ_{c_l}} ,   (5.25)

where

    γ̃_{lλm} = π̃_{l,m} · [ β̃_{l,m}^{1/2} exp( −(⟨β_{l,m}⟩/2) ⟨(c_{lλ} − µ_{l,m})²⟩_{Q(c_l, µ_l)} ) ]   (5.26)

and

    Z_{θ_{c_l}} = Σ_{m′} γ̃_{lλm′} ,   (5.27)

ensuring that Q(ξ_{l,λ}) is a properly scaled probability density. The tilded quantities in the above equations are the exponential parameters

    π̃_{lm} = e^{⟨log π_{lm}⟩_Q}   (5.28)

and

    β̃_{lm} = e^{⟨log β_{lm}⟩_Q} .   (5.29)

These are computed by

    e^{⟨log π_{l,m}⟩_Q} = exp( Ψ(α_{π_{l,m}}) − Ψ( Σ_{m′} α_{π_{l,m′}} ) )   (5.30)

and

    e^{⟨log β_{l,m}⟩_Q} = b_{β_{l,m}} exp( Ψ(c_{β_{l,m}}) ) ,   (5.31)

where Ψ(·) is the digamma function, Ψ(x) ≜ (d/dx) log Γ(x) = Γ′(x)/Γ(x). Note that in equation (5.26), the first factor comes from the parents, {π_{l,m}}_{m=1}^{M}, of the node ξ_{l,λ}, and is related to its prior, while the second factor (in square brackets) comes from the children, and is related to the 'likelihood'. In particular, we can interpret the above expression as the node π_{l,m} "sending a message" π̃_{l,m} to the state variable, ξ_{l,λ}, while the children and co-parents send the expression in brackets.
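The responsibility computation (5.25)–(5.31) can be sketched as below; we work in the log domain and subtract the per-λ maximum before exponentiating, a standard numerical-stability device not spelled out in the equations (array shapes are our convention):

```python
import numpy as np
from scipy.special import digamma

def update_responsibilities(alpha_pi, b_beta, c_beta, Esq_dev):
    """Responsibilities gamma_{l,lambda,m}, Eqs. (5.25)-(5.31).
      alpha_pi       : (L, M) Dirichlet posterior parameters
      b_beta, c_beta : (L, M) Gamma posterior parameters of the precisions
      Esq_dev        : (L, Lam, M) <(c_{l,lambda} - mu_{l,m})^2> under Q
    Returns gamma with shape (L, Lam, M), normalized over m."""
    log_pi_t = digamma(alpha_pi) - digamma(alpha_pi.sum(axis=1, keepdims=True))  # Eq. (5.30)
    log_beta_t = np.log(b_beta) + digamma(c_beta)                                # Eq. (5.31)
    Ebeta = b_beta * c_beta                                                      # <beta_lm>
    log_g = (log_pi_t + 0.5 * log_beta_t)[:, None, :] \
            - 0.5 * Ebeta[:, None, :] * Esq_dev                                  # Eq. (5.26), log form
    log_g -= log_g.max(axis=2, keepdims=True)        # stabilize before normalizing
    g = np.exp(log_g)
    return g / g.sum(axis=2, keepdims=True)          # Eqs. (5.25), (5.27)
```

As expected of the two-state prior, coefficients close to zero are claimed by the high-precision 'small' state and large coefficients by the low-precision 'large' state.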
The wavelet model parameters, θ_C = {π, β, µ}, are updated according to the following update equations for their corresponding posteriors:

• Variational posterior of the mixing proportions, Q(π):

    α_{π_{l,m}} = α_{π_l,0} + Λ_{l,m} ,   (5.32)

where Λ_{l,m} is given below, in Eq. (5.39).

• Variational posterior of the means, Q(µ): For the scaling coefficients, we have

    m_{µ_{l,m}} = (v_{µ_{l,m}})^{−1} [ v_{µ_l,0} m_{µ_l,0} + v_{c_{l,m}} m_{c_{l,m}} ] ,   (5.33)

    v_{µ_{l,m}} = v_{µ_l,0} + v_{c_{l,m}} ,   (5.34)

where the "data-dependent" (children) part, that is, the information coming from the wavelet coefficients c_l, is computed as

    m_{c_{l,m}} = c̄_{l,m} / π̄_{l,m} ,   (5.35)

    v_{c_{l,m}} = Λ_{l,m} ⟨β_{l,m}⟩ .   (5.36)

For the wavelet coefficients, we set µ ≐ 0. This enforces the sparsity constraint as well.

• Variational posterior of the precisions, Q(β):

    b_{β_{l,m}} = ( 1/b_{β_l,0} + (1/2) Λ σ̄²_{c_{l,m}} )^{−1} ,   (5.37)

    c_{β_{l,m}} = c_{β_l,0} + (1/2) π̄_{l,m} Λ .   (5.38)
The above update equations make use of the following responsibility-weighted posterior statistics of the set of wavelet coefficients c_l = {c_{l,λ}}:

    π̄_{l,m} = Λ_{l,m} / Λ = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ,  or  Λ_{l,m} = π̄_{l,m} Λ ,   (5.39)

where Λ_{l,m} can be interpreted as a pseudo-count of the number of wavelet bases that can be attributed to the mth Gaussian kernel, and

    c̄_{l,m} = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ⟨c_{l,λ}⟩ ,   (5.40)

    c̄²_{l,m} = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ⟨c_{l,λ}²⟩ ,   (5.41)

    σ̄²_{c_{l,m}} = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ⟨(c_{l,λ} − µ_{l,m})²⟩ ,   (5.42)

where c̄_{l,m}, c̄²_{l,m}, σ̄²_{c_{l,m}} can be interpreted as the contribution of component m to the average, second moment, and centered second moment (variance) of the source l, and π̄_{l,m}, Λ_{l,m} are the proportion and number of samples that are attributed to component m of source l, respectively.

From the above equations we can see that there is a natural hierarchy in the computations of the update equations. Note first that the responsibility-weighted statistics, equations (5.39)–(5.42) above, are equivalent to the maximum-likelihood estimates of the mixture model via the EM algorithm, the difference being that here we use averages of c_{lλ} with respect to its variational posterior. Referring to the graphical model for the SMoG, Fig. 5.6, these are the "inputs" that are sent from the children nodes, {c_{lλ}}_{λ=1}^{Λ}, to the MoG model for the lth source, and they are the sufficient statistics, {(⟨c_{lλ}⟩, ⟨c_{lλ}²⟩)}_{λ=1}^{Λ}.
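The statistics (5.39)–(5.42) for one source can be sketched as follows; in Eq. (5.42) we expand ⟨(c − µ)²⟩ with µ plugged in at its posterior mean, i.e. the posterior variance of µ is ignored in this sketch:

```python
import numpy as np

def weighted_stats(gamma, Ec, Ec2, Emu):
    """Responsibility-weighted statistics of Eqs. (5.39)-(5.42) for source l.
      gamma   : (Lam, M) responsibilities
      Ec, Ec2 : (Lam,)   <c_lambda>, <c_lambda^2>
      Emu     : (M,)     <mu_lm> (posterior variance of mu neglected here)"""
    Lam = gamma.shape[0]
    pi_bar = gamma.mean(axis=0)                          # Eq. (5.39)
    N_m = pi_bar * Lam                                   # pseudo-counts Lambda_{l,m}
    c_bar = (gamma * Ec[:, None]).mean(axis=0)           # Eq. (5.40)
    c2_bar = (gamma * Ec2[:, None]).mean(axis=0)         # Eq. (5.41)
    # Eq. (5.42) via <(c - mu)^2> ~= <c^2> - 2 mu <c> + mu^2
    var_bar = c2_bar - 2 * Emu * c_bar + Emu ** 2 * pi_bar
    return pi_bar, N_m, c_bar, c2_bar, var_bar
```

With uniform responsibilities and zero means the weighted variance reduces to the weighted second moment, as it should.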
Learning the noise model parameter, R

Finally, the noise precision has a Gamma posterior distribution, R ∼ Ga(R; b_R, c_R), with hyperparameters

    b_R = [ 1/b_{R_0} + (1/2) ⟨ tr( (X − AC)(X − AC)^T ) ⟩ ]^{−1} ,   c_R = c_{R_0} + (1/2) TΛ .   (5.43)

The node R receives exactly TΛ messages from its children and co-parents, as expected.
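The noise update (5.43) can be sketched with a plug-in residual, ⟨tr((X−AC)(X−AC)ᵀ)⟩ ≈ ‖X − ⟨A⟩⟨C⟩‖²_F; the posterior covariance corrections of A and C are deliberately omitted from this sketch:

```python
import numpy as np

def update_noise(X, EA, EC, b0, c0):
    """Gamma posterior of the noise precision R, Eq. (5.43).
    Uses the plug-in residual ||X - <A><C>||_F^2; the covariance terms of
    A and C that appear in the full expectation are omitted here."""
    T, Lam = X.shape
    resid = np.sum((X - EA @ EC) ** 2)
    b = 1.0 / (1.0 / b0 + 0.5 * resid)
    c = c0 + 0.5 * T * Lam
    return b, c, b * c      # <R> = b c
```

With a perfect fit the scale parameter stays at its prior value while the shape grows by TΛ/2, one half-count per observed entry, matching the "TΛ messages" remark above.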
5.3.3 Free Energy

From the definition of the negative free energy, we have

    F = ⟨log p(U, X)⟩_{Q(U)} + H[Q(U)] ,

where the variables (U, X) comprise the node-set V(G) of the graph and the set of unknowns is comprised of the latent variables and parameters, U = Y ∪ θ. Using the standard variational factorization,

    Q(U) = Q(Y) Q(θ) ,

and the Markovian structure of the graph, G, with node-set V(G), where each node, v, has a probability distribution conditioned on its parents,

    G → (p, G)  ⇒  p(V(G)) = ∏_{v∈V} p(v | pa(v))   (Markov) ,

the free energy takes the following equivalent forms:

    F = ⟨log p(X, Y | θ)⟩_{Q(Y,θ)} + ⟨log p(θ)⟩_{Q(θ)} + H[Q(Y)] + H[Q(θ)]   (5.44)

and

    F = ⟨log p(X, Y | θ)⟩_{Q(Y,θ)} + H[Q(Y)] − KL(Q(θ) ∥ p(θ)) ,   (5.45)

formulated in terms of the entropies of the latent variables and the KL divergences of the parameters. In a Bayesian network formulation of a probabilistic model, each node contributes a term to F.
Applying the above to the Bayesian network of the VB-SMF model, shown in Fig. 5.4, we obtain the negative free energy of our model. Note that, after the algorithm has converged, several terms cancel out (see, for example, [37]). The NFE is then:

    F = F_{(X,R)} + F_{(C,ξ)} + F_{(A,α)} + F_{SMoG} ,   (5.46)

where

    F_{(X,R)} = −(1/2) TΛ log 2π + T [ log( Γ(c_R) / Γ(c_R^0) ) + c_R log b_R − c_R^0 log b_R^0 ]   (5.47)

    F_{(C,ξ)} = Σ_{λ=1}^{Λ} [ (1/2) L − (1/2) log|β_λ| ] − Σ_{l=1}^{L} Σ_{λ=1}^{Λ} Σ_{m=1}^{M} γ_{l,λ,m} log γ_{l,λ,m}   (5.48)

    F_{(A,α)} = Σ_{t=1}^{T} [ (1/2) L − (1/2) log|Γ_{a_t}| ] + Σ_{l=1}^{L} [ log( Γ(c_{α_l}) / Γ(c_α^0) ) + c_{α_l} log b_{α_l} − c_α^0 log b_α^0 ]   (5.49)

    F_{SMoG} = Σ_{l=1}^{L} Σ_{m=1}^{M} [ log( Γ(c_{β_{l,m}}) / Γ(c_{β_l}^0) ) + c_{β_{l,m}} log b_{β_{l,m}} − c_{β_l}^0 log b_{β_l}^0 ]   (5.50)
             + Σ_{l=1}^{L} [ Σ_{m=1}^{M} log( Γ(α_{π_{l,m}}) / Γ(α_{π_l}^0) ) − log( Γ(Σ_{m=1}^{M} α_{π_{l,m}}) / Γ(Σ_{m=1}^{M} α_{π_l}^0) ) ] .   (5.51)
The negative free energy is used, apart from deriving the update equations,
to monitor the convergence of the algorithm.
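Since the negative free energy increases monotonically under the variational updates, a simple stopping rule on its trace suffices; the relative tolerance below is an arbitrary choice:

```python
def converged(F_trace, tol=1e-4):
    """Convergence check on the negative-free-energy trace: stop when the
    relative change between successive iterations falls below tol."""
    if len(F_trace) < 2:
        return False
    prev, curr = F_trace[-2], F_trace[-1]
    return abs(curr - prev) <= tol * abs(prev)
```

In practice one also asserts that the trace never decreases: a decrease signals a bug in one of the update equations.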
5.4 Results

In this Section we present results from using three real fMRI data sets: (1) a single subject (epoch) block-design auditory dataset¹⁴, (2) a block-design auditory-visual data set¹⁵, and (3) the block-design visual dataset used by Friston, Jezzard, and Turner in [71] and described in Sect. 3.4.3. These data sets were used to compare the performance of our model and conventional methods. In particular, results from GLMs were used as the 'gold standard', providing the reference function for the time-courses, while the Probabilistic ICA model of Beckmann and Smith [15] was used as the state-of-the-art for "model-free" analysis.

¹⁴From G. Rees, K. Friston and the FIL methods group, Wellcome Trust Centre for Neuroimaging, University College London (www.fil.ion.ucl.ac.uk/spm/data/auditory/).
5.4.1 Auditory stimulus data set

The task and data acquisition are described next. Auditory stimulation was with bi-syllabic words presented binaurally at a rate of 60/min. The condition for successive blocks alternated between rest and auditory stimulation, starting with rest. Whole-brain BOLD/EPI images were acquired on a modified 2.0T Siemens MAGNETOM Vision system and each acquisition consisted of 64 contiguous slices of size 64 × 64 voxels with voxel size 3mm × 3mm × 3mm. A total of 96 acquisitions were made in blocks of 6, with the scan-to-scan repeat time set to 7s (TR = 7s), giving 16 blocks of 42s each. For our analysis, the first four scans were discarded as "dummy" scans, due to T1 effects, and from the rest, 84 scans were used.
We ran the variational Bayesian sparse matrix factorization (VB-SMF) model on the dataset in order to detect 'consistently task related' ("predictable") spatio-temporal components (CTRs). Fig. 5.7 shows the spatial map and corresponding time-course (red curve), which was selected as the column of the mixing matrix that best matched the expected time-course (black curve). The latter was constructed by convolving the experimental design with the canonical haemodynamic response function from SPM¹⁶. The figure shows that the model correctly identified the neural activation in the auditory area. The result from VB-SMF was compared with applying the Probabilistic ICA model on the same data set (green curve). It can be seen that the VB-SMF better matches the expected timecourse. In particular, the correlation coefficients were r_{VB−SMF} = 0.801 and r_{PICA} = 0.778, respectively.

[Figure 5.7: Time course (left) and corresponding spatial map (right) resulting from applying the variational Bayesian sparse matrix factorization model to the auditory fMRI data set of Rees and Friston (www.fil.ion.ucl.ac.uk/spm/data/auditory/). Timecourses: red curve: our model; green curve: PICA; black curve: reference curve from GLM. The heights of the timecourses on the y-axis were normalized to best match the GLM reference function and range from −1 to 1, approximately. The spatial map is the raw result from the model (no thresholding post-processing was performed). Image intensities range from 0 to about 20.]

¹⁵From the Oxford Centre for Functional MRI of the Brain (www.fmrib.ox.ac.uk/fsl/fsl/feeds.html).
¹⁶This has the general form (Friston, [70])

    h(t) = α (t/d₁)^{a₁} e^{−(t−d₁)/b₁} − c (t/d₂)^{a₂} e^{−(t−d₂)/b₂} ,

and is designed such that it takes various neuronal parameters into account (such as the onset, the length of the kernel, the delay of response relative to the onset, the dispersion of response, the delay of undershoot relative to the onset, the dispersion of undershoot, and the ratio of response to undershoot) in a mathematically convenient way. The default parameters (0, 32, 6, 1, 16, 1, 6) were used here.
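For illustration, the double-gamma form in the footnote can be implemented directly. The mapping of the default parameter vector onto (d₁, b₁, d₂, b₂, ratio) and the choice a_i = d_i/b_i are our assumptions, and the overall α scaling is dropped (it is removed by the normalization described in the figure caption anyway):

```python
import numpy as np

def canonical_hrf(t, d1=6.0, b1=1.0, d2=16.0, b2=1.0, ratio=6.0):
    """Double-gamma HRF of the footnote's form,
       h(t) = (t/d1)^{a1} e^{-(t-d1)/b1} - (1/ratio)(t/d2)^{a2} e^{-(t-d2)/b2},
    with a_i = d_i/b_i (assumed mapping to the default parameters)."""
    t = np.asarray(t, float)
    a1, a2 = d1 / b1, d2 / b2
    peak = (t / d1) ** a1 * np.exp(-(t - d1) / b1)
    under = (t / d2) ** a2 * np.exp(-(t - d2) / b2)
    return np.where(t > 0, peak - under / ratio, 0.0)

# Expected CTR timecourse: convolve a boxcar design with h (illustrative
# reconstruction of the rest/stimulation block structure, 6 scans per block).
TR, n_scans = 7.0, 84
design = np.repeat([0.0, 1.0] * 7, 6)[:n_scans]
hrf = canonical_hrf(np.arange(0.0, 32.0, TR))
expected = np.convolve(design, hrf)[:n_scans]
```

The response term peaks near t = d₁ = 6 s and the subtracted term produces the post-stimulus undershoot around t = d₂ = 16 s.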
5.4.2 Auditory-visual data set

We then tested the sparse decomposition model on the well-known auditory-visual fMRI data set provided with the FSL FEEDS package [158], used as a benchmark. The data set contains 45 time-points and 5 slices of size 64 × 64 voxels each. It was specifically designed as such in order to make the responses more difficult to detect. We ran our model on the dataset in order to detect 'consistently task related' (CTR) components. We applied the standard preprocessing steps (motion correction, registration, etc.), but no variance normalization or dimensionality reduction. For each of the separated components, we computed the correlation coefficient, r, between the associated timecourse, a_l, and the 'expected timecourses', which were the canonical explanatory variables (EVs) from FEAT. After convergence, the model inferred only L = 3 components with r > 0.3. The component with the highest value of r was identified as the CTR map. A strong visual and a strong auditory component were extracted by the model; these are shown in Fig. 5.8. The correlation coefficients were r_{vis} = 0.858 and r_{aud} = 0.764. The corresponding PICA coefficients from MELODIC were 0.838 and 0.756, respectively. The result of VB-ICA [152] on the same dataset was 0.780 and 0.676, respectively [78]. It is worth noting that the spatial maps extracted from our model, displayed in Fig. 5.8, were also much cleaner than those from both PICA (also shown in Fig. 5.8) and VB-ICA (not shown). This is due to applying the sparse prior on the maps.
5.4.3 Visual data set: An Experiment with Manifold
Learning
We now return to the small visual data set of Friston et al. [71], used in
sections 3.4.3 and 4.6.2. The purpose of this experiment is to combine the
[Figure 5.8: Time courses and corresponding spatial maps resulting from applying the variational Bayesian sparse decomposition model to a visual-auditory fMRI data set. Panels: auditory and visual stimulus timecourses (VB-SR, PICA, canonical EV from the GLM) and the auditory and visual spatial maps from PICA and VB-SR. Red curve: our model; green curve: PICA; blue curve: canonical EVs. Note that the maps are the raw results from the model; no thresholding post-processing was performed.]
VB-SMF model for sparse data decomposition with manifold¹⁷ learning. Up to now we have been using data matrices that contained all data-points, i.e. all scans at time-points t = 1, …, T, in their rows, without any dimensionality reduction. In neuroimaging analysis practice, however, using all observation dimensions is usually impractical and time-consuming. Moreover, as mentioned in Sect. 5.1 and in [15] by Beckmann and Smith, for most practical purposes the signal is contained in a lower-dimensional subspace of the original ambient observation space, the rest containing processes that can be characterized as "noise". However, as recent research in Machine Learning has shown [114], [83], [42], a better model of the geometry of high-dimensional datasets is a curved manifold. This approach was also taken by [173], although the authors did not explicitly state it as such. A manifold can potentially capture the intrinsic dimensionality better than a linear object. Note that we have defined a data point as a whole scan, x_t = {X(t,v) : v ∈ V}, here. We will continue to use all voxels and only

¹⁷A manifold is a smooth geometric object that can be locally described by a hyperplane, its tangent plane.
reduce the dimensionality in the temporal dimension, as is done when using PCA pre-processing, for example. This essentially means that we implicitly assume that the activation dynamics can be described by a smaller number of dynamical parameters than T. As also mentioned in the Introduction of the Thesis, (spatial) component analysis algorithms applied to spatio-temporal data can also be interpreted as soft multi-way clustering methods, in the sense that the observation at each voxel, v, x_v = (x_{1,v}, …, x_{T,v}), can be described by very few (ideally just one) learned "regressors", if these manage to sufficiently capture the inherent dynamics in the data. Conversely, the observations projected on to the extracted regressors, visualized in latent space, should ideally cluster around those vectors. The feature-space SMF model was defined in a way that did not pre-suppose any particular feature set. We are free to combine the temporal dimensionality reduction of manifold learning with the sparse wavelet decomposition in the spatial dimension. We also note here that dimensionality reduction schemes are fully compatible with automatic relevance determination (ARD), since ARD is used here to select the relevance of the L bases, {a_l}_{l=1}^{L}. That is, the dimensionality of the latent space is inferred to be less than or equal to L. In practice, L itself can result (for high-dimensional problems where we believe that L ≪ D) from any dimensionality reduction method. This somewhat pragmatic approach was advocated by Neal and Zhang in [143] for their winning entry in the NIPS 2003 challenge, where PCA was used as dimensionality reduction pre-processing.
In this experiment, the Locality Preserving Projection (LPP) technique of He and Niyogi [83] was used, due to its computational simplicity, although any manifold learning method could, in principle, be used. LPP starts by regarding each data point x ∈ X = {x₁, …, x_m} ⊂ R^n (using the notation of [83]) as a vertex of a graph, G, which is itself considered as a sample from a manifold, M, embedded in R^n. The goal is to find a linear map f : M → R^l (l ≪ n), represented by a matrix, A, such that locality is preserved, i.e. points close together on the manifold (graph) are mapped close together in the transformed space, R^l. We denote the projected points by y_i = A x_i. The graph is built by exploiting neighboring information in the data set. If x_i and x_{i′} are close, they are considered neighbors, and an edge is put between them. Connectivity can be defined by various criteria and can be "topological", such as taking the k nearest neighbors of each data point, or "geometrical", by regarding as neighbors all points lying at a distance smaller than ε. The relative importance of each point is defined by a weight matrix, W, which can again be defined in a variety of ways, from the simplest, setting W_{ii′} = 1 if i and i′ are neighbors, to using the heat kernel, W_{ii′} = e^{−‖x_i − x_{i′}‖²/t}. "Closeness" is defined by the Riemannian metric on M and by the weight W on the graph G. Note that we have implicitly used the definition of a manifold, as a smooth object locally approximated by a plane, and the fact that if x_{i′} is in the neighborhood of x_i we can use the Euclidean distance, ‖ · − · ‖, in the tangent "plane" (space) E^l. The projection operator is found by solving an optimization problem that minimizes the norm of y = (y₁, …, y_m) under the W metric, ‖y‖²_W = y^T W y,

    J = Σ_{ii′} (y_i − y_{i′})² W_{ii′} ,

so that connected points stay close together, and using the locality preservation requirement, ‖y‖²_D = y^T D y, which weighs the relative importance of each y_i using the diagonal matrix D with D_{ii} = Σ_{i′} W_{ii′}, such that the "roughness" of the map is minimized,

    y^T D y = 1  ⇒  (a^T X) D (X^T a) = 1 .

Note that this weighs the vector of images, y, as a whole. By its definition,
the weight matrix D provides a natural measure on the data points. The optimization problem then becomes

    min_a  a^T X L X^T a   subject to   a^T X D X^T a = 1 ,   (5.52)

where L is the Laplacian matrix, L = D − W. As with the Laplacian on a manifold, the discrete graph version presented here penalizes deviation from smoothness. The above optimization problem is equivalent to solving the generalized eigenvalue problem

    (λ, a) :  X L X^T a = λ X D X^T a .   (5.53)
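The steps above can be sketched as follows: a k-NN graph with heat-kernel weights, the graph Laplacian, and the generalized eigenproblem (5.53) solved with the smallest eigenvalues retained. The small ridge added to keep XᵀDX definite and the particular defaults are our choices for this sketch:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, k=5, t=1.0, l=2):
    """Locality Preserving Projections (He & Niyogi), minimal sketch.
      X : (m, n) data points in rows.
    Returns an (n, l) projection matrix whose columns solve Eq. (5.53)."""
    m = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]              # k nearest neighbours (no self)
    W = np.zeros((m, m))
    for i in range(m):
        W[i, idx[i]] = np.exp(-D2[i, idx[i]] / t)         # heat-kernel weights
    W = np.maximum(W, W.T)                                # symmetrize the graph
    Dm = np.diag(W.sum(axis=1))
    Lap = Dm - W                                          # graph Laplacian L = D - W
    A_mat = X.T @ Lap @ X
    B_mat = X.T @ Dm @ X + 1e-9 * np.eye(X.shape[1])      # ridge keeps B definite
    w, V = eigh(A_mat, B_mat)                             # ascending eigenvalues
    return V[:, :l]                                       # smallest-eigenvalue directions
```

Each point is then mapped by y_i = Pᵀ x_i; on data sampled along a noisy line, the leading direction recovers the line's parameterization.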
We would now like to sparsely represent the dataset by applying SMF in the reduced-dimensionality projection space E^l defined by the manifold learning. Note that while the geometric object is nonlinear, the projection operation of the LPP method is linear. As before, we want to extract the CTRs from the fMRI data. Using LPP we first reduce the dimensionality of the dataset in the temporal dimension. We found empirically that a latent dimension of 5 was sufficient to capture the essential information in the data for subsequent processing. We then ran VB-SMF on the reduced dataset. The algorithm converged in about 40 iterations. The evolution of the negative free energy is shown in Fig. 5.9. Examining the variances, α_l^{−1}, of the columns of the mixing matrix, a_l, it is clear that there is a dominant regressor, while the rest contribute very little in explaining the data. A histogram of the {α_l^{−1}}_{l=1}^{L} is shown in Fig. 5.10. Selecting the component that corresponds to the index with the highest variance as the CTR, we obtain the spatial map and time-course shown in Fig. 5.11. The correlation coefficient between the VB-SMF extracted timecourse and the reference function from
[Figure 5.9: Evolution of the negative free energy, F, of the VB-SMF model run on the reduced-dimensionality data set of Friston et al. [71].]
the GLM was r_{VB−SMF} = 0.922. Finally, as in sections 3.4.3 and 4.6.2, we plot the ROC curve with respect to the spatial map obtained from the GLM in Fig. 5.12. The ROC power, defined as the area under the curve (AUC), was AUC_{VB−SMF} = 0.867.
5.5 Discussion
We have presented a sparse representation model incorporating wavelets and
sparsity-inducing adaptive priors under a full Bayesian paradigm. This en-
ables the estimation of both latent variables and basis functions, in a prob-
abilistic graphical modelling formalism. We employed a variational frame-
work for efficient inference. The preliminary results presented here suggest
improved performance compared to other state-of-the-art model-free tools,
such as PICA, while potentially allowing for more interpretable activation
patterns due to the implicit denoising.
[Figure 5.10: Variances of the columns of the mixing matrix, {α_l^{−1}}_{l=1}^{L}, showing that there is a dominant regressor describing the data set. The x-axis is indexed by l and the y-axis ranges from 0 to 0.4.]
[Figure 5.11: Time course (thick black curve) and corresponding spatial map resulting from applying the VB-SMF model to the reduced-dimensionality fMRI data set of Friston et al. [71]. The time-paradigm is shown as the blue curve and the expected timecourse as the green one.]
[Figure 5.12: Receiver operating characteristic curve for the spatial map of the data set of Friston et al. [71].]
Chapter 6
Epilogue
In this thesis we have studied the problem of decomposing data into components when sparsity of the representation is a key prior assumption. Sparsity, as an algorithmic paradigm for empirical data modelling, has become increasingly popular in Machine Learning and Applied Statistics over the past few years, as witnessed by the flurry of research activity in this area.
Our work focused on modelling the blind inverse problem of source sepa-
ration under sparsity constraints, understanding the factors which influence
its efficiency, empirically evaluating its performance, and characterizing and
improving its behavior. We studied these issues for the problem of decom-
posing and representing datasets, especially high-dimensional data matrices,
with emphasis on Imaging Science as our primary application domain. We
focused our attention on extracting meaningful activations from weak spatio-
temporal signals such as functional MRI.
Neuroimaging techniques, such as functional magnetic resonance imaging (fMRI), have revolutionised the field of neuroscience. They provide non-invasive, or minimally invasive, regionally-specific in vivo measurements of brain activity. However, these measurements are only indirect observations of the underlying neural processes. Reconstructing brain activations
from observations is an inverse problem. Consequently, there is a need for
mathematical models, linking observations and domain-specific variables of
interest, and algorithms for making inferences given the data and the mod-
els. Currently, these models are sometimes based on the physical generative
process, but more often on statistical principles.
The main tool for “model-free” decompositions of neuroimaging data into
components currently in use is ICA. While ICA is a powerful decomposition
method, and oftentimes uses sparse source models, its fundamental under-
lying assumption is statistical independence. For many real-world physical
and biological datasets this assumption is not plausible.
In this Thesis we proposed three methods that are based explicitly on
sparsity, and implicitly on a generalized notion of smoothness. We directly
encoded properties of the components that are favored by neuroscientists in
our priors, in particular the properties of smoothness, localization, and spar-
sity. The localization property of activations in space is neuroscientifically
plausible, and leads to a notion of ‘sparsity’ via the use of basis functions
with local support. We emphasize here that locality is fully compatible with
the idea of possibly disconnected, distributed activations/representations.
The smoothness property seems biologically reasonable: fMRI measures
changes in blood oxygenation, which induce a “blurring” effect, and
smoothness accounts for both the spatially extended nature of the brain
haemodynamic response and the distributed nature of neuronal sources.
This is one of the main reasons (along with increasing the SNR) for the
spatial smoothing preprocessing
typically employed in fMRI image analysis. In contrast to the fixed-width
Gaussian kernel smoothing usually used in practice, however, multiresolution
families encompass many spatial scales, obviating the need to pre-select a
particular spatial scale and providing additional benefits, which will be
discussed below. Functional MRI data analysis is therefore a very fruitful
application of sparse models.
However, the aim of sparse representation seems intrinsically significant and
fundamental beyond neuroimaging. As discussed in the text, sparseness (with
respect to appropriate dictionaries) leads to energy-efficient, minimum-entropy
representations, a characteristic that seems to have a physical significance.
The material of this Thesis is split into six Chapters. The context and
background of the research presented in this Thesis were given in the first
chapter, where the problem and the motivating application of the developed
methods were also introduced.
The required background on methods for data decomposition was pro-
vided in the second chapter. The material was organized from the point
of view of second versus higher-order decompositions. The various methods
were motivated as techniques for seeking structure in data. In particular, an
intuitive geometric interpretation was emphasized, where component analy-
sis methods are seen as seeking an intrinsic coordinate system in data space
that best reveals the structure within the data distribution. (This will
later lead to an insight about a possible reason for the apparent success of
ICA-type separation methods for fMRI.) While second-order methods such
as PCA are still valuable as a pre-processing step, e.g. for dimensionality
reduction purposes, the orthogonality constraint implicitly imposed on the
principal components leads to unphysical results if these are interpreted as
the spatial maps and corresponding time-courses. It was emphasized there that
the inverse problem of source extraction requires statistics beyond
simple covariance and correlation. However, PCA can be recast as an
optimization problem that minimizes the least-squares reconstruction error.
(This observation forms the basis for the penalized decomposition model,
described in the third chapter, that enforces sparsity by adding a
sparsity-inducing penalty term.) Various independent component analysis algorithms
were then introduced and their probabilistic interpretation was given. ICA
is currently the standard workhorse for exploratory analysis in fMRI, and
forms the ‘gold standard’ for model-free methods. The importance of
appropriate source-model choice was then stressed and illustrated with a
simple image-separation example. The failure of ICA to separate those
sources, even though they were independently generated, was given as a first
indication of the need for a different paradigm for source recovery: sparsity.
The main topic of this Thesis, sparse representation and data decomposi-
tion, was then introduced. Sparse representations were motivated by results
from the neuroscience community, and in particular the ability of the sparse
model of Olshausen and Field [137] to reproduce the receptive fields in the
primary visual cortex by learning sparse representations of natural scenes.
This led to a line of research in vision, statistics, and machine learning into
efficient representations of data. For our purposes, sparse representation
forms the foundation for source separation and decomposition of data matri-
ces. The methods of Donoho [50], and Zibulevsky and Pearlmutter [176] are
particularly relevant here, the latter actually forming the base formulation
for the sparse component analysis model presented in Chapter 3. However,
the former method is not adapted to the geometry of the patterns of
activations in fMRI, while the latter suffers from the problems of gradient methods.
In Chapters 3 and 4 we propose methods that address these issues. Prob-
abilistic modelling plays again an important role here, by allowing natural
extensions to more general model representations. Chapter 2 concluded with
an introduction to the Bayesian approach to data analysis, providing the
statistical framework for this Thesis.
Chapters 3 to 5 described the new methods for sparse data decomposi-
tion and blind source separation developed for this Thesis. Motivated by the
results of Benharrosh et al. [17], Golden [77], and Daubechies et al. [46],
in Chapter 3 we proposed to view the problem of blind source separation as
an inverse problem with sparsity constraints. This allows us to use results
from that field, and in particular the recently introduced method of iterative
thresholding. With the judicious use of appropriate constraints, we regu-
larize the inversion toward the desired solution for our problem domain. In
particular, we impose constraints that capture the neuroscientifically plausi-
ble properties of smoothness, localization, and sparsity. This Thesis makes
the following contributions:
• Views BSS as an inverse problem and provides an interpretation of the
resulting objective function in an energy minimization framework.
• Applies a novel algorithm for BSS, Iterative Thresholding.
• Proposes a novel way of estimating the crucial threshold parameter.
• Empirically evaluates its performance on complex spatio-temporal datasets
such as fMRI.
In particular, inspired by the energy model of Olshausen and Field [137],
we first formulated the problem in an energy minimization framework using
non-quadratic constraints that promote sparse (minimum entropy) solutions.
This allows us to make the connection with results from the neural compu-
tation literature and to formulate the optimization functional in an intuitive
way. We then followed Donoho’s [50] and Zibulevsky and Pearlmutter’s [176]
idea of representing the unknown signals in a signal dictionary. The central
idea here is to directly exploit the sparse representability of the spatial part
of the solution in appropriate dictionaries. Using mathematical arguments,
we defined precisely the kind of dictionary that has the potential to produce
sparse representations for the class of signals we are interested in extracting.
We then adapted the recently developed method of iterative thresholding of
Daubechies et al. [45] to the case of sparse representations with an unknown
basis and proposed a practical, if heuristic, method for estimating the cru-
cial, threshold parameter for fMRI data. After a few simulations that aimed
226
at demonstrating the applicability of SCA-IT to acoustic and natural image
data we turned to our main application, fMRI data decomposition. We first
applied our method to the simulated experiment of [46] and showed that it
successfully separated mixtures in which standard ICA fails. Furthermore,
in contrast to standard ICA, when applied to real fMRI data our model was
capable of extracting activations that are represented in single spatial maps,
and which could be interpreted as corresponding to a single physiological
cause.
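The core iteration just described, a Landweber step followed by a shrinkage/thresholding step, can be sketched for the simpler case of a known mixing operator; the alternating estimation of the unknown mixing matrix in SCA-IT is omitted, and all sizes and parameter values below are illustrative:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ist(y, A, lam, n_iter=2000):
    """Iterative soft thresholding for min_s ||y - A s||^2 + 2 lam ||s||_1,
    assuming the operator is rescaled so that ||A|| <= 1 (a Landweber
    gradient step followed by shrinkage)."""
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        s = soft(s + A.T @ (y - A @ s), lam)
    return s

# Toy sparse-recovery problem: a 3-sparse coefficient vector observed
# through a random 30x60 operator with a little noise.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60))
A /= np.linalg.norm(A, 2)                 # keep the iteration non-expansive
s_true = np.zeros(60)
s_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = A @ s_true + 0.01 * rng.standard_normal(30)
s_hat = ist(y, A, lam=0.01)
```

The thresholding step is what produces exact zeros, i.e. sparsity, in the recovered coefficients.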
The threshold estimation procedure proposed in Chapter 3, although it works
well in practice for fMRI data, is rather ad hoc. It makes certain
assumptions about a number of settings that may not hold in more general
data processing problems. A more principled approach would be needed if
the model aims to be appealing in a variety of blind inverse problem sit-
uations. We addressed the above problems under a Bayesian paradigm in
Chapter 4. The interpretation of classical models from a probabilistic point
of view introduced in the second chapter highlights better the structure of
component analysis models in general: one can define a “family tree of com-
ponent analysis models” that are constructed using different probabilistic as-
sumptions about the latent signals, the observation operator and the noise,
and that are linked using increasingly sophisticated priors. This structure
can be visualized using directed acyclic graphs. In Chapter 4 we proposed
to reinterpret the variational optimization functional of Daubechies et al.
[45] from a Bayesian point of view. We derived an efficient variational EM-
type algorithm for blind linear inverse problems by lower-bounding the data
likelihood and we showed that it contains the original iterative thresholding
algorithm of Daubechies et al. as a special case. As in the original IT al-
gorithm of Daubechies et al., our algorithm is guaranteed to converge,
increasing the data likelihood at each iteration, owing to the convergence
properties of the EM algorithm. In contrast to the original variational
functional, however, where the regularization coefficients have to be set a priori
or estimated using a variety of ad-hoc procedures, in our algorithm these
have a precise probabilistic meaning, as hyperparameters of probability dis-
tributions. This allows us to estimate them from the data itself. We exploit
this fact and use a model for the expansion coefficients of the latent signals
that adapts to the sparsity characteristics of the individual signals at hand.
We prove that the optimal value of the threshold is precisely a function of
those hyperparameters and the system noise precision. As also noted by
MacKay [116], in contrast to the classical methods, the Bayesian approach
automatically uses the semantically correct objects during estimation: for
the threshold on the expansion coefficients of the latent signals it only makes
sense to use the state noise precision, not the sensor one.
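The relation between the threshold and the model hyperparameters can be illustrated on a toy scalar model (not the Chapter 4 algorithm itself): for an observation y = θ + ε with noise precision β and a Laplace prior of scale b on θ, the MAP estimate is soft-thresholding of y with threshold 1/(βb), a function of the prior hyperparameter and the noise precision. A numerical check (all values illustrative):

```python
import numpy as np

beta, b = 4.0, 0.5            # noise precision and Laplace scale (illustrative)
t = 1.0 / (beta * b)          # the implied soft threshold

def soft(y, t):
    return np.sign(y) * max(abs(y) - t, 0.0)

def map_grid(y):
    """Brute-force MAP of p(theta | y) for y = theta + noise,
    noise ~ N(0, 1/beta), theta ~ Laplace(0, b)."""
    th = np.linspace(-5, 5, 200001)
    logpost = -0.5 * beta * (y - th) ** 2 - np.abs(th) / b
    return th[np.argmax(logpost)]

for y in (-2.0, -0.3, 0.1, 1.7):
    assert abs(map_grid(y) - soft(y, t)) < 1e-3
```

Estimating β and b from the data thus fixes the threshold automatically, rather than requiring it to be tuned by hand.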
We applied Bayesian Iterative Thresholding to the problem of blind source
separation. We first experimented with an audio separation problem used in
[147], [?], and [149] as a benchmark. Our method achieved higher reconstruction
accuracy than these previous methods. On a more realistic task, that
of separating fMRI data, the Bayesian IT algorithm performed better than
the energy IT (which in turn outperformed ICA on the same task).
Finally, Chapter 5 dealt with the related problem of sparse matrix
factorization. We especially focused on decomposing data matrices that rep-
resent spatio-temporal data. Given a data matrix, the goal is to factorize
it into a product of a matrix containing the dynamics times a matrix con-
taining the spatial variation, or equivalently, as an inference problem, invert
the effect of these two factors. We enforced sparsity by transforming the
spatial dimension of signals in a wavelet basis and using a sparse prior on
the corresponding wavelet coefficients. This sparsity-enforcing combination
allowed us to avoid the problems mentioned in remark 6, that is that directly
modelling the gray-scale values of an image using a MoG may result in loss
of geometric information. Spatial wavelets have a larger support than voxels
and function as “super-voxels” capturing local geometric features. Then
the sparse prior models distributions in feature-space, not voxel intensities
themselves. To ensure that a parsimonious set of temporal bases is inferred
we used a hierarchical, ‘sparse Bayesian’-style prior and automatic relevance
determination. Signal reconstruction and parameter estimation were carried
out in a fully Bayesian paradigm. As in all modelling work, there is a trade-off
between mathematical elegance and efficiency. For large-scale datasets ob-
taining an analytic solution is desirable. We chose to work in the variational
Bayesian framework, using distributions from the conjugate-exponential fam-
ily. The model typically converged in far fewer iterations than the iterative
thresholding ones. Testing the model on benchmark fMRI data, it outper-
formed state-of-the-art ICA algorithms while at the same time extracting
much cleaner maps, which can potentially offer better interpretability. We
finally experimented with using a manifold learning technique, ‘Locality Pre-
serving Projections’ (LPP), for dimensionality reduction with very promising
results.
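The “super-voxel” intuition, that localized, piecewise-constant spatial patterns become sparse once moved into a wavelet basis, can be illustrated with a hand-rolled orthonormal Haar transform (a stand-in for the wavelet families used in the chapter; the signal is illustrative):

```python
import numpy as np

def haar(x):
    """Orthonormal 1-D Haar wavelet transform (input length a power of two).
    Piecewise-constant, localized patterns become sparse in this basis."""
    out, a = [], np.asarray(x, float)
    while a.size > 1:
        out.append((a[0::2] - a[1::2]) / np.sqrt(2))   # detail coefficients
        a = (a[0::2] + a[1::2]) / np.sqrt(2)           # approximation
    out.append(a)
    return np.concatenate(out[::-1])

# A piecewise-constant "spatial map": 96 of 128 voxels are non-zero,
# yet only a handful of Haar coefficients are.
x = np.repeat([0.0, 2.0, -1.0, 0.5], 32)
c = haar(x)
```

A sparse prior placed on `c` rather than on the voxel intensities therefore models local geometric features directly, which is the role the wavelet-domain prior plays in the VB-SMF model.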
Open Questions and Future Work
Although the performance and benefits of the sparse decomposition algo-
rithms presented in this Thesis have been comprehensively demonstrated,
there are still some practical issues to address. We discuss these, along with
some suggestions for possible future extensions, below.
• The overcomplete source separation experiment for the cocktail party
problem discussed in Sect. 3.4.2 (Lee et al., [107]) provided results that
are on par with those of [107] but not better than those of [76]. This
is not an inherent deficiency of the method, however. As noted by
Zibulevsky et al. in [177] and Li et al. in [111], wavelet bases are not
the best adapted to these types of signals, and the best results were
obtained using the spectrogram directly. More experiments, using a
better dictionary, could be performed using this approach.
• For the energy IT model, an interesting alternative to explore and com-
pare with for estimating the regularization coefficients is the homotopy
method. As noted earlier, the optimization cost function of Eq. 3.11
can be seen as searching for a minimizer that is a smooth interpolation
between a solution minimizing an ℓ2 measure (for the data-fit term) and
a sparseness penalty term, controlled by the Lagrange multiplier λC.
The homotopy method explores the whole regularization path between
these two extremes. This will provide a principled way of applying
the method to a wider variety of applications beyond fMRI. We also
note that the least angle regression (LARS) method [57] used the same
approach as an extension to LASSO sparse modeling [81].
• While the wavelet transform is known to largely decorrelate the obser-
vations, there are still some residual dependencies among wavelet co-
efficients. One could exploit these and design priors that model these
dependencies; see [41], for example. This could lead to even better
separation results. The Bayesian framework directly supports these hi-
erarchical extensions to the models discussed in this Thesis. This is an
ongoing area of research.
• The fully Bayesian variational approach allows for sparse priors and ma-
trix factorization models conditioned directly on wavelet transformed
data. We showed that this significantly improves separation results and
allows for simultaneous unmixing and wavelet-based denoising. Ide-
ally, however, the functional mapping between the observations and the
wavelet coefficients (which we may take to be non-linear as well) would
be inferred along with the sparse matrix factorization (SMF) model.
This leads naturally to non-linear extensions of SMF. A promising
avenue of research in this direction is to combine the VB-SMF model
with the data-dependent non-linear transformations coming from the
manifold learning field. Preliminary experiments with such a combination
were shown in Chapter 5.
• For the VB-SMF model, the prior for the timecourses, as it is now, does
not exploit the temporal structure in the signals. Timeseries models
could provide a better extraction of the dynamics of the activations.
Work in this area by Woolrich et al. [171] and others has provided
examples and methodologies of such an approach.
• The software implementation of the models discussed in this Thesis
is that of research-level prototypes. Therefore, in terms of speed and
robustness, there is still work to be done. From a software engineer-
ing viewpoint, specialized data structures for storing, processing, and
retrieving the huge datasets encountered in fMRI could be used. An-
other promising area is to exploit the inherent parallelism offered by
advances in hardware. Variational Bayesian learning can be imple-
mented as message-passing on graphical probabilistic models, leading
to the Variational Message Passing algorithm [170], which factorizes
over cliques. This hints at the possibility of implementing sets of esti-
mation equations for performing inference that are updated in parallel.
• Finally, as commented on earlier in the Thesis, full results of applying
sparse models on neuroscientific data, and their importance for brain
science, will have to be thoroughly investigated. This should be done
in collaboration with experts in the field of neuroscience.
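Returning to the homotopy suggestion above: for an orthonormal design the entire ℓ1 regularization path has a closed form, namely soft-thresholding of the least-squares coefficients, so one can watch the active set shrink as the multiplier grows. A minimal sketch (NumPy; sizes and values illustrative; general designs require LARS/homotopy solvers):

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((40, 40)))
A = Q[:, :10]                               # orthonormal design: A^T A = I
s_true = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = A @ s_true + 0.05 * rng.standard_normal(40)

z = A.T @ y                                 # least-squares coefficients

def lasso_solution(lam):
    # With A^T A = I, the minimizer of (1/2)||y - A s||^2 + lam ||s||_1
    # is soft-thresholding of z, so the path is piecewise linear in lam.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Sweeping lam traces the regularization path: the active set shrinks
# monotonically from the least-squares solution toward zero.
supports = [np.count_nonzero(lasso_solution(l)) for l in (0.5, 1.5, 2.5, 5.0)]
```

The homotopy/LARS methods compute this path exactly for general (non-orthonormal) designs, which is what makes them a candidate for setting the regularization coefficients of the energy IT model.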
Bibliography
[1] F. Abramovich and B. W. Silverman. Wavelet decomposition approaches to statistical inverse problems. Biometrika, 85(1):115–129, 1998.
[2] S.-I. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.
[3] H. Asari, B. Pearlmutter, and A. Zador. Sparse representations for the cocktail party problem. Journal of Neuroscience, 26(28):7477–7490, 2006.
[4] J. Atick and A. Redlich. Towards a theory of early visual processing. Neural Computation, 2(3):308–320, 1990.
[5] J. Atick and A. Redlich. What does the retina know about natural scenes? Neural Computation, 4(2):196–210, Mar. 1992.
[6] H. Attias. Independent Factor Analysis. Neural Computation, 11(4):803–851, May 1999.
[7] H. Attias. A Variational Bayesian Framework for Graphical Models. In Advances in Neural Information Processing Systems 12, pages 209–215. MIT Press, 2000.
[8] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. J. Mach. Learn. Res., 6:1345–1382, December 2005.
[9] R. Baraniuk. Compressive Signal Processing, 2010.
[10] H. B. Barlow. Unsupervised Learning. Neural Computation, 1(3):295–311, 1989.
[11] H. B. Barlow. Unsupervised Learning. Neural Computation, 1(3):295–311, 1989.
[12] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1:412–423, 1989.
[13] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding Minimum Entropy Codes. Neural Computation, 1(3):412–423, 1989.
[14] J. Basak, A. S., D. Trivedi, M. S. Santhanam, T.-W. Lee, and E. Oja. Weather data mining using independent component analysis. Journal of Machine Learning Research, 5:239–253, 2004.
[15] C. Beckmann and S. Smith. Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans. Med. Imag., 23:137–152, 2004.
[16] A. Bell and T. J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
[17] M. Benharrosh, E. Roussos, S. Takerkart, K. D’Ardenne, W. Richter, J. Cohen, and I. Daubechies. ICA components in fMRI analysis: Independent sources? In 10th Neuroinformatics Annual Meeting, A Decade of Neuroscience Informatics: Looking Ahead, April 2004.
[18] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987.
[19] V. I. Bogachev. Gaussian measures. Amer. Math. Soc., 1998.
[20] B. Borchers. A Bayesian Approach to the Discrete Linear Inverse Problem, 2003.
[21] E. Bullmore, J. Fadili, M. Breakspear, R. Salvador, J. Suckling, and M. Brammer. Wavelets and statistical analysis of functional magnetic resonance images of the human brain. Statistical Methods in Medical Research, 12:375–399, 2003.
[22] W. L. Buntine. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2:159–225, 1994.
[23] V. D. Calhoun and T. Adali. Unmixing fMRI with independent component analysis. IEEE Engineering in Medicine and Biology Magazine, 25(2):79–90, March-April 2006.
[24] V. D. Calhoun, T. Adali, G. D. Pearlson, and J. J. Pekar. Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Human Brain Mapping, 13(1):43–53, 2001.
[25] D. Calvetti and E. Somersalo. An Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, volume 2 of Surveys and Tutorials in the Applied Mathematical Sciences. Springer, 2007.
[26] J.-F. Cardoso. Infomax and maximum likelihood for blind separation. IEEE Signal Process. Lett., 4(4):112–114, 1997.
[27] J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 90(8):2009–2026, Oct. 1998.
[28] J.-F. Cardoso. The three easy routes to independent component analysis; contrasts and geometry. In Proc. of the ICA 2001 workshop, San Diego, Dec. 2001.
[29] M. A. Carreira-Perpinan. Continuous latent variable models for dimensionality reduction and sequential data reconstruction. PhD thesis, University of Sheffield, UK, 2001.
[30] M. Carroll, G. Cecchi, I. Rish, R. Garg, and A. Rao. Prediction and interpretation of distributed neural activity with sparse models. NeuroImage, 44(1):112–122, 2009.
[31] A. Chambolle, R. A. DeVore, N. Lee, and B. J. Lucier. Nonlinear wavelet image processing: Variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. on Image Proc., 7(3):319–355, July 1998.
[32] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic Decomposition by Basis Pursuit. SIAM J. Sci. Comput., 20:33–61, 1998.
[33] Z. Chen and S. Haykin. On different facets of regularization theory. Neural Computation, 14:2791–2846, December 2002.
[34] H. Choi and R. G. Baraniuk. Wavelet Statistical Models and Besov Spaces. In Proceedings of SPIE Technical Conference on Wavelet Applications in Signal Processing VII, Denver, USA, Jul. 1999.
[35] H. Choi and R. G. Baraniuk. Multiple Wavelet Basis Image Denoising Using Besov Ball Projections. IEEE Signal Processing Letters, 11(9), September 2004.
[36] R. Choudrey, W. D. Penny, and S. J. Roberts. An ensemble learning approach to independent component analysis. In Proceedings of the 2000 IEEE Signal Processing Society Workshop, pages 435–444. IEEE Neural Networks for Signal Processing X, 2000.
[37] R. A. Choudrey. Variational Methods for Bayesian Independent Component Analysis. PhD thesis, Department of Engineering Science, University of Oxford, 2003.
[38] R. A. Choudrey and S. J. Roberts. Flexible Bayesian Independent Component Analysis for Blind Source Separation. In Proceedings of ICA-2001, San Diego, USA, December 2001.
[39] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[40] R. T. Cox. Probability, Frequency, and Reasonable Expectation. Am. Jour. Phys., 14:1–13, 1946.
[41] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing, Apr 1998.
[42] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. PNAS, 100(10):5591–5596, May 2003.
[43] G. Darmois. Analyse Generale des Liaisons Stochastiques. Rev. Inst. Internat. Stat., 21:2–8, 1953.
[44] I. Daubechies. Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[45] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57, 2004.
[46] I. Daubechies, E. Roussos, S. Takerkart, M. Benharrosh, C. Golden, K. D’Ardenne, W. Richter, J. Cohen, and J. Haxby. Independent component analysis for brain fMRI does not select for independence. PNAS, 106(26):10415–10422, 2009.
[47] J. Dauwels, S. Korl, and H.-A. Loeliger. Expectation Maximization as Message Passing. In International Symposium on Information Theory (ISIT), Sept. 2005.
[48] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
[49] I. S. Dhillon and D. M. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001.
[50] D. L. Donoho. Sparse components of images and optimal atomic decompositions. Constructive Approximation, 17:353–382, 2001.
[51] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. PNAS, 100(5):2197–2202, March 2003.
[52] D. L. Donoho and A. G. Flesia. Can recent advances in harmonic analysis explain recent findings in natural scene statistics? Network: Computation in Neural Systems, 12:371–393, 2001.
[53] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[54] D. L. Donoho. De-noising by soft-thresholding. IEEE Trans. on Inf. Theory, 42(3):613–627, 1995.
[55] D. Dueck, Q. D. Morris, and B. J. Frey. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21(suppl 1), 2005.
[56] A. W. Eckford. Channel estimation in block fading channels using the factor graph EM algorithm. In Proc. 22nd Biennial Symposium on Communications, 2004.
[57] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004.
[58] R. Everson and S. Roberts. Independent Component Analysis: A Flexible Nonlinearity and Decorrelating Manifold Approach. Neural Computation, 11(8):1957–1983, November 1999.
[59] F. Girosi, M. Jones, and T. Poggio. Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2):219–269, Mar. 1995.
[60] F. O’Sullivan. A Statistical Perspective on Ill-Posed Inverse Problems. Statistical Science, 1(4):502–518, Nov. 1986.
[61] H. Farid and E. H. Adelson. Separating Reflections and Lighting Using Independent Components Analysis. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 262–267, Fort Collins, CO, 1999.
[62] D. J. Field. Wavelets, vision and the statistics of natural scenes. Phil. Trans. R. Soc. Lond. A, 357:2527–2542, 1999.
[63] D. J. Field. Wavelets, vision and the statistics of natural scenes. Phil. Trans. R. Soc. Lond. A, 357(1760):2527–2542, September 1999.
[64] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4:2379–2394, 1987.
[65] B. G. Fitzpatrick. Bayesian analysis in inverse problems. Inverse Problems, 7(5), 1991.
[66] G. Flandin and W. D. Penny. Bayesian fMRI data analysis with sparse spatial basis function priors. NeuroImage, 34(3):1108–1125, 2007.
[67] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biol. Cybernetics, 64(2):165–170, 1990.
[68] D. Freedman and P. Diaconis. On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4):453–476, Dec. 1981.
[69] J. H. Friedman and J. W. Tukey. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers, 23(9):881–890, Sep. 1974.
[70] K. Friston. Models of brain function in neuroimaging. Ann. Rev. Psychol., 56:57–87, 2005.
[71] K. J. Friston, P. Jezzard, and R. Turner. Analysis of functional MRI time-series. Human Brain Mapping, 1(2):153–171, 1994.
[72] Z. Ghahramani. Solving inverse problems using an EM approach to density estimation. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, page 8, 1994.
[73] Z. Ghahramani and M. J. Beal. Variational Inference for Bayesian Mixtures of Factor Analysers. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12 (NIPS 1999), pages 449–455. The MIT Press, 1999.
[74] Z. Ghahramani and S. Roweis. Probabilistic models for unsupervised learning. Tutorial presented at the 1999 NIPS Conference, 1999.
[75] M. Girolami, editor. Advances in Independent Component Analysis. Perspectives in Neural Computing. Springer-Verlag, New York, Aug. 2000.
[76] M. Girolami. A Variational Method for Learning Sparse and Overcomplete Representations. Neural Computation, 13(11):2517–2532, Nov. 2001.
[77] C. Golden. Spatio-Temporal Methods in the Analysis of fMRI Data in Neuroscience. PhD thesis, Princeton University, 2005.
[78] A. Groves. Bayesian Learning Methods for Modelling Functional MRI. PhD thesis, Department of Clinical Neurology, University of Oxford, 2010.
[79] J. Hadamard. Sur les problemes aux derivees partielles et leur signification physique. Princeton University Bulletin, 13:49–52, 1902.
[80] G. F. Harpur and R. W. Prager. Development of low entropy coding in a recurrent network. Network: Computation in Neural Systems, 7, 1996.
[81] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, February 2009.
[82] J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539):2425–2430, September 2001.
[83] X. He and P. Niyogi. Locality Preserving Projections. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.
[84] S. Hochreiter and J. Schmidhuber. Source separation as a by-product of regularization. In Advances in Neural Information Processing Systems, pages 459–465. MIT Press, 1999.
[85] D. R. Hunter and K. Lange. A Tutorial on MM Algorithms. The American Statistician, 58(1), February 2004.
[86] A. Hyvarinen. The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis. Neural Processing Letters, 10(1):1–5, 1999.
[87] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, June 2001.
[88] A. Hyvarinen and R. Karthikesh. Imposing sparsity on the mixing matrix in independent component analysis. Neurocomputing, 49:151–162, 2002. Special Issue on ICA and BSS.
[89] A. Hyvarinen and E. Oja. A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation, 9(7):1483–1492, October 1997.
[90] A. Hyvarinen and E. Oja. Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4–5):411–430, 2000.
[91] M. Ichir and A. Mohammad-Djafari. Bayesian wavelet based signal and image separation. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Jackson Hole, Wyoming (USA), August 2004. American Institute of Physics.
[92] J. Idier, editor. Bayesian Approach to Inverse Problems. Wiley-ISTE, June 2008.
[93] A. Ilin and H. Valpola. On the effect of the form of the posterior approximation in variational learning of ICA models. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, April 2003.
[94] D. J. C. MacKay and J. W. Miskin. Application of Ensemble Learning ICA to Infrared Imaging. In Conference: Independent Component Analysis - ICA, 2000.
[95] T. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Mass. Inst. of Techn., 1997.
[96] A. Jalobeanu, L. Blanc-Feraud, and J. Zerubia. Natural image mod-eling using complex wavelets. In Proc. SPIE Conference on Wavelets,volume 5207, San Diego, August 2003.
[97] E.T. Jaynes and G.L. Bretthorst (editor). Probability Theory: TheLogic of Science. Cambridge University Press, 2003.
[98] C. Jutten and J. Herault. Blind separation of sources, part I: An adap-tive algorithm based on neuromimetic architecture. Signal Processing,24(1):1–10, 1991.
[99] K.H. Knuth. A Bayesian approach to source separation. In C. JuttenJ.-F. Cardoso and P. Loubaton, editors, Proceedings of the First In-ternational Workshop on Independent Component Analysis and SignalSeparation: ICA’99, pages 283–288, Aussios, France, Jan 1999.
[100] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee,and T. J. Sejnowski. Dictionary Learning Algorithms for Sparse Rep-resentation. Neural Computation, 15(2):349–396, February 2003.
[101] L. Landweber. An Iteration Formula for Fredholm Integral Equationsof the First Kind. American Journal of Mathematics, 73(3):615–624,Jul. 1951.
[102] H. Lappalainen. Ensemble learning for independent component analy-sis. In Proceedings of the First International Workshop on IndependentComponent Analyis, pages 7–12, 1999.
[103] H. Lappalainen. Fast Fixed-Point Algorithms for Bayesian Blind Source Separation, 1999.
[104] P. Lauterbur and Sir P. Mansfield. The Nobel Prize in Physiology or Medicine 2003. Nobelprize.org.
[105] N. D. Lawrence and C. M. Bishop. Variational Bayesian Independent Component Analysis. Technical report, 1999.
[106] N. Lazar. The Statistical Analysis of Functional MRI Data. Statistics for Biology and Health. Springer, 2008.
[107] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski. Blind Source Separation of More Sources Than Mixtures Using Overcomplete Representations. IEEE Signal Processing Letters, 6(4), Apr. 1999.
[108] D. Leporini and J.-C. Pesquet. Bayesian wavelet denoising: Besov priors and non-Gaussian noises. Signal Processing, 81(1):55–67, January 2001. Special section on Markov Chain Monte Carlo (MCMC) Methods for Signal Processing.
[109] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, February 2000.
[110] L. Li and T. Speed. Deconvolution of sparse positive spikes: is it ill-posed? Technical Report 586, University of California, Berkeley, Oct 2000.
[111] Y. Li, A. Cichocki, and S.-I. Amari. Analysis of sparse representation and blind source separation. Neural Computation, 16(6):1193–1234, June 2004.
[112] Y. Li, A. Cichocki, S.-I. Amari, S. Shishkin, J. Cao, and F. Gu. Sparse representation and its applications in blind source separation. In Proceedings of the Annual Conference on Neural Information Processing Systems, 2003.
[113] R. Linsker. A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9:1661–1665, November 1997.
[114] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, Jun. 2003.
[115] D. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[116] D. J. C. MacKay. Bayesian Interpolation. Neural Computation, 4(3):415–447, May 1992.
[117] D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448–472, May 1992.
[118] D. J. C. MacKay. Maximum likelihood and covariant algorithms for independent component analysis. Technical report, University of Cambridge, 1996.
[119] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.
[120] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on PAMI, 11(7), 1989.
[121] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 3rd edition, 2008.
[122] S. G. Mallat and Z. Zhang. Matching Pursuits with Time-Frequency Dictionaries. IEEE Transactions on Signal Processing, pages 3397–3415, December 1993.
[123] M. McKeown and T. Sejnowski. Independent component analysis of fMRI data: examining the assumptions. Hum. Brain Mapp., 6(5-6):368–372, 1998.
[124] M. J. McKeown, L. K. Hansen, and T. J. Sejnowski. Independent component analysis of functional MRI: what is signal and what is noise? Current Opinion in Neurobiology, 13(5):620–629, 2003.
[125] M. J. McKeown, S. Makeig, G. G. Brown, T.-P. Jung, S. S. Kindermann, A. J. Bell, and T. J. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6(3):160–188, 1998.
[126] T. Minka. Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems (NIPS), 2000.
[127] T. P. Minka. Expectation-Maximization as lower bound maximization, 1998.
[128] J. W. Miskin and D. J. C. MacKay. Ensemble Learning for Blind Source Separation. In S. J. Roberts and R. M. Everson, editors, Independent Components Analysis: Principles and Practice. Cambridge University Press, 2001.
[129] P. Moulin and J. Liu. Analysis of Multiresolution Image Denoising Schemes Using Generalized Gaussian and Complexity Priors. IEEE Transactions on Information Theory, 45(3), April 1999.
[130] J.-P. Nadal and N. Parga. Nonlinear neurons in the low-noise limit: a factorial code maximizes information transfer. Network: Computation in Neural Systems, 5(4):565–581, 1994.
[131] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, Dept. of Computer Science, University of Toronto, 1994.
[132] R. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, Dordrecht, 1998.
[133] S. Ogawa, T. Lee, A. Kay, and D. Tank. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. PNAS, 87(24):9868–9872, 1990.
[134] S. Ogawa, T. M. Lee, A. R. Kay, and D. W. Tank. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. PNAS, 87(24):9868–9872, December 1990.
[135] D. P. O’Leary. Regularization of ill-posed problems in image restoration. In J. G. Lewis, editor, Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, pages 102–105, Philadelphia, 1994. SIAM Press.
[136] B. A. Olshausen. Learning linear, sparse, factorial codes. Technical Report AIM-1580, Massachusetts Institute of Technology, Cambridge, MA, USA, 1996.
[137] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, June 1996.
[138] B. A. Olshausen and D. J. Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37(23):3311–3325, 1997.
[139] J. A. Palmer, D. P. Wipf, K. Kreutz-Delgado, and B. D. Rao. Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 18, pages 1059–1066. MIT Press, 2006.
[140] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
[141] B. Pearlmutter and L. Parra. Maximum Likelihood Blind Source Separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing Systems 9, pages 613–619, 1997.
[142] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317:314–319, September 1985.
[143] R. M. Neal and J. Zhang. Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees. NIPS Feature Selection Workshop, 2003.
[144] J. O. Ramsay and B. W. Silverman. Functional data analysis. Springer, New York, 2nd edition, 2005.
[145] C. R. Rao. A decomposition theorem for vector variables with a linear structure. Ann. Math. Statist., 40(5):1845–1849, 1969.
[146] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[147] S. Roberts. Independent component analysis: source assessment and separation, a Bayesian approach. Vision, Image and Signal Processing, IEE Proceedings, 145(3):149–154, Jun. 1998.
[148] S. Roberts and R. Everson, editors. Independent Component Analysis: Principles and Practice. Cambridge University Press, 2001.
[149] S. Roberts, E. Roussos, and R. Choudrey. Hierarchy, priors and wavelets: structure and signal modelling using ICA. Signal Processing, 84(2):283–297, 2004. Special Section on Independent Component Analysis and Beyond.
[150] J. K. Romberg, H. Choi, and R. G. Baraniuk. Bayesian tree-structured image modeling using wavelet-domain hidden Markov models. IEEE Transactions on Image Processing, 10(7):1056–1068, Jul 2001.
[151] E. Roussos. A novel ICA algorithm using mixtures of Gaussians. Technical report, CSBMB, Princeton University, 2007.
[152] E. Roussos, S. Roberts, and I. Daubechies. Variational Bayesian Learning for Wavelet Independent Component Analysis. In Proceedings of the 25th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering. AIP, 2005.
[153] E. Roussos, S. Roberts, and I. Daubechies. Variational Bayesian Learning of Sparse Representations and Its Application in Functional Neuroimaging. In Springer Lecture Notes in Artificial Intelligence (LNAI), Survey of the State of the Art series, 2011.
[154] M. W. Seeger. Bayesian Inference and Optimal Design for the Sparse Linear Model. Journal of Machine Learning Research, 9:759–813, Apr 2008.
[155] E. P. Simoncelli. Vision and the statistics of the visual environment. Current Opinion in Neurobiology, 13(2):144–149, April 2003.
[156] D. S. Sivia and J. Skilling. Data Analysis: A Bayesian Tutorial. Oxford University Press, second edition, 2006.
[157] P. Skudlarski, R. T. Constable, and J. C. Gore. ROC Analysis of Statistical Methods Used in Functional MRI: Individual Subjects. NeuroImage, 9(3):311–329, March 1999.
[158] S. Smith, M. Jenkinson, M. Woolrich, C. Beckmann, T. Behrens, H. Johansen-Berg, P. Bannister, M. De Luca, I. Drobnjak, D. Flitney, R. Niazy, J. Saunders, J. Vickers, Y. Zhang, N. De Stefano, J. Brady, and P. Matthews. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage, 23(S1):208–219, 2004.
[159] N. Srebro and T. Jaakkola. Sparse Matrix Factorization of Gene Expression Data. Technical report, MIT Artif. Intel. Lab., 2001. Unpublished note.
[160] J. V. Stone. Independent Component Analysis: A Tutorial Introduction. MIT Press, September 2004.
[161] J. V. Stone, J. Porrill, C. Buchel, and K. Friston. Spatial, Temporal, and Spatiotemporal Independent Component Analysis of fMRI Data. 18th Leeds Statistical Research Workshop on spatial-temporal modelling and its applications, July 1999.
[162] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005.
[163] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.
[164] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[165] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. Winston & Sons, Washington, 1977.
[166] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, Sep 2001.
[167] A. Tonazzini, L. Bedini, and E. Salerno. A Markov Model for Blind Image Separation by a Mean-Field EM algorithm. IEEE Trans. Image Proc., 15(2), 2006.
[168] F. Turkheimer, M. Brett, J. Aston, A. Leff, P. Sargent, R. Wise, P. Grasby, and V. Cunningham. Statistical modelling of PET images in wavelet space. Journal of Cerebral Blood Flow and Metabolism, 20:1610–1618, 2001.
[169] M. Welling and M. Weber. A constrained EM algorithm for Independent Component Analysis. Neural Computation, 13:677–689, 2001.
[170] J. Winn and C. Bishop. Variational message passing. J. Mach. Learn. Res., 6:661–694, December 2005.
[171] M. W. Woolrich, M. Jenkinson, J. M. Brady, and S. M. Smith. Fully Bayesian spatio-temporal modeling of FMRI data. IEEE Trans. Med. Imaging, 23(2):213–231, Feb. 2004.
[172] K. Worsley and K. Friston. Analysis of fMRI time series revisited—again. NeuroImage, 2:173–181, 1995.
[173] X. Shen and F. G. Meyer. Low-dimensional embedding of fMRI datasets. NeuroImage, 41:886–902, 2008.
[174] A. L. Yuille and A. Rangarajan. The Concave-Convex Procedure (CCCP). Neural Computation, 15(4):915–936, Apr. 2003.
[175] R. S. Zemel. A minimum description length framework for unsupervised learning. PhD thesis, University of Toronto, Dept. of Computer Science, 1993.
[176] M. Zibulevsky and B. A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13:863–882, April 2001.
[177] M. Zibulevsky, B. A. Pearlmutter, P. Bofill, and P. Kisilev. Blind source separation by sparse decomposition in a signal dictionary. In S. J. Roberts and R. M. Everson, editors, Independent Components Analysis: Principles and Practice, pages 181–208. Cambridge University Press, 2001.
[178] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 67(2):301–320, 2005.