Bayesian Methods for Sparse
Data Decomposition and Blind
Source Separation
Evangelos Roussos
Wolfson College
University of Oxford
A thesis submitted for the degree of
Doctor of Philosophy
Trinity Term 2012
Contents
1 Introduction 1
2 Principles of Data Decomposition 12
2.1 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Probability Space . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Three Simple Rules . . . . . . . . . . . . . . . . . . . . 14
2.2 Second order decompositions: Singular value decomposition
and Principal Component Analysis . . . . . . . . . . . . . . . 16
2.3 Higher-order decompositions: Independent Component Analysis 18
2.3.1 Probabilistic Inference for ICA . . . . . . . . . . . . . . 24
2.4 Sparse Decompositions . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Parsimonious representation of data . . . . . . . . . . . 30
2.4.2 Sparse Decomposition of Data Matrices . . . . . . . . . 35
2.5 The Importance of using Appropriate Latent Signal Models . 41
2.6 Data modelling: A Bayesian approach . . . . . . . . . . . . . 46
3 An Iterative Thresholding Approach to Sparse Blind Source
Separation 50
3.1 Blind Source Separation as a Blind Inverse Problem . . . . . . 50
3.2 Blind Source Separation by Sparsity Maximization . . . . . . . 58
3.2.1 Exploiting Sparsity for Signal Reconstruction . . . . . 58
3.2.2 Exploiting Sparsity for Blind Source Separation: Sparse
Component Analysis . . . . . . . . . . . . . . . . . . . 60
3.3 Sparse Component Analysis by Iterative Thresholding . . . . . 76
3.3.1 Iterative Thresholding . . . . . . . . . . . . . . . . . . 76
3.3.2 The SCA-IT Algorithm . . . . . . . . . . . . . . . . . . 81
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 85
3.4.1 Natural Image Separation . . . . . . . . . . . . . . . . 86
3.4.2 Blind Separation of More Sources than Mixtures . . . . 92
3.4.3 Extracting Weak Biosignals in Noise . . . . . . . . . . 100
3.5 Setting the Threshold . . . . . . . . . . . . . . . . . . . . . . . 120
4 Learning to Solve Sparse Blind Inverse Problems using Bayesian
Iterative Thresholding 124
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2 Bayesian Inversion . . . . . . . . . . . . . . . . . . . . . . . . 129
4.3 Sparse Component Analysis Generative Model . . . . . . . . . 132
4.3.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . 132
4.3.2 Construction of a Hierarchical Graphical Model for
Blind Inverse Problems under Sparsity Constraints . . 133
4.3.3 Latent Signal Model: Wavelet-based Functional Prior . 142
4.4 Inference and Learning by Variational Lower Bound Maxi-
mization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.4.1 Variational Lower Bounding . . . . . . . . . . . . . . . 148
4.4.2 Estimation Equations . . . . . . . . . . . . . . . . . . . 154
4.4.3 Computing the SCA free energy . . . . . . . . . . . . . 166
4.4.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.5.1 Comparison of ordinary and Bayesian iterative thresh-
olding . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 171
4.6.1 BIT-SCA: Simulation Results . . . . . . . . . . . . . . 171
4.6.2 Extraction of weak biosignals in noise . . . . . . . . . . 177
5 Variational Bayesian Learning for Sparse Matrix Factoriza-
tion 180
5.1 Sparse Representations and Sparse Matrix Factorization . . . 180
5.2 Bayesian Sparse Decomposition Model . . . . . . . . . . . . . 190
5.2.1 Graphical modelling framework . . . . . . . . . . . . . 192
5.2.2 Modelling sparsity: Bayesian priors and wavelets . . . . 195
5.2.3 Hierarchical mixing model and Automatic Relevance
Determination. . . . . . . . . . . . . . . . . . . . . . . 200
5.2.4 Noise level model . . . . . . . . . . . . . . . . . . . . . 201
5.3 Approximate Inference by Free Energy Minimization . . . . . 201
5.3.1 Variational Bayesian Inference . . . . . . . . . . . . . . 201
5.3.2 Posteriors for the variational Bayesian sparse matrix
factorization model . . . . . . . . . . . . . . . . . . . . 204
5.3.3 Free Energy . . . . . . . . . . . . . . . . . . . . . . . . 210
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
5.4.1 Auditory stimulus data set . . . . . . . . . . . . . . . . 212
5.4.2 Auditory-visual data set . . . . . . . . . . . . . . . . . 214
5.4.3 Visual data set: An Experiment with Manifold Learning . . 214
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6 Epilogue 222
List of Figures
1.1 A sketch of brain activations captured by functional MRI . . . 5
2.1 Seeking structure in data. . . . . . . . . . . . . . . . . . . . . 20
2.2 The Olshausen and Field model as a network . . . . . . . . . . 31
2.3 Activities resulting from the model of Olshausen and Field . . 32
2.4 Learned receptive fields (filters) from the sparse coding algo-
rithm of Olshausen and Field. . . . . . . . . . . . . . . . . . . 33
2.5 Geometric interpretation of sparse representation. . . . . . . . 34
2.6 Clustering of a sparse set of points on the unit hypersphere. . 35
2.7 A geometric intuition into sparse priors. . . . . . . . . . . . . 38
2.8 Effect of an incorrect source model specification. . . . . . . . . 42
2.9 Effect of an incorrect source model specification for a blind
image separation problem. . . . . . . . . . . . . . . . . . . . . 43
2.10 ‘Cameraman’ standard test image. . . . . . . . . . . . . . . . 45
2.11 Density modelling using the MoG model. . . . . . . . . . . . . 45
3.1 Conceptual scheme for forward/inverse problems. . . . . . . . 51
3.2 Sparsity function and derivative of the model of Olshausen
and Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Wavelet coefficients of the ‘Cameraman’ image. . . . . . . . . 64
3.4 A signal exhibiting smooth areas and localized lower dimen-
sional singularities. . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5 A family of ℓp norms for various values of p. . . . . . . . . . . 70
3.6 Three distributions of wavelet coefficients with equal variance
but different kurtosis . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 How the ℓ1 penalty leads to sparse representations . . . . . . . 73
3.8 Soft thresholding function, corresponding to ℓ1 penalization. . 79
3.9 An illustration of the Iterative Thresholding (IT) algorithm as
a double mapping operation among signal spaces. . . . . . . . 82
3.10 Sparse Component Analysis model as a layered “feedforward”
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.11 ‘Cows & Butterfly’ dataset: original sources, downscaled to
64× 64 pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.12 IFA results: Separated image of (a) ‘Cows’ and (b) ‘Butterfly’.
While the source images are recovered, some noise remains. . . . 89
3.13 Scatterplot of the ‘Cows & Butterfly’ estimated images, after
having been unmixed by IFA. . . . . . . . . . . . . . . . . . . 89
3.14 Probability density functions of the sources as estimated from
the IFA algorithm. M = 4 Gaussian components were used for
each source model. The brown curve is the estimated mixture
density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.15 SCA-IT results after convergence of the algorithm: separated
images of ‘Cows’ and ‘Butterfly’. . . . . . . . . . . . . . . . . 90
3.16 Scatterplot of the original and SCA-IT unmixed images. . . . 91
3.17 The ‘Cows & Butterfly’ estimated source space, (s1,n, s2,n), n = 1, . . . , N,
after convergence of the SCA-IT algorithm. . . . . . . . . . . 91
3.18 Empirical histograms of the ‘Cows & Butterfly’ sources after
convergence of the SCA-IT algorithm. . . . . . . . . . . . . . 92
3.19 ‘Terrace’ (reflections) dataset: observations. . . . . . . . . . . 92
3.20 ‘Terrace’ dataset: Scatterplot of observations, (x1,n, x2,n). . . 93
3.21 ‘Terrace’ dataset: Estimated (unmixed) source images. . . . . 93
3.22 ‘Terrace’ dataset: Scatterplot of estimated sources after 100
iterations of SCA/IT. . . . . . . . . . . . . . . . . . . . . . . . 93
3.23 ‘Terrace’ dataset: Empirical histograms of estimated sources
after 100 iterations of SCA/IT. . . . . . . . . . . . . . . . . . 94
3.24 Overcomplete blind source separation: true sources, sl, l = 1, . . . , 3
(unknown). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.25 Overcomplete blind source separation: observed sensor signals,
xd, d = 1, 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.26 Overcomplete blind source separation: scatterplot of sensor
signals, (x1,n, x2,n). . . . . . . . . . . . . . . . . . . . . . . . 97
3.27 Overcomplete blind source separation: spectrogram, Xl(ωk, tm),
of the sensor signals. . . . . . . . . . . . . . . . . . . . . . . . 98
3.28 Overcomplete blind source separation: scatterplot of sensor
signals in the transform domain, (X1,n, X2,n), where n = n(ωk, tm)
is the single index that results from flattening each two-dimensional
spectrogram image into a long vector. . . . . . . . . . . . . . . 98
3.29 Overcomplete blind source separation: estimated sources, sl,
l = 1, . . . , 3, by the SCA/IT algorithm. . . . . . . . . . . . . . 99
3.30 Simulated fMRI data. Unmixing the 4 mixtures of rectangular
components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.31 Results of the SCA/IT model applied on the simulated data
of subsection 3.4.3, for Cases 1–4 . . . . . . . . . . . . . . . . 110
3.32 Spatial map from SPM . . . . . . . . . . . . . . . . . . . . . . 112
3.33 Timecourse from SPM . . . . . . . . . . . . . . . . . . . . . . 112
3.34 Thresholded spatial map from SPM . . . . . . . . . . . . . . . 113
3.35 Spatial maps s1, s2 from ICA. . . . . . . . . . . . . . . . . . . 114
3.36 Timecourses a1, a2 from ICA . . . . . . . . . . . . . . . . . . 114
3.37 Spatial map s1 from SCA . . . . . . . . . . . . . . . . . . . . 116
3.38 Timecourse a1 from SCA . . . . . . . . . . . . . . . . . . . . . 116
3.39 Correlation matching . . . . . . . . . . . . . . . . . . . . . . . 118
3.40 Receiver Operating Characteristic (ROC) curves . . . . . . . . 119
3.41 Schematic representation of the Iterative Thresholding algorithm
for blind signal separation as implemented in our system. . . . . 123
4.1 The Bayesian solution to an inverse problem is a posterior
distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.2 A directed acyclic graph (DAG) representation for generative
latent variable models. . . . . . . . . . . . . . . . . . . . . . . 133
4.3 Structural transformations of the standard component analy-
sis graphical model. . . . . . . . . . . . . . . . . . . . . . . . . 135
4.4 An image exhibiting smooth areas and structural features. . . 146
4.5 Heavy-tailed distributions for wavelet coefficients. . . . . . . . 147
4.6 Thresholding (shrinkage) functions. . . . . . . . . . . . . . . . 160
4.7 BIT-SCA: Source Reconstruction . . . . . . . . . . . . . . . . 173
4.8 BIT-SCA: Scatterplot . . . . . . . . . . . . . . . . . . . . . . . 173
4.9 BIT-SCA: Wavelet coefficients . . . . . . . . . . . . . . . . . . 174
4.10 BIT-SCA: Empirical histograms . . . . . . . . . . . . . . . . . 174
4.11 BIT-SCA: Evolution of the elements A . . . . . . . . . . . . . 175
4.12 Learning curve: Evolution of the Amari distance . . . . . . . . 176
4.13 Evolution of the NFE . . . . . . . . . . . . . . . . . . . . . . . 177
4.14 Spatial map s1 from BIT-SCA . . . . . . . . . . . . . . . . . . 178
4.15 Timecourse a1 from BIT-SCA . . . . . . . . . . . . . . . . . . 178
4.16 Receiver Operating Characteristic curve . . . . . . . . . . . . 179
5.1 Matrix factorization view of component analysis. . . . . . . . . 184
5.2 Geometric interpretation of sparse representation. . . . . . . . 188
5.3 The probabilistic graphical model for feature-space component
analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.4 Variational Bayesian Sparse Matrix Factorization graphical
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.5 Sparse mixture of Gaussians (SMoG) prior for the coefficients
cl,λ. Blue/green curves: Gaussian components; thick black
curve, mixture density, p(cl,λ). . . . . . . . . . . . . . . . . . . 197
5.6 A directed graphical model representation of a sparse mixture
of Gaussian (SMoG) densities. . . . . . . . . . . . . . . . . . . 198
5.7 Time courses and corresponding spatial maps from applying
the variational Bayesian sparse decomposition model to an
auditory fMRI data set. . . . . . . . . . . . . . . . . . . . . . 213
5.8 Time courses and corresponding spatial maps from apply-
ing the variational Bayesian sparse decomposition model to
a visual-auditory fMRI data set. . . . . . . . . . . . . . . . . . 215
5.9 Evolution of the negative free energy of the VB-SMF model . 219
5.10 Variances of the columns of the mixing matrix. . . . . . . . . . 220
5.11 Time course and corresponding spatial map from applying the
variational Bayesian sparse decomposition model to fMRI data
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
5.12 Receiver operating characteristic (ROC) curve for the LPP/VB-
SMF experiment . . . . . . . . . . . . . . . . . . . . . . . . . 221
Abstract
In an exploratory approach to data analysis, it is often useful to consider
the observations as generated from a set of latent generators or ‘sources’ via
a generally unknown mapping. Reconstructing sources from their mixtures is
an extremely ill-posed problem in general. However, solutions to such inverse
problems can, in many cases, be achieved by incorporating prior knowledge
about the problem, captured in the form of constraints.
This setting is a natural candidate for the application of the Bayesian
methodology, allowing us to incorporate “soft” constraints in a natural man-
ner. This Thesis proposes the use of sparse statistical decomposition methods
for exploratory analysis of datasets. We make use of the fact that many nat-
ural signals have a sparse representation in appropriate signal dictionaries.
The work described in this Thesis is mainly driven by problems in the analysis
of large datasets, such as those from functional magnetic resonance imaging
of the brain for the neuro-scientific goal of extracting relevant ‘maps’ from
the data.
We first propose Bayesian Iterative Thresholding, a general method for
solving blind linear inverse problems under sparsity constraints, and we ap-
ply it to the problem of blind source separation. The algorithm is derived
by maximizing a variational lower-bound on the likelihood. The algorithm
generalizes the recently proposed method of Iterative Thresholding. The
probabilistic view enables us to automatically estimate various hyperparam-
eters, such as those that control the shape of the prior and the threshold,
in a principled manner.
We then derive an efficient fully Bayesian sparse matrix factorization
model for exploratory analysis and modelling of spatio-temporal data such
as fMRI. We view sparse representation as a problem in Bayesian inference,
following a machine learning approach, and construct a structured genera-
tive latent-variable model employing adaptive sparsity-inducing priors. The
construction allows for automatic complexity control and regularization as
well as denoising.
The performance and utility of the proposed algorithms are demonstrated
on a variety of experiments using both simulated and real datasets. Exper-
imental results with benchmark datasets show that the proposed algorithm
outperforms state-of-the-art tools for model-free decompositions such as in-
dependent component analysis.
Acknowledgements
It has been quite a journey, from Geomatics Engineering, to Neural Networks and Machine Learning, to Neuroscience image analysis. During this trip I have been lucky enough to meet, interact with, and collaborate with many excellent scientists. First and foremost, I would like to thank my supervisor, Prof. Steven Roberts, a fantastic scientist, engineer, and teacher, for introducing me to the wonderful world of ICA and graphical models and for his constant advice, encouragement, and support. I would like to thank my advisor at Princeton University, Prof. Ingrid Daubechies, the mother of wavelets, for asking the important and difficult scientific questions, for introducing me to the method of Iterative Thresholding, and for teaching me all that I know about wavelets. I would like to thank my mentor, Prof. Michael Sakellariou, a true Renaissance Man, for sharing with me deep insights about Engineering in the 21st Century, for his constant support, and for discussing both science and life with me. I would like to thank Prof. Demetre Argialas for teaching me everything I know about Remote Sensing and Geomorphometry and for his kindness. I would like to thank all the people at Oxford's Pattern Analysis and Machine Learning group, and especially Dr. Iead Rezek for scientific discussions, his help and support, and his friendship, Dr. Michael Osborne for discussions and for being kind enough to share his Thesis style file, and Dr. Steve Reece for discussions and fun. I would also like to thank the people at Princeton's Program in Applied and Computational Mathematics and the Princeton Neuroscience Institute, and especially Professors Jim Haxby and Jonathan Cohen for insightful discussions about functional MRI and neuroscience in general, as well as Sylvain Takerkart and Michael Benharrosh for discussions and fun. Last, but not least, I would like to offer my deepest gratitude to my family and friends for their love and for making all this possible.
Chapter 1
Introduction
In an exploratory approach to data analysis, it is often useful to consider the
observations as intricate mixtures generated from a set of latent generators, or
‘sources’, via a generally unknown mapping [149]. In the neuroimaging field,
for example, McKeown et al. [125] proposed the decomposition of spatio-
temporal functional magnetic resonance imaging (fMRI) data of the brain
into independent spatial components, expressing the observations as a lin-
ear superposition of products of ‘spatial maps’ and time-courses. In weather
data mining, Basak et al. [14] proposed to model weather phenomena as mix-
tures of ‘stable activities’ where the variation in the observed variables can
be viewed as a mixture of several “independently occurring” spatio-temporal
signals with different strengths. Many other physical and biological phe-
nomena can be better studied in a ‘latent’ signal space that best reveals the
internal structure of the data.
The problem of recovering the sources from a mixture is an extremely ill-
posed one, especially for the noisy ‘overcomplete’ case, where we have more
sources than observations1. Solutions to such inverse problems can, in many
1 In the context of the problem of mixture separation, the term ‘overcomplete’ refers to the case where the number of hidden sources is greater than the number of sensors.
cases, be achieved by incorporating prior knowledge about the problem, cap-
tured in the form of constraints [45]. From a machine learning perspective,
such problems fall in the unsupervised learning class of problems, since only
the “responses” but not the “stimuli” are observed in any training set. In
such cases, an internal data organization criterion, such as Infomax [113],
can be used; this acts implicitly as a regularizer, making the problem well-
posed2. Problems that unsupervised learning is aimed at, and which are relevant
to this text, are (Ghahramani and Roweis, [74]):
• Finding hidden “causes” or sources in data
• Data density modelling
• Dimensionality reduction
• (Soft) Clustering
Component analysis (CA) is a data analysis framework that intends to
extract the underlying hidden factors, sources, or “causes”3 that generate
the observed data. CA can be seen as a class of methods that attempt to
solve the following complementary problems:
• Blind separation of hidden sources (BSS) under an unknown mixing—
an inverse problem. The cocktail party problem in acoustic signal pro-
cessing and systems and cognitive neuroscience is the prototypical ap-
plication of BSS. It refers to the ability of the auditory system to sep-
In the linear mixing scenario (see Sect. 2.3), this has the consequence that the number of columns in the mixing operator is greater than the number of its rows. The system of equations formed is, therefore, underdetermined; see Eq. 2.3 and Fig. 5.1 later.
2 Another example of a classical organization principle, coming from the pattern recognition field, is to minimize intra-class distances of patterns while maximizing inter-class ones. This principle is suitable for classification problems.
3 Where the term ‘causal’ is taken to mean the relation ‘is-influenced-by’ here. We do not try to model causality in the sense of Pearl [140], for example, in this text.
arate and extract “interesting” sources, such as the voice of one’s in-
terlocutor, from a mixture of conversations and background noises4.
• Representation of data in new coordinates such that the resulting
signals have certain desired properties, such as being maximally inde-
pendent or, in our case, maximally sparse. Indeed, geometrically, CA
can be seen as a method for finding an intrinsic coordinate frame in
data space (see Figs 2.1 and 2.5 later).
• Soft clustering. Miskin provides an additional view of component
analysis in [94], as an alternative to ‘hard’ multi-way clustering. Under
this view, the data vectors are decomposed as linear combinations of
hidden prototypical vectors (the columns of a “mixing” matrix) modu-
lated by a weight matrix containing the amount of contribution of each
basis to the mixture5 (corresponding to the “sources”). An example of
soft clustering by source separation can be found in [149].
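The soft-clustering reading above can be made concrete with a tiny numerical sketch. The prototype vectors and membership weights below are hypothetical values chosen purely for illustration: the columns of A play the role of the prototypes, and each column of S holds one data point's memberships (an indicator vector for hard clustering, fractional weights for soft clustering).

```python
import numpy as np

# Hypothetical prototype ("basis") vectors as columns of a mixing matrix A.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hard clustering: each source column is an indicator vector,
# so every data point copies exactly one prototype.
S_hard = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0]])
X_hard = A @ S_hard
assert np.allclose(X_hard[:, 0], A[:, 0])   # point 0 = prototype 0

# Soft clustering: fractional memberships blend the prototypes.
S_soft = np.array([[0.7, 0.2, 0.0],
                   [0.3, 0.8, 1.0]])
X_soft = A @ S_soft
assert np.allclose(X_soft[:, 0], 0.7 * A[:, 0] + 0.3 * A[:, 1])
```

In both cases the data matrix factors as X = A S; the views differ only in the constraints placed on the columns of S.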
There have been several algorithms for CA and BSS in the statistics and
machine learning literature, from information maximization [16], maximum
likelihood [118], and Bayesian [105], [36] points of view, as an energy-based
model [163], as well as several others from a signal processing perspective
(see for example [87]). This text proposes the use of sparse statistical de-
composition methods [136], [109], [76], [111] for the exploratory analysis of
data sets. We will consider blind inverse problems such as blind source
separation and data decomposition under an unsupervised machine learning
framework, for the purpose of learning the parameters and hyperparameters
of the models from data. In particular, we shall approach the above problems
4 While the brain has solved the problem for millions of years, developing computational methods that perform equally well is an open research problem.
5 The case of pure mixtures corresponds to source vectors of the form sn = (0, . . . , 0, 1, 0, . . . , 0) for the n-th data point, with 1 in the l-th position and zeroes in all other positions, picking the l-th prototypical vector. Soft clustering corresponds to the case where the elements sn,l may be non-integers.
from a hierarchical, probabilistic modelling perspective and the problem of
estimating the sources as an inference problem in the Bayesian framework
[147], [99].
Sparse Statistical Decomposition Methods
Problem and Motivating Application
The work described in this text is mainly motivated by problems in the anal-
ysis of high-dimensional spatiotemporal datasets, such as those obtained from
functional magnetic resonance imaging of the brain for the neuro-scientific
goal of extracting relevant ‘maps’ from the data. This can be stated as a
‘blind’ source separation problem [125]. Recent experiments in the field of
neuroscience [82], [46] show that these maps are sparse, in some appropriate
sense. Figure 1.1 provides a generic description of the sought-for maps in
a nutshell. A more detailed overview of the problem of decomposing fMRI
data will be given in Sect. 3.4.3. The separation problem can be formu-
lated as a bilinear decomposition model solved by component analysis-type
algorithms, viewed as techniques for seeking sparse components, provided
appropriate distributions for the sources have been assigned [46]. In contrast
to many classical CA algorithms, where an unmixing matrix is first learned
and from that the sources are deterministically estimated (see e.g. [15]),
the algorithms proposed here perform inference and learning of the mixing
operator and the sources simultaneously.
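The bilinear decomposition just described can be sketched numerically. The dimensions below (T time points, V voxels, L components) are made up for illustration and are not taken from any experiment in this thesis; the columns of A stand for component time-courses and the rows of S for spatially sparse maps, so the observed data is their superposition plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, L = 100, 500, 3              # time points, voxels, components (made up)

A = rng.normal(size=(T, L))        # columns: component time-courses
S = np.zeros((L, V))               # rows: spatial maps, sparse over voxels
S[0, :20] = 1.0
S[1, 200:230] = -1.5
S[2, 450:460] = 2.0

X = A @ S + 0.01 * rng.normal(size=(T, V))   # observations: mixture + noise

# Up to the small noise floor, X has rank L: only L singular values are large.
sv = np.linalg.svd(X, compute_uv=False)
assert sv[L - 1] > 10 * sv[L]
```

The decomposition problem is to recover A and S given only X; with the inference-and-learning approach taken here, both are estimated simultaneously rather than via a separately learned unmixing matrix.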
Inspired by problems in imaging neuroscience, namely the extraction
of activations from brain fMRI data, Daubechies et al. [46] conducted a
large-scale experiment designed to test the independence assumption as the
underlying reason for the apparent success of independent component anal-
Figure 1.1: A sketch of brain activations (red areas) as a result of environmental stimuli, captured by functional MRI, which is one of the patterns we want to extract with our models. In this figure, which is traced from Haxby et al. [82], a brain slice in the transverse (horizontal) plane is shown. Our aim is to capture the neuroscientifically desirable properties of smoothness and localization of the activations and the sparsity of the spatial maps with respect to appropriately chosen signal dictionaries. This will be done by utilizing ‘soft’ constraints as explained later in the text. Note that the models should also allow for distributed representations to be extracted.
ysis decompositions6 in fMRI. They tested various parameters, such as the
effect of localisation (overlap) of the sources in the quality of separation, the
size of the region of interest (ROI, i.e. “search window”), and the effect of the
source prior, in a well-controlled experiment, both simulated and in-vivo, for
two ‘classical’ ICA methods, InfoMax and FastICA7 (Benharrosh et al., [17];
Golden, [77]; Daubechies et al., [46]). It was shown there that, while ICA
is robust in general, rather surprisingly, both methods occasionally failed to
separate sources that were designed to be independent while, on the contrary,
they managed to separate sources that were dependent, casting doubt on the
validity of the independence approach for brain fMRI. Real fMRI data are
6 Independent component analysis (ICA) is a class of methods designed to separate multivariate signals into components that are assumed to be statistically independent. The ICA method will be formally introduced in Chapter 2.
7 More general ICA algorithms, which assume less about the components, can separate into independent components mixtures for which Infomax and FastICA fail [152], [151]. Nevertheless, these two are the most used ICA algorithms for brain fMRI.
highly unlikely to be generated by brain processes that are independent, in
the mathematical sense at least, as demonstrated by the recent interest in ex-
tracting networks of interconnected “modules”. However, Daubechies et al.
noted that ICA algorithms work well for brain fMRI decompositions if the
components have sparse distributions8. They concluded that independence
is not the right mathematical framework for blind source separation in fMRI
and proposed the development of new mathematical tools based on explicitly
optimizing for sparsity. An analogous observation was made by Donoho and
Flesia [52] in the context of image processing.
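An operational signature of the sparse, heavy-tailed component distributions noted above (generalized exponentials with α < 2) is positive excess kurtosis. The sanity check below is purely illustrative, with arbitrary sample sizes and thresholds; it contrasts a Laplacian draw (α = 1, sparse) with a Gaussian one (α = 2, not sparse).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def excess_kurtosis(x):
    """Sample excess kurtosis: 0 for a Gaussian, positive for heavy tails."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

gauss = rng.normal(size=n)      # alpha = 2: light tails, not sparse
laplace = rng.laplace(size=n)   # alpha = 1: heavy tails, sparse-friendly

assert abs(excess_kurtosis(gauss)) < 0.1
assert excess_kurtosis(laplace) > 2.0   # true value for a Laplacian is 3
```

Distributions with mass concentrated near zero plus heavy tails are exactly the shape a sparsity-enforcing prior assigns to the expansion coefficients later in this text.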
The intuition gained from the above problems is applicable to a much
broader range of other scientific data-sets. Sparsity is a desirable property,
aiding interpretation of results. Sparsity-inducing constraints help us form
compressed representations of large, redundant, and noisy datasets by retaining
relevant information and discarding the irrelevant as “noise”. Sparsity
also facilitates model interpretability by revealing the most selective dimen-
sions in the data. In a machine learning context, this estimation can be done
in a data-driven fashion. The work presented here continues on this theme
of sparse data decompositions.
Bayesian inference provides a powerful methodology for data analysis
[156] by providing a principled method for using our domain knowledge,
making all our assumptions explicit [97]. The above setting is a natural
candidate for the application of the Bayesian methodology, allowing us to
incorporate ‘soft’ constraints in a natural manner, formulated as prior distri-
butions. Moreover, its probabilistic foundation allows for building hierarchi-
cal models that describe the data generation process, where hard decisions
about certain parameter settings can be deferred to higher levels of inference
and/or learned from the data itself. In a full Bayesian approach, parameters
8 In particular, ‘generalized exponential’ distributions, of the form p(s) ∝ exp(−β|s|^α) with α ≠ 2 — see the Introduction of ref. [46].
whose value is unknown are ‘integrated out’, a process which is referred to as
marginalization. Bayesian modelling allows the incorporation of uncertainty
in both the data and the model into the estimation process, thus prevent-
ing ‘overfitting’. It essentially builds families of models for learning and
inference, thus abstracting the model space. The outputs of fully Bayesian
inference are uncertainty estimates over all variables in the model, given the
data, i.e. posterior distributions.
Proposed approach
The general theme we are going to discuss in this text will be based on
choosing appropriate priors for the latent variables (i.e. the unknown signals)
that explicitly reflect our domain knowledge, prior beliefs, or expectations
about the solution. In particular, the main idea can be stated as
Statement 1 For sparse sources, sparsity, rather than independence, is the
right constraint to be imposed.
This will be captured in appropriate optimization functionals and in specially
constructed algorithms. Another central idea is that
Statement 2 Descriptive features lead to sparse representations.
In many cases sparsity is induced by transforming the physical-domain signals
into a feature-space with “maximum representability”. We note here that
sparsity is a relative property, with respect to a given ‘dictionary’ of signals.
This will become more concrete in section 3.2.2. In particular, in that section
we will answer the question of which dictionary is likely to result in sparse
representations of our particular data in question. The above statements
form the underlying hypothesis of this Thesis. To test the above ideas we
will develop models and algorithms for Bayesian data decomposition under
sparsity constraints.
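Statement 2, and the remark that sparsity is relative to a dictionary, can be made concrete with a toy signal (hypothetical, chosen only for illustration): it is dense in the sample ("physical") domain yet carried by just two coefficients in a Fourier dictionary.

```python
import numpy as np

n = 256
t = np.arange(n)
# A signal that is dense in the sample domain: two sinusoids, periodic over n.
x = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.sin(2 * np.pi * 12 * t / n)
dense_frac = np.mean(np.abs(x) > 1e-8)    # nearly every sample is nonzero

# ...but sparse in a Fourier dictionary: only two coefficients are active.
X = np.fft.rfft(x)
sparse_frac = np.mean(np.abs(X) > 1e-6 * np.abs(X).max())

assert dense_frac > 0.9
assert sparse_frac < 0.05                 # 2 active bins out of n//2 + 1
```

The same signal is thus "sparse" or not depending on the dictionary used to represent it, which is why choosing a dictionary suited to the data (Sect. 3.2.2) matters.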
In particular, in the third and fourth chapters of this Thesis, we propose
Bayesian Iterative Thresholding, a method for solving blind inverse problems
under sparsity constraints, with emphasis on the multivariate problem of
blind source separation. In contrast to optimizing for maximally independent
components, as in independent component analysis, we capture this more
natural constraint by explicitly optimizing for maximally sparse components.
We present two, complementary, formulations of the method. In the
third chapter, the problem is initially formulated in an energy minimization
framework such that it enforces sparseness and smoothness of the inferred
signals and a novel algorithm for blind source separation, iterative thresh-
olding (Daubechies et al., [45]), is presented. By exploiting the sparseness of
the components in appropriate dictionaries of functions and the choice of a
sparsity-enforcing prior on the associated expansion coefficients, we can ob-
tain very sparse representations of signals. However, this leads to a difficult
optimization problem. In order to solve it we use a variational formulation
that leads to an iterative algorithm that alternates between estimating the
sources and learning the mixing matrix. The formulation of the problem
using functional criteria, as an inverse problem, was first suggested by Ingrid
Daubechies while the author was a Visiting Fellow at Princeton University’s
Program of Applied and Computational Mathematics. It was inspired by
the needs of neuroscientists working with fMRI data, and in particular it
is the result of discussions with the members of the Princeton Neuroscience
Institute, for the problem of designing better algorithms for extracting mean-
ingful activations. It was given its energy interpretation and current form by
the author. In particular, I contributed the solution, implementation, and
experimental results. In addition, I proposed a new way of estimating the
critical threshold parameter.
We then employ Bayesian inversion ideas in the fourth chapter and show
how the Bayesian formulation generalizes the IT algorithm of Daubechies et
al. [45]. We cast blind inverse problems as statistical inference problems in
a Bayesian framework, and show how these two formulations, deterministic
and probabilistic, are related. The central modelling idea is that of structural
priors, in the sense of a ‘parametric free-form’ solution (Sivia, [156]). The
structural prior used captures the desired properties of localization, smooth-
ness, and sparseness of the unknown signals and is adaptively learned from
the data. We then derive a variational bound on the data likelihood and
propose a (structured) mean-field variational posterior. This enables us to
derive analytic formulas for the estimation of the latent signals. Under this
formulation, a generalized form of iterative thresholding emerges as a subrou-
tine of the full set of estimation equations of the unknowns of the problem.
The probabilistic view enables us to automatically estimate various hyperpa-
rameters, such as those that control the shape of the prior and the threshold,
in a principled manner; setting these manually can be a delicate and laborious
process.
In the fifth chapter, we propose a feature-space sparse matrix factoriza-
tion (SMF) model for spatiotemporal data, transforming the signals into a
domain where the modelling assumption of sparsity of their expansion coef-
ficients with respect to an auxiliary dictionary is natural [176], [91], [152].
Decomposition is performed completely in the wavelet domain for this model.
We will take the view of sparse representation as sparse matrix factorization
here. Blind sparse representation algorithms (i.e. sparse representations
in which both the dictionary and the coefficients of the representation are
unknown, sometimes called “model-free”) can be naturally formulated as
bilinear decomposition models, where the observations are modelled as a fac-
torization of a ‘mixing’ matrix times a ‘source’ matrix (see Fig. 5.1). That is,
the observations are sparsely represented, via the source coefficients, in the
base comprised of the mixing vectors. In this text we take the view of SMF
as a bilinear generative model, and the problem of estimating the factors as
a statistical inference problem in the Bayesian framework again [99].
An important goal of this model is to be a transparent alternative to
sparse representation models such as those of Li et al. [112], [111]. Our ap-
proach to achieving this is by using three ingredients: graphical probabilistic
models, automatic determination of each basis element’s relevance to the
representation, and Bayesian model selection. We follow a graphical mod-
elling formalism and use hierarchical source and mixing models. We then
apply Bayesian inference to the problem. This allows us to perform model
selection in order to infer the complexity of the decomposition, as well as
automatic denoising.
In many datasets, the intrinsic dimensionality of the data is much lower
than that of the ‘ambient’ observation space. For the case of Gaussian data
(i.e. data that can be described geometrically by a hyper-ellipsoid), the
principal axes, corresponding to the largest eigenvalues of the data covari-
ance matrix, contain the signal space, that is “most of the information”; the
rest are usually regarded as noise and can be discarded. For non-Gaussian
data, care has to be taken not to discard useful information that is of low
signal power. A prime example of this is in neuroimaging datasets, where
the ‘activations’ power is just a few percent (e.g. on the order of 1%–5%)
higher than the ‘baseline’ activity [106]. The model proposed here utilizes
a technique developed in the machine learning community that tries to pick
those “important” dimensions in the data, and is based on the principle of
parsimony.
The requirement of analyzing large and/or high-dimensional, real-world
datasets makes the use of sampling methods difficult and leads to the need
for an alternative, efficient Bayesian formulation. Since exact inference and
learning in such a model is intractable, we follow a structured variational
Bayesian approach in the conjugate-exponential family of distributions, for
efficient unsupervised learning in multi-dimensional settings. The algorithm
is intended to be used as an efficient fully Bayesian method for exploratory
analysis and modelling of scientific and statistical datasets and for decom-
posing multivariate timeseries, possibly embedded in a larger data analysis
system.
Using fMRI datasets as examples of complex, high-dimensional, spa-
tiotemporal data, we show that the proposed models outperform standard
tools for model-free decompositions such as independent component analysis.
Chapter 2
Principles of Data Decomposition
This chapter provides an introductory background on data decomposition
methods. We start from the classical algebraic method of singular value de-
composition and then introduce principal and independent component anal-
ysis. The chapter continues with the main subject of this Thesis, sparse
representation and decomposition. Finally an introduction to Bayesian in-
ference is given. Specific computational methods for Bayesian inference will
be introduced in the following chapters as needed.
2.1 Probability Theory
When modelling complex systems we are unavoidably faced with imperfect or
missing information, especially in the measurement and information sciences.
This may have several causes, but it is mainly due to
• The cost of obtaining and processing vast amounts of information,
• Inherent system complexity.
Probability theory is a conceptual and computational framework for rea-
soning under uncertainty. Probabilities model uncertainty regarding the oc-
curence of random events. Assigning probability measures on uncertain quan-
tities reflects precisely our lack of information about the quantities at hand.
According to Cox’s theorem [40], probability is the only consistent, universal
logic framework for quantitatively reasoning under uncertainty. Moreover,
probability theory offers a consistent framework for modelling and inference.
Jaynes [97] viewed probability theory as a unifying tool for plausible rea-
soning in the presence of uncertainty. From a modeler’s point of view, the
greatest practical advantage of probability theory is that it offers modular-
ity and extensibility: probability theory acts as “glue” for linking different
models together.
2.1.1 Probability Space
The axiomatic formulation of probability starts by defining a probability
space, which is a tuple (Ω, P ) that describes our idea about uncertainty with
respect to a random experiment. It defines:
• A sample space, Ω, of possible outcomes, ωi, of a random experiment
and
• A probability measure, P , which describes how likely an outcome is.
Now, let 𝒜 be a collection of subsets of Ω, called random events. Then for
A ∈ 𝒜 the two following conditions must hold:
• Probabilities must be non-negative, P (A) ≥ 0, and P (Ω) = 1,
• Probabilities must be additive: for two disjoint events, A, B,
P(A ∪ B) = P(A) + P(B) .
We also define the conditional probability, which can be thought of as "a
probability within a probability",
P(A|B) = P(A ∩ B) / P(B) ,   P(B) ≠ 0 .
Then random variables (r.v.'s) are defined as functions from Ω to a range,
R, e.g. a subset of ℝ or ℕ, etc. These can, inversely, define events as
A(x) = { ω ∈ Ω : [x(ω)] } ,
where [·] denotes a "predicate" (e.g. the event 'x > 2'), and therefore act
as "filters" of certain experimental outcomes.
Probability densities are defined as densities of probability measures:
p(x) = (d/dx) P(A(x)) |_x ,   with A(x) = { x′ ∈ [x, x + dx] } ,   x ∈ ℝ .
Finally, joint densities (for the case of two random variables X, Y) are
defined as
p_XY(x, y) = p({ ω : X(ω) = x ∧ Y(ω) = y }) .
Joint densities of more than two r.v.'s are defined analogously.
2.1.2 Three Simple Rules
Probability theory is a mathematically elegant theory. The whole construc-
tion can be based on the following three simple rules:
1. The Product rule, which gives the probability of the logical conjunction
of two events A and B,
P(A ∩ B) = P(A|B) P(B) .
This can be generalized to N events, giving the chain rule
P( ⋂_{i=1}^{N} A_i ) = ∏_{i=1}^{N} P( A_i | ⋂_{i′=1}^{i−1} A_{i′} ) ,
with the convention that the conditioning set is empty for i = 1.
This will be valuable for reasoning in Bayesian networks later.
2. Bayes' rule, which is a recipe that tells us how to update our knowledge
in the presence of new information, and can be derived directly from
the definition of conditional probability and the product rule,
P(A|B) = P(B|A) P(A) / P(B) ,   P(B) ≠ 0 .
3. Marginalization: given a joint density, p_XY(x, y), we obtain the marginal
density of X or Y by integration (i.e. we 'integrate out' the uncertainty
in one variable):
p_X(x) = ∫_{y ∈ 𝒴} p_XY(x, y) dy .
In principle, this is everything we need to know in order to perform prob-
abilistic modelling and inference.
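As a quick illustration (my own example, not part of the original text), the three rules can be exercised on a made-up discrete joint distribution; all the numbers below are invented for the sketch.

```python
import numpy as np

# Hypothetical joint P(X, Y) over two binary variables:
# X = disease status, Y = test outcome. Rows index X, columns index Y.
p_xy = np.array([[0.855, 0.095],   # X = 0 (healthy)
                 [0.005, 0.045]])  # X = 1 (disease)
assert np.isclose(p_xy.sum(), 1.0)

# Rule 3 (marginalization): sum out the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Rule 1 (product rule), rearranged: P(Y|X) = P(X, Y) / P(X).
p_y_given_x = p_xy / p_x[:, None]

# Rule 2 (Bayes' rule): P(X|Y) = P(Y|X) P(X) / P(Y).
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

print(p_x_given_y[1, 1])   # P(disease | positive test) ≈ 0.321
```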
2.2 Second order decompositions: Singular value decomposition and Principal Component Analysis
Let X be an M × N rectangular matrix,
X = [ x_{1,1} ⋯ x_{1,N}
        ⋮    ⋱    ⋮
      x_{M,1} ⋯ x_{M,N} ] ,
and assume without loss of generality that M ≥ N. The singular value
decomposition (SVD) is a factorization of the matrix X such that
X = U S Vᵀ ,   (2.1)
where the M × M orthogonal matrix U = [u_i] is called the left eigenvector
matrix of X, and the N × N orthogonal matrix Vᵀ = [v_iᵀ] is its right
eigenvector matrix. The square roots of the N eigenvalues of the covariance
matrices¹ XXᵀ and XᵀX are the singular values of X, σ_i = √λ_i, forming
the diagonal matrix S = diag(σ_i). The singular values are nonnegative and
sorted in decreasing order, σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_N, forming the
spectrum of X.
The singular value decomposition of X can also be written as
X = Σ_{i=1}^{r} σ_i u_i v_iᵀ ,   (2.2)
¹Note that XᵀX = V S² Vᵀ and X Xᵀ = U S² Uᵀ.
where u_i is the i-th eigenvector of XXᵀ and v_i is the i-th eigenvector of
XᵀX, as above, and r ≤ N is the rank of X. In other words, a matrix, X,
can be written as a linear superposition of its eigenimages, i.e. a sum of the
outer products of its left and right eigenvectors, u_i v_iᵀ, weighted by the square
roots of the eigenvalues, σ_i. The important fact here is that often relatively
few eigenvalues contain most of the 'energy' of the matrix X. If r < N, the
energy of a data matrix, X, can be captured with fewer than N variables,
since the relevant information is contained in a lower-dimensional subspace of
the measurement space. This is a form of dimensionality reduction. Note,
however, that due to the presence of noise in the data we may actually have
r = N. This also hints at a denoising scheme in which one regards the
smaller eigenvalues as corresponding to the noise and then forms a truncated
SVD, X_t = U_t S_t V_tᵀ, where t < r. X_t is the unique minimizer of ‖X − X_t‖
among all rank-t matrices under the Frobenius norm and also a minimizer
(perhaps not unique) under the 2-norm.
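The truncated-SVD idea is easy to verify numerically. The following sketch (an illustration of my own; the matrix sizes and the noise level are arbitrary) checks that the rank-t truncation attains exactly the energy of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, t = 8, 5, 2
# An approximately rank-2 matrix plus a little noise, so that r = N
# numerically but most of the 'energy' sits in two singular values.
X = rng.standard_normal((M, t)) @ rng.standard_normal((t, N)) \
    + 0.01 * rng.standard_normal((M, N))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.all(np.diff(s) <= 0)            # spectrum sorted decreasingly

X_t = (U[:, :t] * s[:t]) @ Vt[:t, :]      # truncated SVD, rank t
frob_err = np.linalg.norm(X - X_t, 'fro')
# The Frobenius error equals the energy in the discarded sigma_i.
assert np.isclose(frob_err, np.sqrt((s[t:] ** 2).sum()))
```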
Remark 1 For many important applications, such as fMRI and other biosig-
nals, the signal of interest represents only a small part of what is measured
(Lazar, [106]). Consequently, an optimization criterion that searches for
components with maximum signal power, such as PCA, will fail to recover
the signals we are looking for. Methods that exploit higher-order statistics in
the data are therefore needed. Second-order methods can still be very useful
as a preprocessing step, however, and are often used as such.
2.3 Higher-order decompositions: Independent Component Analysis
In this section we review the independent component analysis (ICA) ap-
proach to source separation, with an emphasis on the aspect of non-gaussianity.
Methodological and review literature includes [148], [87], [160], [75]. Addi-
tional resources are given below.
In the noiseless setting, the observation model for ICA is
x = As , (2.3)
where we have assumed that the observations have been de-meaned (i.e. we
have translated the coordinate system to the data centroid). ICA employs the
principle of redundancy reduction (Barlow, [11]) embodied in the requirement
of statistical independence among the components (Nadal and Parga, [130]).
In statistical language, this means that the joint density factorizes over
latent sources:
P(s) = ∏_{l=1}^{L} P_l(s_l) ,   (2.4)
where P(s) is the assumed distribution of the sources, s = (s_1, …, s_L),
regarded as stochastic variables, and the P_l(s_l) are appropriate non-Gaussian
priors.
Non-Gaussianity is the defining characteristic of the ICA family with respect
to PCA. We seek non-Gaussian sources for two, related, reasons:
• Identifiability,
• “Interestingness”.
Gaussians are not interesting since, by the central limit theorem, the superposition
of independent sources tends to be Gaussian. The concept of interestingness is directly exploited
in the related method of Projection Pursuit (Friedman and Tukey, [69]),
where the goal is to find the projection directions in a data set that show the
least Gaussian distributions. An important result relating to the probability
densities of the individual sources, due to Comon and based on the Darmois-
Skitovitch theorem2 [43], formalizes the above and states that for analysis
in independent components at most one source may be Gaussian, in order
for the model to be estimable [39]. Geometrically, this indeterminacy of
Gaussian point clouds is due to the rotational invariance of the Gaussian
distribution under orthogonal transformations (Hyvarinen, [90]). Gaussian
point clouds are optimally described in terms of the PCA decomposition
method (see Figure 2.1, left; the right panel follows Lewicki and Sejnowski).
This geometric view will be extended in section 5.2.
A related concept is that of linear structure (Rao, [145]; Beckmann and
Smith, [15]). A vector, x, is said to have a linear structure if it can be
decomposed as x = µ+ As, where s is a vector of statistically independent
random variables and the matrix A is of full column rank. Beckmann and
Smith use results from Rao [145] in order to ensure uniqueness of their ICA
decomposition. In particular, they use the fact that conditioned on knowing
the number of sources and the assumption of non-Gaussianity, there is no
non-equivalent decomposition into a pair (A, s), that is a decomposition with
mixing matrix that is not a rescaling and permutation of A.
Equation (2.4) is equivalent to minimizing the mutual information among
²The Darmois-Skitovitch theorem reads:
Theorem 1 (Darmois-Skitovitch) Let ξ_1, …, ξ_n be independent random variables and
let α_i and β_i, i = 1, …, n, be nonzero real numbers such that the random variables
Σ_{i=1}^{n} α_i ξ_i and Σ_{i=1}^{n} β_i ξ_i are independent. Then the ξ_i's are Gaussian.
See, for example, V. Bogachev, 'Gaussian measures' [19], p. 13.
Figure 2.1: Seeking structure in data. Component analysis can be viewed as a family of data representation methods. The challenging task is to find "informative" directions in data space. These correspond to the column vectors of the observation (transformation) matrix and form a new coordinate system. The directions are non-orthogonal in general. (Left) Rotational invariance of the distribution of independent Gaussian random variables. A scatterplot (point cloud) drawn from two such Gaussian sources illustrates the fact that there is not enough structure in the data to find characteristic directions in data space. Algebraically, we can only estimate the linear map up to an orthogonal transformation. (Center) Point cloud generated from a non-Gaussian distribution. (Right) The data cloud contains more structure in this case, which we want to exploit. In particular, the geometric shape of the point cloud of this figure is an example of a dataset that is sparse with respect to the coordinate axes shown by the two arrows.
the inferred sources³ [16],
min I(s_1, …, s_L) ,   where
I(s_1, …, s_L) = ∫ ds_1 ⋯ ∫ ds_L  p(s_1, …, s_L) log [ p(s_1, …, s_L) / ∏_l p(s_l) ] ,
³For two stochastic variables X and Y to be independent, it is necessary and sufficient
that their mutual information equals zero:
I(X, Y) = H(X) + H(Y) − H(X, Y)
= ∫ dX dY P_{X,Y}(X, Y) log P_{X,Y}(X, Y) − ∫ dX P_X(X) log P_X(X) − ∫ dY P_Y(Y) log P_Y(Y) = 0 ,
where the quantity H(Z) is the 'differential' entropy of the random variable Z.
or, equivalently, the distance between the distribution p(s) and the fully
factorized one, ∏_l p(s_l), measured in terms of the Kullback-Leibler divergence,
KL[ p(s) ‖ ∏_l p(s_l) ]. This enables us to separate statistically independent
sources, up to possible permutations and scalings of the components [39].
The mutual information ("redundancy") can be equivalently computed as
I(s_1, …, s_L) = Σ_{l=1}^{L} H(s_l) − H(s_1, …, s_L) ,
where the first term on the rhs is the sum of the entropies of the individual
sources and the second is the joint entropy of (s_1, …, s_L). As shown by Bell &
Sejnowski (1995), independence can lead to separation because the method
exploits higher-order statistics in the data, something that cannot be done
with methods such as PCA.
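The entropy identity above is easy to check numerically. The following sketch (my own illustration, with an arbitrary 2 × 2 joint distribution) also verifies the equivalence with the Kullback-Leibler form of the redundancy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])              # illustrative joint P(X, Y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# I(X, Y) = H(X) + H(Y) - H(X, Y) ...
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
# ... equals KL[ P(X, Y) || P(X) P(Y) ], the 'redundancy'.
kl = (p_xy * np.log(p_xy / np.outer(p_x, p_y))).sum()
assert np.isclose(mi, kl)
```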
In practice, many ICA algorithms minimize a variety of ‘proxy’ func-
tionals. Bell and Sejnowski’s ICA approach uses the InfoMax principle
(Linsker, [113]), maximizing information transfer in a network of nonlinear
units (Bell & Sejnowski, [16]). Based on this, Bell and Sejnowski derive their
very successful Infomax-ICA algorithm. The sources are estimated as
s = u = Wx , (2.5)
where W is the separating (unmixing) matrix that is iteratively learned by
the rule
W ← W + η ( I − E[φ(u) uᵀ] ) W ,   (2.6)
where we have incorporated Amari et al.'s natural gradient descent approach
[2]. The vector-valued map φ(u) = (φ_1(u_1), …, φ_L(u_L)) is an appropriate
nonlinear function of the output, u, such as a sigmoidal 'squashing' function,
applied component-wise. Popular choices are the logistic transfer function,
φ(u) = 1/(1 + e^{−u}), and the hyperbolic tangent, φ(u) = tanh(u). Bell and Sejnowski
show that optimal information transfer, that is, maximum mutual information
between inputs and outputs, or equivalently maximum entropy for the
output, is obtained when steeply sloping parts of the transfer function are
aligned with high-density parts of the probability density function of the input.
The expectation operator, E[·], is in practice approximated by an average over
samples. Finally, the factor η is an appropriate learning rate.
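Rule (2.6) can be sketched in a few lines of NumPy. This is an illustration only: the Laplacian sources, the 2 × 2 mixing matrix, the tanh nonlinearity, and the learning rate are all choices made for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000
S = rng.laplace(size=(2, N))          # two super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])            # square mixing matrix
X = A @ S

W = np.eye(2)                          # initial separating matrix
eta = 0.05
for _ in range(500):
    U = W @ X                          # current source estimates u = Wx
    phi_U = np.tanh(U)                 # component-wise nonlinearity
    # Natural-gradient Infomax rule: W <- W + eta (I - E[phi(u) u^T]) W,
    # with the expectation replaced by a sample average.
    W += eta * (np.eye(2) - phi_U @ U.T / N) @ W

# After learning, W A should be close to a rescaling and permutation
# of the identity, i.e. each row dominated by a single entry.
P = W @ A
```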
Hyvarinen chooses to focus explicitly on non-Gaussianity and derives a
fixed-point algorithm, dubbed FastICA [89]. Non-Gaussianity can be quantified
using the negentropy, J,
J(u) = H(u_Gauss) − H(u) ,
where u_Gauss is a Gaussian random variable with the same covariance as u.
The FastICA algorithm maximizes an approximation of J using the estimate
J(u_l) ≈ ( E[G(u_l)] − E[G(u_Gauss)] )² ,
where G(·) is an appropriate nonlinearity, such as the non-quadratic function
G(z) = z⁴, implicitly related to the source distributions (see below),
u_Gauss is a standardized Gaussian r.v., and u_1, …, u_l, …, u_L are also of
zero mean and unit variance. The unknown sources, {u_l}_{l=1}^{L}, are again estimated
using the projections u_l = w_lᵀ x, where w_l is the l-th separating vector
(column of W), found by the iteration
w ← E[ x g(wᵀx) ] − E[ g′(wᵀx) ] w ,
where g(·) is the derivative of G(·), and w is each time rescaled as
w ← w/‖w‖. For an application of the non-Gaussianity principle to fMRI see the
Probabilistic ICA algorithm of Beckmann and Smith [15].
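The one-unit fixed-point iteration can be sketched as follows (a toy illustration of my own, with g = tanh and synthetic Laplacian sources; the iteration operates on whitened data, as the text discusses below):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20000
S = rng.laplace(size=(2, N))               # super-Gaussian sources
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])
X = A @ S
X -= X.mean(axis=1, keepdims=True)

# Whiten with the symmetric map U Lambda^{-1/2} U^T, so that the
# whitened data have identity sample covariance.
C = X @ X.T / N
lam, V = np.linalg.eigh(C)
X_w = (V / np.sqrt(lam)) @ V.T @ X

# One-unit FastICA: w <- E[x g(w^T x)] - E[g'(w^T x)] w, renormalized.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(50):
    u = w @ X_w
    w = (X_w * np.tanh(u)).mean(axis=1) - (1 - np.tanh(u) ** 2).mean() * w
    w /= np.linalg.norm(w)

u = w @ X_w    # one recovered source, up to sign and scale
```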
The ICA decomposition can be written as a factorization of an "unfolding"
matrix times a rotation matrix. The former is usually implemented
by pre-whitening (pre-sphering) the observations, such that E[x̃x̃ᵀ] = I_D,
where x̃ denotes the whitened observations:
x̃ = W_sph x .
W_sph can be computed from the eigendecomposition of the data covariance
matrix, C_xx = E[xxᵀ] ≐ UΛUᵀ, where the matrix U is a unitary matrix
containing the eigenvectors of C_xx and Λ = diag(λ_1, …, λ_D) is the diagonal
matrix of eigenvalues. Then the decomposition problem can be written
(taking the "square root" and inverting) as
x̃ = Λ^{−1/2} Uᵀ A s = W_sph A s = Ã s ,   i.e.   A = W_sph^{−1} Ã .
That Λ^{−1/2}Uᵀ spheres the data can be seen by simply performing the operations
for E[x̃x̃ᵀ], taking into account that U is an orthogonal matrix [81].
The above whitening operation transforms the original data vectors to the
space of the eigenvectors and rescales the axes by the singular values. Alternatively,
one may use UΛ^{−1/2}Uᵀ for whitening, which maps the data back to
the original space. This often makes further processing easier. In any case,
since the whitening transformation removes any second-order statistics (correlations)
in the data, learning the ICA matrix Ã is equivalent to learning a
pure orthogonal rotation matrix:
E[x̃x̃ᵀ] = Ã E[ssᵀ] Ãᵀ = Ã Ãᵀ = I .
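Both whitening maps can be checked numerically. The sketch below (with an arbitrary 3 × 3 correlating map, purely for illustration) verifies that either choice of W_sph removes all second-order correlations:

```python
import numpy as np

rng = np.random.default_rng(3)
B = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 0.7]])            # arbitrary correlating map
X = B @ rng.standard_normal((3, 10000))
X -= X.mean(axis=1, keepdims=True)

C = X @ X.T / X.shape[1]                   # sample covariance C_xx
lam, U = np.linalg.eigh(C)                 # C = U diag(lam) U^T

W_pca = np.diag(lam ** -0.5) @ U.T         # Lambda^{-1/2} U^T
W_sym = U @ np.diag(lam ** -0.5) @ U.T     # U Lambda^{-1/2} U^T

for W_sph in (W_pca, W_sym):
    Z = W_sph @ X
    # Either choice yields identity sample covariance (sphered data).
    assert np.allclose(Z @ Z.T / Z.shape[1], np.eye(3), atol=1e-8)
```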
2.3.1 Probabilistic Inference for ICA
Many classical component analysis algorithms can be interpreted under the
same probabilistic framework, as generative models. The model we consider
here is the transformation
x = As + ε , (2.7)
where an L–dimensional vector of latent variables, s, is linearly related to a
D-dimensional vector of observations via the observation operator A. Obser-
vation noise, ε, may in general be added to the observations. In other words,
the observed data is ‘explained’ by the unobserved latent variables, while the
mismatch between the observations and the model predictions, x − As, is
explained by the additive noise. The fundamental equation of ICA, which
we write again below,
P(s) = ∏_{l=1}^{L} P_l(s_l) ,   (2.8)
can be seen, in another interpretation, as a modelling assumption (i.e. a
working hypothesis): the factorization of a multi-dimensional distribution into
a product of simpler one-dimensional distributions. Classical ICA models
such as Infomax ICA and FastICA assume noiseless and square mixing. This
restriction is removed in more recent algorithms.
Remark 2 The generative model of Eqns (2.7), (2.8) defines a constrained
probability distribution in data space. Referring back to Fig. 2.1, the “arms”
of the point-cloud are oriented along the directions of the “regressors”, which
are encoded in the column vectors of the mixing matrix. Thus, when defining
and learning a probabilistic ICA model, we are in fact defining at least
three things: the source distributions, the mixing matrix, and the noise model.
Later, this remark will give us an insight into why ICA algorithms are
so successful at decomposing certain types of data, such as fMRI.
In the general, noisy and non-square mixing case, one can formulate the
penalized optimization problem (see e.g. [118], [26], [141], [169], and [163]
for a nice concise review)
ŝ = argmax_s { −(1/(2σ²)) ‖x − As‖² + Σ_{l=1}^{L} log p_l(s_l) } ,   (2.9)
assuming, for example, spherical Gaussian noise, ε ∼ N(0, σ²I_D), in order to
reconstruct the sources from the inputs at their most probable value.
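For intuition (an illustration of my own, not a derivation from the thesis): with identity mixing, A = I, and a Laplacian prior p_l(s_l) ∝ exp(−λ|s_l|), problem (2.9) decouples per component and its maximizer is the soft-thresholding operator with threshold λσ², an operation that reappears in the iterative-thresholding chapters. A brute-force numerical check:

```python
import numpy as np

def map_source(x, lam, sigma2):
    """Grid-search argmin of (x - s)^2 / (2 sigma2) + lam |s| over s."""
    grid = np.linspace(-10.0, 10.0, 200001)
    cost = (x - grid) ** 2 / (2 * sigma2) + lam * np.abs(grid)
    return grid[np.argmin(cost)]

def soft_threshold(x, t):
    """Closed-form MAP solution: shrink x toward zero by t."""
    return np.sign(x) * max(abs(x) - t, 0.0)

lam, sigma2 = 1.0, 0.5                 # illustrative prior/noise values
for x in (-3.0, -0.2, 0.05, 1.7):
    # The two solutions agree up to the grid resolution.
    assert abs(map_source(x, lam, sigma2)
               - soft_threshold(x, lam * sigma2)) < 1e-3
```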
As shown by MacKay [118] and Pearlmutter and Parra [141], Infomax-ICA
can be interpreted as a maximum likelihood model. Assuming square
mixing (i.e. as many latent dimensions as observations, L = D) and invertibility
of the mixing matrix, the separating matrix is W = A^{−1}. We can
then immediately write down the probability of the data as
p(x) = |det(J)| p(s) ,
where J is the Jacobian matrix of the transformation, with J_{l,d} = ∂s_l/∂x_d. Using
the fundamental assumption of ICA, of mutual independence of the latent
variables, p(s) = ∏_{l=1}^{L} p(s_l), and under the linear model we have
p(x) = |det(W)| ∏_l p(s_l) .
The log-likelihood of an i.i.d. data set, X = {x_n}_{n=1}^{N}, can then be written
as
L(θ) ≝ log p(X|θ) = log ( ∏_{n=1}^{N} p(x_n|θ) )
= N log |det(W)| + Σ_{n=1}^{N} Σ_{l=1}^{L} log p_l( w_lᵀ x_n ) ,
where we have substituted s_{l,n} with u_{l,n} = w_lᵀ x_n = Σ_{d=1}^{D} w_{l,d} x_{d,n}. The
parameter vector, θ, here contains the mixing matrix, A, or equivalently the unmixing
one, W = A^{−1}, since these are uniquely related here.
We can now derive a maximum likelihood algorithm for ICA via gradient
ascent, in order to learn the separating matrix, W. Taking the derivative
of L(θ) with respect to W and using well-known derivative rules we finally
find the learning rule
∂L(θ)/∂W_{l,d} = [Aᵀ]_{l,d} − z_l x_d ,
where we have used the shorthand notation z_l = φ_l(u_l), the ICA
nonlinearity being the score function of the sources, φ_l(s_l) = −(∂/∂s_l) log p_l(s_l),
with p_l(s_l) the assumed source priors. Multiplying by WᵀW, to
make the algorithm covariant [118], we get exactly the Infomax-ICA update
rule, Eq. (2.6). Note that the above multiplication is equivalent to using the
'natural gradient' approach of Amari [2], a learning algorithm based on the
concept of information geometry.
The FastICA algorithm can be interpreted as an instance of the EM
algorithm [48]. Lappalainen [103] derives it as an algorithm that filters
Gaussian noise. The general idea of the EM algorithm is to estimate the
latent variables, Y, and model parameters, θ, of a probabilistic model (which
in this case are the sources, S, and mixing matrix, A, of the BSS problem), in
two alternating steps. The 'E' (expectation) step computes the expectation
of the log-likelihood with respect to the posterior distribution p(Y | X, θ^{(τ)})
using the current (τ-th) estimate of the parameters, giving the so-called 'Q'-
function,
Q(θ | θ^{(τ)}) = E_{Y|X,θ^{(τ)}}[ log L(θ; X, Y) ] ,
which is a function of θ only, and the 'M' (maximization) step, which computes
the model parameters that maximize the expected log-likelihood,
θ^{(τ+1)} = argmax_θ Q(θ | θ^{(τ)}) .
Applying the above generic EM recipe, we can compute the maximum
likelihood estimate of the mixing matrix of our ICA model as
A = ( X E[S]ᵀ ) ( E[SSᵀ] )^{−1} ,
where the expected sufficient statistics⁴ of the sources, E[S] and E[SSᵀ],
are computed with respect to their posterior⁵. In the low-noise (σ² → 0)
and square-mixing case of FastICA, Lappalainen approximates the posterior
mean as
ŝ = E[ s | A, x, σ²I_D ] ≈ s₀ + σ² (AᵀA)^{−1} φ(s₀) ,
where s₀ ≝ A^{−1}x and the function φ(·) is defined as before. For prewhitened
data, this expression simplifies even more, since A is orthogonal and therefore
(AᵀA)^{−1} = I_L. Now Lappalainen makes the crucial observation that
while the EM algorithm has not yet converged to the optimal values, the
sources, s₀, can be written as a mixture
s₀ = α s_opt + β s_G ,
where the "noise" s_G is mostly due to the other sources not having been
perfectly unmixed. (Close to the optimal solution, we have β ≈ 0
and α ≈ 1.) Using an argument based on the central limit theorem, as
the number of the other sources becomes large he approximates the mixing
matrix corresponding to those other sources as
a_G ≈ a + σ² X_G φ(s_{0G})ᵀ / L ,
⁴A sufficient statistic is the minimal statistic that provides sufficient information about
a statistical model. Typically, the sufficient statistic is a simple function of the data, e.g.
the sum of all the data points, the sum of squares of the data points, etc.
⁵These are relationships that will become useful later in the Thesis as well.
where X_G are Gaussian-distributed "sources" with the same covariance as
X, as is done in the standard FastICA algorithm, and the sources s_{0G} are
s_{0G} ≝ aᵀX_G. Then the update equation for the mixing matrix, normalized
to unity, is
a_new = (a − a_G) / ‖a − a_G‖ ≈ σ² [ X φ(s₀)ᵀ − X_G φ(s_{0G})ᵀ ] / ( L ‖a − a_G‖ ) .
The final step that will bring us to the standard FastICA is to note that the
term X_G φ(s_{0G})ᵀ/L is equal to a s_{0G} φ(s_{0G})ᵀ/L, where the factor s_{0G} φ(s_{0G})ᵀ/L
is constant, and therefore the numerator of the update equation becomes the
standard FastICA update, a − a_G = X φ(s₀)ᵀ − c a .
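The M-step formula for A above is easy to sanity-check in isolation. In the degenerate sketch below (my own example: no noise, and the posterior statistics replaced by the true sources), the formula must recover the mixing matrix exactly; all sizes and distributions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
L, D, N = 2, 3, 1000
S = rng.laplace(size=(L, N))               # 'true' sources
A_true = rng.standard_normal((D, L))
X = A_true @ S                             # noiseless observations

ES = S                                     # stand-in for E[S]
ESS = S @ S.T                              # stand-in for E[S S^T]

# M-step: A = (X E[S]^T) (E[S S^T])^{-1}
A_hat = (X @ ES.T) @ np.linalg.inv(ESS)
assert np.allclose(A_hat, A_true)
```

With noisy data and genuine posterior moments the recovery is only approximate, but the same one-line update applies.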
While Teh [163] computes the data likelihood in a maximum likelihood
framework, Knuth [99] uses a maximum a-posteriori framework. The latter
allows us to impose constraints on the model parameters as well. This was
further explored by Hyvarinen and Karthikesh [88] in order to impose
sparsity on the mixing matrix.
Up to now we have either assumed equal number of sources and sensors
or we have implicitly assumed that their number is somehow given. Roberts
[147] derives a Bayesian algorithm for ICA under the evidence framework that
estimates the most probable number of sources as a model order estimation
problem. The evidence framework makes a local Gaussian approximation to
the likelihood conditioned on the mixing matrix, but takes into account the
local curvature by estimating the Hessian. For computational reasons, the
Hessian is approximated here by a diagonal matrix, setting the off-diagonal
elements to zero. The noise width, regarded as a hyperparameter, is computed
at its maximum likelihood value.
Finally, Choudrey et al. [36] and Miskin and MacKay [128] propose a fully
Bayesian approach to ICA using a variational, ensemble learning approach
under a mean-field approximation. They use a flexible source model based
on mixtures of Gaussians and perform model order estimation using a variety
of techniques. The variational Bayesian approach to probabilistic inference
will be further elaborated upon in Chapter 5.
We can now select an appropriate functional form for the individual
marginal distributions, pl(sl,n), based on our prior knowledge about the prob-
lem, as was done in the original formulation of InfoMax ICA of Bell & Se-
jnowski for the separation of speech signals, for example. The source model
should model the real source distributions as accurately as possible. Many
natural signals exhibit characteristic amplitude distributions, which can pro-
vide some guidance and indeed should be exploited when possible. This
allows us to utilize fixed source models in our separation algorithms. Bell
and Sejnowski, for example, use several nonlinearities (recall that these are
uniquely related to the assumed PDFs of the sources), such as 1/(1 + e^{−u_i}),
tanh(u_i), e^{−u_i²}, etc., as well as propose general-purpose 'score functions' (see
Figure 2 of Ref. [16]) in their Infomax-ICA algorithm. FastICA uses nonlinearities
such as u_i³, tanh(αu_i), u_i e^{−αu_i²/2}, and u_i². However, this is not always
possible.
possible. The problems that can arise from an incorrect latent signal model
and possible solutions are discussed in section 2.5.
As noted by Cardoso [28], however, non-Gaussianity is not the only possi-
ble route to independent component analysis, and to blind source separation
in general; other possibilities also exist, including exploiting non-stationarity
and time-correlation in signals. Such a different paradigm, sparsity, in combination
with doing away with the assumption of independence, will be explored
next.
2.4 Sparse Decompositions
2.4.1 Parsimonious representation of data
ICA works well for a variety of blind source separation problems. However,
in order for the decomposition to make sense the true sources must them-
selves be (nearly) independent. This may make sense in the separation of
voice signals that are independently generated by people with no interaction
among them, for example. Recently, another paradigm for BSS, and for inverse
problems in general, has emerged: sparsity. Sparsity refers to the property
of a representation to form compact encodings of signals, data, or functions,
using a small number of basis functions. Those basis functions are used as
“building blocks” to build more complex signals.
There has been a variety of algorithms for sparse representation, or sparse
coding, originating from the neuroscience and neural networks communities
as well as several others from a signal processing perspective. Sparse decom-
position, and ways to impose sparsity constraints, has recently been a topic of
much research in the statistics and machine learning literature.
In the study of the visual system, Field [62] proposed sparsity as an
organization principle of the visual receptive field. He conjectured that pop-
ulations of neurons optimize the representation of their visual environment
by forming sparse representations of natural scenes, a hypothesis that has
high biological plausibility since it is based on the general idea of a system
using its available resources efficiently. According to his theory, the visual
system performs efficient coding of natural scenes in terms of natural scene
statistics by finding the sparse structure available in the input. Field’s theory
directly reflects the idea of redundancy reduction of Barlow [11], [12].
Olshausen and Field [137] further test the above theory, seeking experi-
mental evidence for sparsity in the primary visual cortex (V1) by building a
Figure 2.2: The Olshausen and Field model [137] as a network. The outputs of the network are the coefficients, a_i. The symbol r(x) is the residual image, r(x) = I(x) − Σ_i a_i φ_i(x).
predictive (mathematical) model of sparse coding. In their model, images are
formed as a linear combination of local basis functions with corresponding
activations that are as sparse as possible. These bases model the V1 receptive
fields and form overcomplete sets adapted to the statistics of natural images.
Formally, the model of Olshausen and Field is described by

x_p ≈ Σ_i a_{i,p} φ_i ,

where x_p is an image “patch” (i.e. a small image window) and φ_i are the
underlying basis elements. A network representation of their model is shown
in Fig. 2.2. They proposed the following objective:
I(Φ) = min_{a_{i,p}}  Σ_p { ‖x_p − Σ_i a_{i,p} φ_i‖² + λ Σ_i log(1 + a_{i,p}²) } ,
to be minimized over bases, Φ, learned by searching for a basis that opti-
mized the sparsity of the coefficients, ai,p, (subject to the appropriate scale
Figure 2.3: Activities resulting from the model of Olshausen and Field [137]. The input image on the left is reconstructed from learned bases using the algorithm of O. & F. Note how the coefficients, a_k, resulting from the model are highly sparse, compared to reconstructing the image patch using random bases or pixel (canonical) bases.
normalization of φi). In general, the basis set can be overcomplete. That
is, the number of bases in Φ, |Φ|, can be greater than the dimensionality of
the ‘input’ data space (see for example [138]). The reason for this is that
the ‘code’ can be more sparse if one allows an overcomplete basis set. See
also Asari, [3]. As shown in Fig. 2.3 this objective results in highly sparse
distributions for the coefficients. Astonishingly, the learned receptive fields
(filters) have properties that resemble the properties of natural simple-cell
receptive fields, that is they are spatially localized, oriented and bandpass,
i.e. selective to structure at different spatial scales (Fig. 2.4).
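The objective above can be minimized over the activations by plain gradient descent; below is a minimal NumPy sketch with illustrative dimensions and step size (Olshausen and Field used a more elaborate optimization scheme, and also adapted the basis Φ, which is held fixed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def of_objective(x, Phi, a, lam):
    # squared reconstruction error + log(1 + a^2) sparsity penalty
    r = x - Phi @ a
    return float(r @ r + lam * np.sum(np.log1p(a**2)))

def of_grad(x, Phi, a, lam):
    # gradient of the objective with respect to the coefficients a
    return -2.0 * Phi.T @ (x - Phi @ a) + lam * 2.0 * a / (1.0 + a**2)

D, L, lam, eta = 8, 16, 0.1, 0.05            # overcomplete: L > D
Phi = rng.standard_normal((D, L))
Phi /= np.linalg.norm(Phi, axis=0)           # scale-normalized basis
a_true = rng.standard_normal(L) * (rng.random(L) < 0.2)   # sparse activations
x = Phi @ a_true

a = np.zeros(L)
costs = [of_objective(x, Phi, a, lam)]
for _ in range(200):                          # plain gradient descent on a
    a -= eta * of_grad(x, Phi, a, lam)
    costs.append(of_objective(x, Phi, a, lam))
```

The log(1 + a²) term penalizes many small activations more than a few large ones, which is what pushes the inferred code towards sparsity.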
In the signal processing community, Mallat and Zhang [122] proposed
a greedy algorithm, analogous to projection pursuit in statistics, called
‘matching pursuit’, that iteratively finds the best matching projections of
signals onto a fixed overcomplete dictionary of time-frequency ‘atoms’. Lin-
ear combinations of those atoms form compact representations of the given
signal.
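The matching pursuit iteration is simple to sketch; the version below is a textbook greedy loop over a random unit-norm dictionary, not Mallat and Zhang's time-frequency implementation, and all names are illustrative:

```python
import numpy as np

def matching_pursuit(x, D, n_iter=10):
    """Greedy matching pursuit over a dictionary D with unit-norm columns.

    Repeatedly pick the atom most correlated with the residual and subtract
    its projection.
    """
    r = x.copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = D.T @ r                         # correlations with all atoms
        k = int(np.argmax(np.abs(c)))       # best-matching atom
        coeffs[k] += c[k]
        r = r - c[k] * D[:, k]              # deflate the residual
    return coeffs, r

rng = np.random.default_rng(1)
Dict = rng.standard_normal((32, 64))
Dict /= np.linalg.norm(Dict, axis=0)        # unit-norm atoms
x = 2.0 * Dict[:, 3] - 1.5 * Dict[:, 40]    # exactly two active atoms
coeffs, resid = matching_pursuit(x, Dict, n_iter=30)
```

Because each step removes the component of the residual along the chosen atom, the residual norm is non-increasing, and for a signal lying in the span of a few atoms it decays quickly.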
Figure 2.4: Learned receptive fields (filters) from the sparse coding algorithm of Olshausen and Field [137]. These filters exhibit properties of simple-cell receptive fields such as locality, orientation and spatial selectivity.
Remark 3 A geometric interpretation of sparse representation is depicted
in Fig. 2.5. Each data vector can be viewed as a point in a D–dimensional
vector space, the whole dataset forming a cloud of points. We now seek a
linear transformation of the dataset such that the inferred “projections” on
to the new coordinate system defined by the column vectors of the learned
transformation matrix, A = [a_l]_{l=1}^{L}, are as sparse as possible.
Note that it is the sparseness of the components (and the selection of a suit-
able model prior) that drives learning of the new representation (unmixing)
directions. This sparseness is reflected in the shape of the point-cloud: referring
to the above figure (where D = L = 2), sparse data mapped into the
latent space produce a highly-peaked and heavy-tailed distribution for both
axes (Fig. 2.5 (right)). This is indeed a result of the sparseness property of
the dataset: the two ‘arms’ of the sparse data cloud are tightly packed around
the directions of the unmixing vectors, al. Algebraically, this means that for
a particular point, n, either the coefficient s1,n (l = 1) or the coefficient s2,n
(l = 2) is almost zero, as the particular datum is well described by the a2 or
Figure 2.5: Geometric interpretation of sparse representation. State-spaces and projections of two datasets, one sparse (right) and the other non-sparse (left), are shown. Each dataset, plotted in the measurement coordinate system, (A, B) (‘state-space’ in the terminology of Field), produces a point cloud (top part of the figure); for visualization purposes, both observation and latent dimensionalities are equal to D = L = 2 in this figure. By projecting the point clouds on to each coordinate we can produce the corresponding empirical histograms of ‘state’ amplitudes. We now seek a (linear) transformation to a latent space, (A′, B′), that optimizes some suitable criterion (this is shown in the bottom part of the figure). Sparse data mapped in the latent space produce heavy-tailed distributions for both latent dimensions. (Figure from D. J. Field, Phil. Trans. R. Soc. Lond. A, 1999.)
the a1 “regressor”, respectively. On the contrary, non-sparse data will typi-
cally produce a projection that corresponds to a “fat” empirical histogram,
as shown in Fig. 2.5 (lower-left).
With respect to the soft clustering view of component analysis (Miskin,
[94]), discussed in the Introduction of the Thesis, if the data vectors are
sufficiently sparse, their images on the unit hypersphere S^{D−1} (i.e. the radial
sections of their position vectors with the unit hypersphere, mapped as
x_n ∈ E^D ↦ x̂_n ∈ S^{D−1}, where the operator û = u/‖u‖ maps vectors along
their radii) concentrate around the unit vectors {a_l}_{l=1}^{L}; see Fig. 2.6 [176]. While
Miskin did not use this property per se for sparse decomposition, one can
design separation algorithms that exploit it [111]. We shall revisit this in the
applications section of chapter 3.

Figure 2.6: Clustering of a sparse set of points on the unit hypersphere S^{D−1}. The points cluster around the direction vectors corresponding to the columns of the mixing matrix. (Figure was taken from http://cs.anu.edu.au/escience/lecture/cg/Illumination/.)
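The clustering picture of Fig. 2.6 can be sketched numerically: mix very sparse sources, map the data onto the unit sphere, and recover the mixing directions with a small k-means that clusters by absolute cosine similarity, absorbing the ±a_l sign ambiguity. All names and parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Very sparse sources: most samples have at most one active source, so the
# normalized data concentrate around the (sign-ambiguous) columns of A.
L, N = 2, 2000
S = rng.laplace(size=(L, N)) * (rng.random((L, N)) < 0.15)
A = np.array([[1.0, 0.6],
              [0.2, 1.0]])
A /= np.linalg.norm(A, axis=0)                  # unit-norm mixing directions
X = A @ S

keep = np.linalg.norm(X, axis=0) > 1e-6         # discard all-zero samples
U = X[:, keep] / np.linalg.norm(X[:, keep], axis=0)   # map onto S^{D-1}

# Tiny k-means on the sphere; clustering by |cosine| folds x and -x together.
C = np.eye(L)                                   # deterministic initial centres
for _ in range(50):
    labels = np.argmax(np.abs(C.T @ U), axis=0)
    for k in range(L):
        Uk = U[:, labels == k]
        signs = np.sign(C[:, k] @ Uk)           # fold onto a half-sphere
        m = (Uk * signs).mean(axis=1)
        C[:, k] = m / np.linalg.norm(m)
```

Up to permutation and sign, the learned centres C align closely with the columns of A, which is the geometric basis of the clustering-based estimation used by Li et al. [111].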
2.4.2 Sparse Decomposition of Data Matrices
Inspired by the model of Olshausen and Field, Donoho [50] comments on
the connection and differences between the two lines of research: indepen-
dent component analysis and sparse decompositions. He promotes the idea
of sparsity, overcompleteness, and optimal atomic decompositions as a better
goal than independence. He provides a rationale for why sparsity is a more
plausible principle, being “intrinsically important and fundamental”, due to
both biological and modelling reasons. Regarding the former, he too cites
the extremely efficient sparse representation achieved by the human visual
system, and its higher compression performance compared to the best engi-
neered systems. With respect to the latter, he notes that independence is
inherently a probabilistic assumption, and of unknown interpretability (with
respect to vision) because natural images are composed by occlusion. Occlu-
sion inevitably creates dependent components. He finally suggests that one of
the future challenges of ‘sparse components analysis’ would be to search over
spaces of objects of much larger scale than the image patches of Olshausen
and Field.
It turns out (see Olshausen, [138]) that the Infomax-ICA algorithm be-
comes, in fact, a special case of the sparse linear algorithm of Olshausen and
Field when there is an equal number of basis functions/latent dimensions
and inputs, the φis are linearly independent, and there is no observation
noise. In this case, there is a unique set of coefficients ai that is the root
of ‖X −Φa‖, and we can write a as a = WX, where W = Φ−1 (note that
based upon the above assumptions, Φ becomes invertible). If, in addition,
the ICA nonlinearity is chosen to be the cumulative density function of the
sparse components, then the sparse algorithm gives exactly the algorithm of
Bell and Sejonwski. The point here is actually to show that sparsity con-
straints can lead to separation. Indeed, as pointed out by Li, Cichocki and
Amari [111],
Remark 4 Sparse decompositions of data matrices can be used for the blind
source separation problem.
They provide various examples from simulations and EEG data analysis that
demonstrate the performance of sparse decompositions in signal separation.
Li, Cichocki and Amari performed a sophisticated mathematical analysis for
the case of sparse representation of data matrices under the ℓ1 prior, for given
basis matrices. They tackle the two-step decomposition problem of learning
the basis matrix first, via clustering, and then estimating the coefficients of
the decomposition. If X is a data matrix and A = [a_l] is a given basis, Li
et al. start from the mathematical model shown below:

min_S  S(S) = Σ_{l=1}^{L} Σ_{n=1}^{N} |s_{l,n}|   subject to   AS = X ,    (2.10)

with S(·) the sparsity function on the sources. This particular optimization
problem can then be solved using linear programming. While the
ℓ0–norm solution is the sparsest one in general, its optimization is a non-
trivial combinatorial problem. Li et al. show that, for sufficiently sparse
signals, the solutions to the problem of sparse representation of data matri-
ces that are obtained using the ℓ0 and ℓ1 norms are equivalent. This fact
was previously shown by Donoho and Elad [51] but Li et al. give a less strict
sparseness ratio (i.e. the ratio of zero versus non-zero elements).
Li et al. [111] also show that the above problem has a unique solution.
While in general there would be an infinite number of solutions for the un-
derdetermined system of equations
As = x ,
where the D × L matrix A (observation operator) with L > D maps the
unknown signal s into the observed signal x, the sparsity constraint makes
the particular linear inverse problem well-posed. A geometric interpretation
of why ℓ1–type sparsity regularization works well for signal recovery under
sparsity constraints is shown in Fig. 2.7. We want to find the optimal s as
the minimum ℓ1-norm vector that satisfies the constraint x = As, i.e. such that
the hyperplane does not intersect the interior of the ℓ1 ball. More generally, the problem
Figure 2.7: A geometric intuition into sparse priors. We seek the sparsest vector x ∈ R^N, under the ℓ1 norm in this case, that satisfies the linear constraint y = Φx, where Φ is a dictionary. The ℓ1 penalty corresponds geometrically to the rhomboid (‘ℓ1 ball’) and the linear constraint to the hyperplane. The shape of the rhomboid dictates the form of the solution. The optimal vector, x, is the one that touches the hyperplane, without the latter intersecting the rhombus. As can be seen from the figure, the ℓ1 norm necessarily drives all coordinates but one towards zero, leading to sparse solutions. (Figure from Baraniuk [9].)
can be stated (in the deterministic framework) as:

min_s ‖s‖₁  subject to  ‖As − x‖ < c
(Chen and Haykin, [33]), where x can be a “corrupted” (noisy, blurred, etc)
version of the original signal and c is a positive scalar constant that plays
a role similar to the noise variance in the probabilistic framework (Li et
al., [111]). In this case, the hyperplane becomes an orthotope (hyperrect-
angle), defining a “zone” in which the vertex of the ℓ1 ball must fall. Li et
al. [111] use k–means clustering to get an estimate of the basis, which is then
used in a linear programming algorithm in order to estimate the coefficients
of the representation.
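The second, linear-programming stage can be sketched with the standard split s = u − v, u, v ≥ 0, which turns the ℓ1 objective into a linear one. This is a generic basis-pursuit-style sketch using SciPy, not Li et al.'s implementation, and all names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def l1_decompose(A, x):
    """min ||s||_1 s.t. As = x, via the standard LP split s = u - v, u, v >= 0.

    The l1 objective becomes the linear objective sum(u) + sum(v).
    """
    D, L = A.shape
    c = np.ones(2 * L)
    A_eq = np.hstack([A, -A])                # encodes A(u - v) = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None))
    assert res.success
    u, v = res.x[:L], res.x[L:]
    return u - v

rng = np.random.default_rng(3)
D, L = 8, 12                                 # underdetermined: L > D
A = rng.standard_normal((D, L))
s_true = np.zeros(L)
s_true[[2, 7]] = [1.5, -2.0]                 # sufficiently sparse signal
x = A @ s_true
s_hat = l1_decompose(A, x)
```

By construction, the returned s_hat is feasible and has ℓ1 norm no larger than that of any other feasible vector, including the true sparse signal; for sufficiently sparse signals the two coincide, as discussed above.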
Lewicki and Sejnowski [109] introduce a probabilistic method for sparse
overcomplete representations. A Laplacian prior on the coefficients of the
basis was used, p(s_l) ∝ e^{−θ|s_l|}, enforcing parsimonious representations. They
then propose a gradient optimization scheme for maximum a-posteriori (MAP)
learning. For the linear model x = As + ε, with Gaussian observation noise
with variance σ2, we seek the most probable decomposition coefficients, s,
such that
ŝ = argmax_s p(x|A, s) p(s) .    (2.11)
The probability of a single data point is obtained by integrating out the
unknown signals, s:
p(x|A) = ∫ ds p(s) p(x|A, s) .
In order to derive a tractable algorithm, they make a Laplace approximation
to the data likelihood, by assuming that the posterior is Gaussian
around the posterior mode. This involves computing the Hessian

H = −∇_s∇_s log[p(s) p(x|A, s)] = (1/σ²) AᵀA − ∇_s∇_s log p(s) .

To make a smooth approximation of the derivative of the log-prior, and a
diagonal approximation to the Hessian, they then take p(s_l) ∝ cosh^{−θ/β}(βs_l),
which asymptotically approximates the Laplacian prior for β → ∞. Moreover,
a low noise level is assumed. The above approximations finally lead to the
gradient learning rule

ΔA = AAᵀ ∇_A log p(x|A) ≈ −A(I + zsᵀ) ,

where, again, z_l = ∂ log p(s_l)/∂s_l. Note that this has the same functional
form as the Infomax-ICA learning rule, however the basis matrix is generally
non-square here. In contrast to the standard ICA learning rule, where the
sources are estimated simply by s = Wx, where the unmixing matrix is
W = A−1, here we must use a nonlinear optimization algorithm in order to
estimate the coefficients, using Eq. (2.11). Due to the low-noise assumption,
the level of the observation noise is not estimated from the data and has
to be set manually. Lewicki and Sejnowski’s algorithm, however, is faster in
obtaining good approximate solutions than the linear programming method
and is more easily generalizable to other priors.
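The MAP estimation step of Eq. (2.11) can be sketched with a generic smooth optimizer, using the log cosh smoothing of the Laplacian prior mentioned above. Parameter values are illustrative, and Lewicki and Sejnowski use a tailored gradient scheme rather than the off-the-shelf optimizer below:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
D, L = 8, 16
A = rng.standard_normal((D, L))
A /= np.linalg.norm(A, axis=0)
sigma2, theta, beta = 1e-3, 1.0, 50.0        # illustrative values

s_true = np.zeros(L)
s_true[[1, 9, 14]] = [2.0, -1.0, 1.5]
x = A @ s_true + np.sqrt(sigma2) * rng.standard_normal(D)

def logcosh(z):
    # numerically stable log(cosh(z))
    z = np.abs(z)
    return z + np.log1p(np.exp(-2.0 * z)) - np.log(2.0)

def neg_log_post(s):
    # -log p(x|A,s) - log p(s): Gaussian likelihood + smoothed Laplacian prior
    r = x - A @ s
    return r @ r / (2.0 * sigma2) + (theta / beta) * np.sum(logcosh(beta * s))

res = minimize(neg_log_post, np.zeros(L), method="L-BFGS-B")
s_map = res.x
```

Since both the quadratic data term and the log cosh penalty are convex, the optimizer reaches the global MAP estimate; the expense of this inner optimization at every data point is exactly the cost the text contrasts with the simple s = Wx of square, noiseless ICA.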
Girolami [76] proposes a variational method for learning sparse representations.
In particular, his method offers a solution to the problem of analytically
integrating the data likelihood, for a range of heavy-tailed distributions.
Starting from the heavy-tailed distribution p(s) ∝ cosh^{−1/β}(βs), he derives a
variational approximation to the Laplacian prior by introducing a variational
parameter, ξ = (ξ_1, . . . , ξ_L), such that the prior p(s) = ∏_{l=1}^{L} exp(−|s_l|)
becomes p(s; ξ), with s|ξ ∼ N(s; 0, Λ) and Λ = diag(|ξ_l|). Then p(s) is the
supremum

p(s) = sup_ξ [ ∏_{l=1}^{L} ϕ(ξ_l) ] N(s; 0, Λ) ,

with ϕ(ξ) = exp(−|ξ|/2) √(2π|ξ|), as β → ∞. The above is derived using a
variational argument and using convex duality [95], [139]. In essence, what
this approximation means is that, at each point of its domain, the intractable
prior is lower-bounded tightly by a best-matching Gaussian with width pa-
rameter ξ, with this variational parameter being estimated by the algorithm
along with the model parameters. Using the above, the posterior takes a
Gaussian form. This enables him to derive an EM algorithm in order to infer
the sparse coefficients and learn the overcomplete basis vectors of the repre-
sentation. Girolami applies his sparse representation algorithm to the problem
of overcomplete source separation and achieves superior results compared to
the algorithm of Lewicki and Sejnowski.
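The variational bound can be checked numerically: every value of ξ gives a Gaussian lower bound ϕ(ξ)N(s; 0, |ξ|) on the Laplacian density e^{−|s|}, with the supremum attained at |ξ| = |s|. A small sketch:

```python
import numpy as np

def gauss(s, var):
    return np.exp(-s**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def phi(xi):
    # variational factor: phi(xi) = exp(-|xi|/2) * sqrt(2*pi*|xi|)
    return np.exp(-np.abs(xi) / 2.0) * np.sqrt(2.0 * np.pi * np.abs(xi))

s = 1.7
xis = np.linspace(1e-3, 10.0, 20000)
bounds = phi(xis) * gauss(s, xis)     # Gaussian lower bounds on exp(-|s|)
laplace_val = np.exp(-abs(s))
```

Algebraically this is just the bound |s| ≤ s²/(2ξ) + ξ/2, with equality at ξ = |s|; being able to replace the intractable prior by the best-matching Gaussian at each point is what makes the posterior Gaussian and the EM algorithm tractable.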
The problem of sparsely representing a data matrix described above is
a special case of the more general problem of recovering latent signals that
themselves have a sparse representation in a signal dictionary (Zibulevsky
et al., [176]). Many real-world signals have sparse representations in a
proper signal dictionary but not in the physical domain. The discussion
in Zibulevsky et al. is motivated by starting from the case of representing
sparse signals in the physical domain, depicted in Fig. 2.5, and then noting
that the intuition there carries over to the situation of sparsely recovering sig-
nals in a transform domain. The latter problem is one of the main problems
dealt with in this Thesis and is elaborated in Chapter 3.
2.5 The Importance of using Appropriate
Latent Signal Models
Many classical ICA algorithms, such as Infomax-ICA and FastICA, allow
the plug-in setting of the respective nonlinearity function in the system, as
mentioned above. For successful separation, the form of the nonlinearity
Figure 2.8: Effect of an incorrect source model specification. Left: true distribution; Middle: hypothesized distribution; Right: estimated distribution. (Figure adapted from Cardoso [27].)
must somehow match, as far as possible, the underlying (unknown) statistical
properties of the sources, such as their super- or sub-gaussianity. This was
first stated as “matching the neuron’s input-output function to the expected
distribution of the signals” in [16]. Since the estimating equations for the
mixing matrix and sources are coupled, the functional form of the nonlinear-
ity is critical for their correct estimation: an incorrect choice of nonlinearity
will lead to an incorrect estimation of the (un-)mixing matrix, which will map
the observations back to the source space incorrectly, etc. Cardoso [27] gives
a compelling example of how estimation can go wrong. Another example of
how classical ICA fails in separating sources in an image processing context
is given in Fig. 2.9 (from Tonazzini et al., [167]).
Remark 5 Tonazzini et al. use a Markov random field in order to impose
an image prior. However, the images of Fig. 2.9 (left) are actually also
prime examples of sparse sources. In [77] and [46], an extensive study of
how justified and robust ICA algorithms are for functional MR imaging of
the brain was conducted, and various simulations of fMRI “brain” activations
under well-controlled situations with shapes similar to those of ref. [167] were
performed that highlighted the need for alternative decomposition algorithms
that are effective for fMRI, based on sparsity. These are also studied in the
experimental section of Chapter 3 using a proposed new algorithm for sparse
BSS.
Figure 2.9: Effect of an incorrect source model specification for a blind image separation problem. Left: true sources (s_1, s_2); Middle: noisy mixtures (x_1, x_2); Right: estimated sources from ICA. The model clearly fails to recover the sources. In particular, one of the sources is not recovered at all.
It can be shown that the Infomax-ICA as well as the FastICA algorithms
are instances of maximum likelihood estimation [118], [141], [86], [103].
Under this interpretation, one can see that the nonlinearity, φ(·), is actually
the logarithmic derivative of the (hypothesized) probability density of the
sources (the ‘score’ function): for the l–th source, sl,
φ_l([Wx]_l) = − ∂ log p_l(s_l) / ∂s_l = − p_l′(s_l) / p_l(s_l) ,
where the symbol W denotes the separating operator from observation space
to source space and x is an observation. In other words, in a perfect match
the Infomax squashing nonlinearity is exactly the cumulative distribution function of the sources.
Of course we do not know the actual source PDFs, since the sources them-
selves are unobserved, but we may try to estimate them from the data. For
this purpose, we can employ a parameterized model source PDF, p_l(s_l; θ_{s_l}),
and learn, instead of fix, its parameters, θ_{s_l}, from the data. A flexible prior
that is at the same time mathematically tractable is a mixture distribution.
Lawrence and Bishop [105] use a MoG prior for ICA, albeit in a fixed form.
Attias [6] has used mixture of Gaussians (MoGs) as source models for blind
source separation under a maximum likelihood framework, leading to a flex-
ible algorithm dubbed ‘Independent Factor Analysis’ (IFA). Choudrey et al.
[36] and Lappalainen [102] use the same prior under an ensemble learning
approach, i.e. with a factorized posterior (the so-called ‘naive’ mean-field
method).
The l–th model source density, for an M_l–component model, can be expressed
mathematically as a linear combination of component densities,

p_l(s_{l,n} | θ_{s_l}) = Σ_{m_l=1}^{M_l} π_{l,m_l} N(s_{l,n}; μ_{l,m_l}, β_{l,m_l}^{−1}) ,    (2.12)

where (μ_{l,m_l}, β_{l,m_l}) are the mean and precision (inverse variance) parameters
of the m_l–th Gaussian component over s_{l,n} and {π_{l,m_l}}_{m_l=1}^{M_l} are mixing
proportions. Since the π_{l,m}’s are non-negative, the MoG is also a convex combination
of Gaussians. An example of a MoG distribution is shown in Fig. 2.11, which
models the natural image in Fig. 2.10.
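Eq. (2.12) is straightforward to evaluate numerically; the sketch below builds a three-component MoG on [0, 1]-scaled intensities (the parameter values are illustrative, not those actually fitted to ‘Cameraman’):

```python
import numpy as np

def mog_pdf(s, pis, mus, betas):
    """Mixture-of-Gaussians density as in Eq. (2.12); betas are precisions."""
    s = np.atleast_1d(np.asarray(s, dtype=float))[:, None]
    comps = np.sqrt(betas / (2.0 * np.pi)) * np.exp(-0.5 * betas * (s - mus)**2)
    return comps @ pis

# Three components on [0, 1]-scaled intensities (illustrative values):
pis   = np.array([0.5, 0.3, 0.2])     # mixing proportions, sum to 1
mus   = np.array([0.1, 0.5, 0.9])     # component means
betas = np.array([400.0, 100.0, 400.0])  # precisions (1 / variance)

grid = np.linspace(-1.0, 2.0, 3001)
p = mog_pdf(grid, pis, mus, betas)
```

With narrow components centred inside [0, 1], the density places practically zero mass outside that range, mirroring the behaviour described for Fig. 2.11.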
Each component in the mixture distribution6 may be interpreted as a
‘state’: for the n–th data instance, the state ξ_{l,n} = m_l ∈ {1, . . . , M_l} is
probabilistically chosen with prior probability p(ξ_{l,n} = m_l) = π_{l,m_l}. The
set θ_S = ⋃_l θ_{s_l} = {μ_{l,m_l}, β_{l,m_l}, π_{l,m_l}}_{m_l=1}^{M_l} are the source model parameters
that we shall learn. The collective state variable, ξ = (ξ_1, . . . , ξ_L), lives in
a state-space that is the Cartesian product of the state sets corresponding
to the individual sources: ξ ∈ Ξ = {1, . . . , M_1} × · · · × {1, . . . , M_L}, with
M = |Ξ| = ∏_l M_l. For high-dimensional problems this model may become
6Note the difference between a mixture of probability densities, p = π_1 p_1 + π_2 p_2 + . . ., and a mixture of random variables, x = a_1 s_1 + a_2 s_2 + . . ., in this context.
Figure 2.10: ‘Cameraman’ standard test image. The pixel values were scaled to be in the range [0, 1].
Figure 2.11: Density modelling using the MoG model. Empirical histogram (left) and estimated PDF by the MoG (right) of the real-space pixel values of the ‘Cameraman’ image. Three Gaussian components were used for this image, denoted by the dashed curves. The resulting mixture density is denoted by the thick black curve. The adaptive image prior correctly infers the PDF. In particular, notice how the MoG assigns practically zero probability measure outside the range [0, 1]. This model can be estimated by an EM algorithm, introduced in Sect. 2.3.1.
computationally demanding, since the size of the product space can be very
large, and in this case a variational approximation can be utilized [6].
Remark 6 Going back to the ‘Cameraman’ image and its MoG model, we
note that in general, the number of components, Ml, should ideally reflect
the number of “important” objects in the image. However, since the MoG
density model is a map from physical (geometric) space to intensity frequency
space, the geometric information is lost. Therefore, the above is feasible only
if the various objects have quite different radiometry. But see section 5.2.2,
where the MoG is used as a feature-space density model.
In addition to multimodal distributions, a MoG density can also be used
for modelling sparsity. This latent signal model will be elaborated in sec-
tion 5.2.2, where a fully Bayesian model for sparse matrix factorization is
developed.
2.6 Data modelling: A Bayesian approach
In order to obtain the distribution of the data given the model parameters
we must ‘integrate out’ the latent variables,
p(x|A, σ²) = ∫ ds p(s) p(x|s, A, σ²) ,    (2.13)

where the second factor in the integral, the likelihood of the data under the
model, captures the stochastic map s ↦ x given the parameters of the model,
(A, σ²). Unfortunately, this integration is in general intractable. There are
several ways to approximate p (x|A, σ2), for example by a Laplace approxi-
mation, introducing additional variational parameters, using mixture models,
or by sampling methods.
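The integral in Eq. (2.13) is easy to appreciate numerically. For a Gaussian prior the marginal happens to be available in closed form, which lets us check a plain Monte Carlo estimate against it; for the sparse priors of interest no such closed form exists, hence the approximations listed above. All dimensions and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
D, L, sigma2 = 2, 3, 0.5
A = rng.standard_normal((D, L))
x = rng.standard_normal(D)

# p(x | s, A, sigma^2): isotropic Gaussian noise model, evaluated for many s
def likelihoods(S):
    R = x - S @ A.T                       # residuals, one row per candidate s
    return np.exp(-np.sum(R**2, axis=1) / (2 * sigma2)) / (2 * np.pi * sigma2) ** (D / 2)

# Monte Carlo: average the likelihood over draws from the (Gaussian) prior
S = rng.standard_normal((100_000, L))
mc = likelihoods(S).mean()

# With a Gaussian prior the marginal is exactly N(x; 0, A A^T + sigma^2 I)
Sigma = A @ A.T + sigma2 * np.eye(D)
exact = np.exp(-0.5 * x @ np.linalg.solve(Sigma, x)) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))   # (2*pi)^{D/2} with D = 2
```

The sampling estimate converges to the exact marginal only slowly, which illustrates why naive Monte Carlo becomes impractical in the high-dimensional settings discussed below.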
Bayesian inference provides a powerful methodology for CA [147], [99]
by incorporating uncertainty in the model and the data into the estimation
process, thus preventing ‘overfitting’ and allowing a principled method for
using our domain knowledge. The outputs of fully Bayesian inference are
uncertainty estimates over all variables in the model, given the data.
Bayes’ theorem provides a “recipe” for updating our prior assignments in
order to incorporate the observed data:
p(M|D; I) = p(D|M; I) p(M|I) / p(D|I) ,    (2.14)
where M is a model, D is the data, and I is any background information
we may have. The probability p(M|I) is the prior for M, capturing our
knowledge or expectations about the problem and encoding our “weight”
assignments about the plausible range of values that variables in the model
can take prior to observing any data, p(D|M; I) is the likelihood, that is
the probability of the data under our model, and p(M|D; I) is the posterior,
that is the probability of the model after we have observed the data. In the
Bayesian approach to data modelling, a ‘model’ is the collection of likelihood
and prior specifications for all random variables in the model, along with any
functional relationships among them.
The denominator, p(D|I), is called the ‘evidence’, or marginal likelihood,
and it is a very important quantity: besides it being a normalizing constant,
it captures all the essential information of the probabilistic system7. This
can be better seen if we write it as
p(D|I) = ∫ dM p(M|I) p(D|M; I) .    (2.15)
7The Bayesian evidence is the equivalent of the ‘partition function’, usually denoted by Z, in Statistical Mechanics. From it several other quantities, such as the expected energy, can be derived.
Besides its conceptual elegance, in practical terms Bayes’ theorem offers us a
framework for learning from data as inductive inference. Bayesian modelling
offers the following advantages:
• It forces us to make all our modelling decisions explicit.
• It is a consistent framework for learning under uncertainty.
• It offers modularity, since, due to its probabilistic foundation, models
can be combined or extended in a hierarchical fashion.
• Bayesian models can be extended to learn various hyperparameters,
controlling the distributions of other quantities. Indeed, one may start
from a simpler model (e.g. one whose properties are well known) and
extend it afterwards.
The notion of sparsity also meshes well with the view of parsimony in the
Bayesian approach to data analysis, which automatically embodies the notion
of ‘Ockham’s razor’ (MacKay, [116]). This is analogous to the information
theoretic notion of ‘minimum description length’ (MDL) of Rissanen [146]
and offers a way of measuring and controlling complexity. This has effects
on modelling since, as pointed out by Neal [131], in the Bayesian framework
one should not artificially constrain analysis to simple models, as the use
of hierarchical modelling readily offers a mechanism for complexity control.
From a modeling point of view, these are also tools that can be used in order
to avoid ‘overfitting’ in the absence of detailed information (‘ignorance’) and
in the presence of noise.
Now, in Bayesian terms, inference is calculating the posterior probability
of the latent variables in a probabilistic model and learning is calculating the
posterior over the model parameters. Since in full Bayesian modelling both
latent variables and parameters get a prior distribution and are dealt with on an
equal footing, these two computational processes are essentially the same.
Bayesian inference and learning, however, requires performing integra-
tions in high-dimensional spaces that, if tackled by sampling methods like
Markov chain Monte Carlo, can be very time-consuming, or even practically
impossible, for large data sets. A mathematically elegant alternative, pro-
viding many of the benefits of sampling in terms of accuracy without the
computational overhead, is offered by the variational Bayesian method [7].
This is the inference method that we are going to use in this text.
This section provided a high-level overview of the philosophy and basic
principles of the Bayesian approach. More detailed explanations will be given
in the subsequent chapters and specific Bayesian methods will be introduced
on an as-needed basis.
Chapter 3
An Iterative Thresholding
Approach to Sparse Blind
Source Separation
3.1 Blind Source Separation as a Blind In-
verse Problem
Many problems in signal processing, neural computation, and machine learn-
ing can be stated as inverse problems, when the quantities of interest are not
directly observable but rather have to be inferred from the data [33]. Ex-
amples include signal deconvolution, density estimation, as well as many
problems in neuroimaging. Inversion procedures involve the estimation of
the unobserved generative variables (states, sources, or “causes”, in an ab-
stract sense) and/or parameters of a system from a set of measurements. If
we collectively denote the space of unknowns by X and the data space by
Y , we can symbolically write the process as shown in Fig. 3.1, where the
Forward problem
Inverse problem
Causes Effects
Figure 3.1: Conceptual scheme for forward/inverse problems.
rightwards arrow denotes the generative direction and the leftwards arrow
the direction of the inversion. (This is to be contrasted with computing
the data given their generators, which is called the ‘forward’ problem.) In
this Chapter we emphasize the fact that blind signal separation, the separa-
tion of mixtures into their constituent components, is inherently an inverse
problem, and that its solution can be derived from the general principles of
this class of problems. In particular, we will focus on the problem of blind
signal separation under sparsity constraints.
The most important characteristic of inverse problems is that they are
typically ill-posed (“naturally unstable”) [92]. A problem is said to be well
posed, in the Hadamard sense, if there is existence, uniqueness, and conti-
nuity (stability) of the solutions. If any of these conditions is not met, a
problem is called ill-posed. A critical consequence is that in inverse problems
even a minute amount of noise or error might get exceedingly amplified by
the inversion, if simple-minded, producing nonsensical results.
At first sight, solving an inverse problem might seem like a hopeless task
mathematically. However, a lot can be gained by appropriately constraining
it, as this forces the inversion algorithm to stay in meaningful regions of
the space of unknowns. What “meaningful” is depends, of course, on the
particular problem domain. Knowledge of special properties of the problem,
captured in the form of prior constraints, can be extremely helpful in these
analyses, as they lead to regularization, making the problem well-posed [25],
[33].
In classical regularization theory (Tikhonov and Arsenin, [165]), unique,
stable solutions are sought by requiring certain properties such as smooth-
ness and/or simplicity. The general idea there is to formulate optimization
functionals, also known as Lagrangians, of the form

I = ‖Ax − y‖² + λ ‖Bx‖² ≡ E_D + λ E_P ,    (3.1)

where the penalty term, ‖Bx‖², captures our prior knowledge or expectations
about the solution, x ∈ X , by penalizing unwanted properties of the solution,
controlled by the operator B, while the first term represents the data-fit.
The ‘forward’ linear operator, A, maps the unknowns to the observation
space. The λ factor is a Lagrange multiplier, balancing the effect of the two
terms on the solution1. This idea has found numerous applications in science
and engineering2. See also Chen and Haykin [33] for an excellent review of
regularization theory in the context of learning and neural networks.
1To see how regularization can stabilize the problem, rewrite the functional I of Eq. (3.1) as (see [25])

‖y − Ax‖² + λ‖Bx‖² = ‖ [y; 0] − [A; √λ B] x ‖² ,

where [y; 0] and [A; √λ B] denote vertical stacking. The optimizer of the functional is then the (least-squares) solution of the linear system

[A; √λ B] x = [y; 0] .

Under certain mild conditions on the matrix B (such as that the norm ‖Bx‖ is bounded), the augmentation of the matrix A by √λ B considerably reduces the ill-posedness of the problem.

2In computer vision, for example, Poggio [142], Blake and Zisserman [18], and many others have addressed problems in early vision, such as shape from shading, surface reconstruction, etc. [18], by formulating regularized functionals of the form of Eq. (3.1). Girosi et al. [59] pioneered the use of regularization theory in the machine learning community.
Blind Source Separation
In the blind source separation (BSS) problem, a set of sensor signals re-
sults from mixing a set of unobserved ‘sources’ and is possibly contaminated
by noise. The goal is to recover/reconstruct the sources (‘unmix’ the ob-
servations) without any, or very little, knowledge of the sources themselves,
the mapping from sources to observations, or the noise process [98]. By its
definition, BSS is therefore inherently a multidimensional inverse problem.
The ‘cocktail party problem’, regarding the separation of mixed speech
signals from people talking simultaneously in a room, described in the Intro-
duction, is the prototypical BSS problem. Many other interesting problems in
biomedical signal processing, financial time-series analysis, and telecommu-
nications can also be cast as BSS problems [87]. In functional neuroimaging,
McKeown et al. [125] proposed the decomposition of spatio-temporal fMRI
data into independent spatial components and their associated time-courses
of activation using BSS techniques. Since then, numerous studies in fMRI
have been conducted under this decomposition framework, with considerable
success. BSS offers a powerful alternative to ‘model-based’ approaches, such
as the general linear model (GLM), for exploratory (data-driven) data anal-
yses of complex data sets for which precise information about the temporal
or spatial structure of the sought-for processes is either not available or is
itself under investigation.
Mathematical Model of Source Separation
In order to make the above abstract description of BSS mathematically and
computationally tractable, one typically needs to formulate a model of the
process —a model is a formal way of “thinking about data”— as well as an
algorithm for its inversion. A simple observation model, which nevertheless
works surprisingly well in a wide range of cases, assumes that the data is a
linear mixture of underlying sources with an additional ‘noise’ term: for the
dth sensor and the nth data point,
∀d : x_d[n] = μ_d + ∑_{l=1}^{L} A_{dl} s_l[n] + ε_{x_d}[n], ∀n ,   (3.2)
where {x_d}_{d=1}^{D} are the D sensor signals, {s_l}_{l=1}^{L} the L unobserved sources,
μ_d is the mean of the dth observation, and ε_{x_d} is the observation noise, or
error, term. That is, the models we consider here assume a noisy linear
transformation but with the crucial difference that the observation operator,
A = [A_{dl}], called ‘mixing matrix’ in BSS, is also unknown (hence the
characterization “blind” for this particular class of inverse problems). Indeed, we
can think of problems such as these as multiple regression with missing in-
puts (the sources). Collecting all signals, for all sensors and sources, together
the linear BSS model can be written in matrix form as
X = μ 1_N^T + A S + E_X ,   (3.3)
with X ∈ RD×N , S ∈ RL×N , µ ∈ RD and A ∈ RD×L, where the indices d
and n now denote the rows and columns of the data matrix X, respectively.
We emphasize that we aim at estimating mixing matrices of arbitrary size
D × L, not just square ones, here. We will assume that the observations
have been de-meaned and fix μ at the empirical mean of the observations,
μ ≐ X̄ (assuming zero-mean sources; see next)3 in this chapter. The image of S
under the linear operator A can be thought of as the “clean” data, X̂, which
is subsequently corrupted by EX. Source separation algorithms in general
3 That is, X is replaced by X − μ1_N^T, and we shift the observation coordinate system to the data centroid. The mean can always be added back if required. From now on, the symbol X will be taken to implicitly mean the de-meaned observations, unless stated otherwise.
may, or may not, include an explicit model for the noise.
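As an illustration of the generative model of Eq. (3.3) (a toy sketch, not from the thesis; the dimensions, source distribution, and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
D, L, N = 4, 3, 1000                    # sensors, sources, samples (toy sizes)

S = rng.laplace(size=(L, N))            # zero-mean, heavy-tailed sources
A = rng.normal(size=(D, L))             # the (unknown; here simulated) mixing matrix
mu = rng.normal(size=(D, 1))            # per-sensor means
E_X = 0.05 * rng.normal(size=(D, N))    # additive observation noise

X = mu + A @ S + E_X                    # Eq. (3.3): X = mu 1_N^T + A S + E_X

# De-meaning, as assumed in the text: shift to the data centroid.
X_centered = X - X.mean(axis=1, keepdims=True)
assert np.allclose(X_centered.mean(axis=1), 0.0)
```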
The blind source separation problem is clearly ill-posed, since both fac-
tors, Adl and sl(t), in equation (3.2) are unknown, and its solution is non-
unique as it stands [79]. This is particularly true for the so-called ‘over-
complete’ case, where the number of sources is greater than the number of
observations. In this case, even if we knew the mixing matrix exactly, we
still wouldn’t be guaranteed a successful reconstruction, in purely algebraic
terms, since the system of equations to be solved is underdetermined. More-
over, noise could destabilize the inversion.
As in all inverse problems, the success of source separation hinges on
choosing an appropriate model for the unknown signals, the ‘source model’,
and on constructing an efficient inversion algorithm. It is well known (see e.g.
[148]) that classical, quadratic constraints cannot be used for blind source
separation. Quadratic constraints (or log–priors in a Bayesian formulation)
cannot lead to source separation but merely to decorrelation, since they only
exploit up-to second-order statistics in the data (see [16]); higher-order statis-
tics are needed for the BSS problem.
Optimizing for Independence vs Optimizing for Sparsity
Like many other learning problems, separating signals into their constituent
components can be formulated as an optimization problem, incorporating
a variety of appropriate penalties. Hochreiter and Schmidhuber [84] notice
that source separation can be obtained as a “by-product” of regularization
in a low-complexity autoencoder network, establishing a link between regu-
larization and unsupervised learning in the context of BSS. Indeed, most
approaches to blind source separation, such as the independent component
analysis (ICA) family of models, also make analogous prior assumptions,
implicitly or explicitly. ICA forces independence among the components,
P(s) = ∏_{l=1}^{L} P_l(s_l), where P(s) is the assumed distribution of the sources.
This enables it to separate statistically independent sources (up to permu-
tation and scaling). Often the separation is performed under a maximum
likelihood framework [118], [169].
Crucially, however, the source separation algorithms in the ICA family
work under the assumption of mathematical independence among the com-
ponents. This is a very strong condition. Independent random variables are
uncorrelated; the converse is not true. In many physical and biological sys-
tems, such as brain fMRI, there is no physical reason for the components
to correspond to different activity patterns with independent distributions.
Although one can always map the data in a coordinate system such that
the representation is least-dependent, this does not necessarily lead to in-
terpretable components. In many cases, other conditions or priors can be
more physically plausible or efficient than the condition of mathematical in-
dependence. A substantial body of experimental and theoretical research4
(Field, [62]; Olshausen and Field, [137]; Donoho, [50]; Zibulevsky and
Pearlmutter, [176]; Li, Cichocki and Amari, [111]) suggests that sparsity with
respect to appropriate ‘dictionaries’ of functions/signals might be an under-
lying property of many information processing systems. Moreover, recent
work in neuroscience (Asari, [3]) and imaging neuroscience (Daubechies et al.,
[46]) shows how sparse representations directly facilitate computations of
interest in biological sensory processing, and that activations inferred from
functional MRI data have sparse structure [46], [30]. For
a sketch of the sought-for brain activations measured by fMRI see Fig. 1.1
and the description of Golden [77]. In all of the works referenced above the
signals have a natural sparse structure5.
4 See section 2.4 for a more extensive literature review.
5 An important research question is whether sparsity represents something more fundamental, i.e. an organization principle, in physical and biological information processing systems. This is an area of active current research, but it is beyond the scope of this Thesis.
Remark 7 (The Sparseness Principle) Sparsity reflects the fact that the
basis functions used represent those natural signals most parsimoniously. In-
tuitively, the idea is that if the elements of the dictionary match the charac-
teristics or features of a signal, then we only need a few of them to describe
it. We can view these dictionary elements as “building blocks” that are used
to construct more complex signals.
Although in some cases the above separation problem can be solved by
independent component analysis, viewed as a technique for seeking sparse
components under appropriate source distributions, as shown by Daubechies
et al. in [46], in other cases this approach fails. We propose to
explicitly optimize for sparse (or minimum entropy, in the terminology of Bar-
low [12]) components in appropriate signal dictionaries, instead of optimizing
for independence. This is captured by formulating appropriate optimization
functionals, which are accordingly efficiently optimized by an algorithm spe-
cially constructed for this purpose. While classical regularization-theory con-
straints cannot themselves be used for BSS, for the reasons discussed above,
they provide a conceptual starting point in our search for regularizers, as
they are often related to the, more generally desirable, property of smooth-
ness. Smoothness is often a highly plausible constraint physically. Moreover,
inverse problems are usually ill-posed (in the sense of Hadamard), there-
fore sensitive to noise: a simple-minded inverse will amplify the noise and
swamp solutions. Consequently, some kind of smoothness constraint is com-
monly imposed. However, “plain” ℓ2 solutions, or solutions that are based
on smoothness only, such as Fourier bases or truncated SVD (see e.g. D.
O’Leary, [135]), may miss discontinuities and small features. A more general
definition of smoothness that allows for discontinuities and other singulari-
ties will be introduced next and a better basis, based on wavelets, will be
utilized in this text. As we shall see in subsection 3.2.2, one can go from that
requirement to a class of smoothness spaces, called Besov spaces, and finally
to wavelets, building blocks for signals offering sparse solutions.
3.2 Blind Source Separation by Sparsity Max-
imization
Before we elaborate on our approach for sparse BSS, we quickly give a very
concise summary of some key approaches to sparse signal reconstruction that
are highly relevant to our model next. The rationale of signal reconstruction
utilizing sparsity was introduced in Sect. 2.4.2. Here we make the connection
with some of the methods from the sparsity literature that can be regarded
as direct antecedents of some of the ideas in this Chapter. We then state the
goal of sparse component analysis (SCA).
3.2.1 Exploiting Sparsity for Signal Reconstruction
Unsupervised learning algorithms for sparse decompositions have received
much attention in recent years. In their seminal paper [137], Olshausen and
Field model natural images as a linear superposition of basis functions, φ_k,
modulated (multiplied) by coefficients, a_k, termed ‘activities’6:
I(x, y) ≈ ∑_k a_k φ_k(x, y) ,
where (x, y) is an image coordinate or picture element (pixel). The goal is to
6 Note that the activities, a, here correspond to the sources in the BSS problem and the bases, φ, to the columns of the mixing matrix.
Figure 3.2: Sparsity function, S(·), of the model of Olshausen and Field [137],
and its derivative, S′(·). Note that the shape of the derivative is almost a
sigmoidal function. The latter encourages their learning algorithm to search
for sparse solutions.
learn a sparse ‘code’, a = (a_1, …, a_k, …, a_{|Φ|}) (where |·| here denotes the size
of a collection of variables), for natural images, as well as adaptively learn the
bases φ_k themselves from the data. The bases may be non-orthogonal. This is
achieved by constructing a non-quadratic energy cost function, parameterized
by Φ = [φ_k],
E = E(a; Φ) = ‖I − Φa‖²₂ + λ · ∑_k S(a_k) ,   (3.4)
(the first term measuring information preservation, the second the sparseness of the ‘activities’),
where S : a 7→ S(a) is a sparsity function, penalizing non-sparsity of the
activities. The penalty term is then the sum of sparseness of the activities
vector, a, while the first term computes the reconstruction error, and en-
sures that information is maximally preserved. The activity vector records
the configuration of weights modulating the basis nodes, {φ_k}, for a
particular image instance. By putting a sparse prior on a we obtain a sparse
representation of I in Φ, as was shown in Fig. 2.3. Olshausen and Field
choose S(a) = log (1 + a2) as the sparsity function in their implementation
(see Fig. 3.2); however, there is a variety of other sparsity-inducing priors
that can be used (see section 3.2.2 later). A striking result of their model
is its biological plausibility: the learned basis functions emerging from the
model resemble the spatially localized and oriented filters that are found in
the mammalian primary visual cortex, V1 [137].
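A minimal numerical sketch of the energy (3.4), using the S(a) = log(1 + a²) sparsity function of Fig. 3.2, is given below. This is illustrative only: the basis Φ, the toy image, the step size, and the value of λ are ad hoc assumptions, and a plain gradient step stands in for the authors' actual learning procedure.

```python
import numpy as np

def S(a):                          # Olshausen & Field sparsity function
    return np.log(1.0 + a ** 2)

def S_prime(a):                    # its derivative: a bounded, sigmoid-like curve
    return 2.0 * a / (1.0 + a ** 2)

def energy(img, Phi, a, lam):      # Eq. (3.4): reconstruction error + sparsity
    r = img - Phi @ a
    return r @ r + lam * S(a).sum()

rng = np.random.default_rng(2)
Phi = rng.normal(size=(64, 128))                       # overcomplete basis (toy)
img = Phi @ (rng.random(128) < 0.05).astype(float)     # built from few active bases
a = np.zeros(128)
lam, eta = 0.1, 0.001

E0 = energy(img, Phi, a, lam)
for _ in range(200):               # plain gradient descent on the activities
    grad = -2.0 * Phi.T @ (img - Phi @ a) + lam * S_prime(a)
    a -= eta * grad
assert energy(img, Phi, a, lam) < E0   # descent reduces the energy
```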
While our approach is partially inspired by the work of Olshausen and
Field, our problem is different: our application is source separation, and we
aim at learning a sparse representation of whole images, rather than image
patches and filters. Moreover, in contrast to Olshausen and Field, we will
assume that it is the expansion coefficients of the sources with respect to
a signal dictionary, not (necessarily) the sources themselves, that are mostly
zero (inactive), for the reasons that will be explained shortly. This will lead
to a different, hierarchical optimization functional7.
Donoho [50] reviews the paper of Olshausen and Field and, based on
their findings, proposes a research programme for sparse decompositions of
objects that belong to certain classes with well-describable geometrical char-
acteristics. This way, precise control over the decomposition can be achieved.
Although his class of objects is rather artificial, the general idea can have
a wider applicability. In this text, we propose an analogous approach, by
utilising “soft constraints” on the properties of the sought-for latent signals.
This will become more precise in Sect. 3.2.2.
3.2.2 Exploiting Sparsity for Blind Source Separation:
Sparse Component Analysis
Sparse BSS is based on the following key observation:
Remark 8 In the context of source separation, the key observation is that
7 The model of Olshausen and Field can be considered the “baseline” case for sparsity, i.e. the case of a signal that is sparse in physical space (e.g. in the time domain). Here the dictionary, Φ, is the ‘canonical’ basis, B = {e_t}, where e_t = (…, 0, 1, 0, …) with 1 in the t-th position and 0 everywhere else. This is equivalent to saying that the sparseness prior is on the values of the signal itself, rather than on a representation in a dictionary.
the mixing process generates signals that are typically less sparse than the
original ones. This suggests that one could formulate a learning algorithm
for the inversion problem that seeks components that are maximally sparse,
in order to recover the sources.
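This observation is easy to check numerically. In the sketch below (illustrative; the Laplacian sources, the mixing matrix, and the ℓ1/ℓ2 ratio used as an inverse sparsity index are all ad hoc choices), each mixture comes out measurably denser than either source:

```python
import numpy as np

rng = np.random.default_rng(3)

def l1_l2_ratio(s):
    # Inverse sparsity index: smaller values indicate sparser signals.
    return np.abs(s).sum() / np.sqrt((s ** 2).sum())

sources = rng.laplace(size=(2, 20000))        # sparse, heavy-tailed sources
A = np.array([[1.0, 0.8], [0.6, 1.0]])        # an ad hoc mixing matrix
mixtures = A @ sources

# Every mixture is denser (larger l1/l2 ratio) than every original source.
for mix in mixtures:
    for src in sources:
        assert l1_l2_ratio(mix) > l1_l2_ratio(src)
```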
We emphasize that sparseness is a relative property, with respect to a basis
set. In the general case, signals will not be sparse in the original observation-
domain but in a transformed domain, with respect to more complex build-
ing blocks. More general bases, or ‘dictionaries’, than the canonical basis
will therefore be needed to efficiently describe the signals [176], [111]. We
elaborate on this in subsection 3.2.2, ‘Sparse representation of signals using
wavelets’. The goal of Sparse Component Analysis with dictionary-based
priors can now be stated in the following definition (adapted from Zibulevsky
and Pearlmutter, [176]) as
Definition 1 (Dictionary-based Sparse Component Analysis) Given
an observed data set, {x_d}_{d=1}^{D}, find the sources, {s_l}_{l=1}^{L}, and the mixing
matrix, A ∈ R^{D×L}, such that the sources are sparsest, with respect to a given
signal dictionary, under the noisy linear model x_d ≈ ∑_{l=1}^{L} A_{dl} s_l.
We will develop algorithms that exploit the sparsity of signals8 to perform
blind source separation and signal reconstruction.
Sparse Representation of Signals using Wavelets
In this section we answer the question “what kind of dictionary?”. In par-
ticular, this section aims to provide some insight into how the combination
of certain concepts that will be introduced next allows us to exploit sparsity
for signal reconstruction. We only give the essential information here; more
theoretical background on wavelets and smoothness spaces can be found in
e.g. [44].
8By the term “signals” we mean the unknown signals (sources) here.
It is the interplay of three ingredients that allows us to exploit sparsity
for signal reconstruction, namely that:
• Many signal classes contain certain types of structure that can be cap-
tured by mathematical models. These signal classes can be quite gen-
eral, such as the class of finite-energy signals, for example. We will
discuss such a class here.
• Appropriate dictionary elements can be used as building blocks for sig-
nals, capturing the salient features of those signals efficiently.
• Certain regularizers/priors can implement and enforce the sparsity con-
straint on the expansion coefficients of the signals in those dictionaries
(see also Fig. 2.7).
We review some of the basic properties of wavelets that are of importance
to our framework next. Wavelets, and other wavelet-like function systems9,
are a natural way to model sparsity and smoothness for a very broad class of
signals [121]. They form ‘families’ of localized, oscillating, bandpass functions
that are dilated and shifted versions of a special function called ‘mother
wavelet’, ψ. As such, they inherently contain a notion of scale, the whole
family spanning several scales from coarse to fine. This property allows them
to form multi-resolution decompositions of any finite-energy signal, f .
In the classical setting, wavelets together with a carefully chosen, lowpass
function, the ‘scaling function’, φ, can form orthonormal bases for L2(R).
Let us denote such a basis by D = {φ_k} ∪ (∪_j {ψ_{j,k}}), where j denotes
, where j denotes
the scale and k the shift. We can then perform wavelet expansions (inverse
9 Such as wavelet packets, local cosine bases, curvelets, ridgelets, and a variety of other bases.
wavelet transforms) of the form
c ↦ f s.t. f = ∑_k c^{(φ)}_k φ_k + ∑_j ∑_k c^{(ψ)}_{j,k} ψ_{j,k}, j, k ∈ Z,   (3.5)
where10 ϕ_λ ∈ {φ_k(t) ≡ φ(t − t_k), ψ_{j,k}(t) ≡ 2^{j/2} ψ(2^j t − t_k)}_{j,k} is an
element in D, and the scaling and wavelet coefficients,
c_λ ∈ C = {c^{(φ)}_k} ∪ (∪_j {c^{(ψ)}_{j,k}}),
known as the ‘smoothness’ (or ‘approximation’) and ‘detail’ coefficients,
respectively, are computed by taking inner products (measures of similarity)
of the corresponding bases11 with the signal:
∀λ : c_λ ≡ f_λ = 〈f, ϕ_λ〉 .   (3.6)
Wavelet bases in higher dimensions can be built by taking appropriate prod-
ucts of one-dimensional wavelets. An example of the wavelet decomposition
of an image is shown in Fig. 3.3. The above construction provides a multi-
scale representation of a signal in terms of a wavelet dictionary. There exist
computationally efficient, linear time complexity algorithms for performing
forward and inverse discrete wavelet transforms, utilizing pairs of perfect
reconstruction filters [120].
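The following toy one-level Haar analysis/synthesis pair (an illustrative stand-in for a proper filter-bank implementation; the test signal is an arbitrary choice) shows both the perfect-reconstruction property and the resulting sparsity of the detail coefficients for a piecewise-smooth signal:

```python
import numpy as np

def haar_analysis(f):
    """One level of the orthonormal Haar transform."""
    c = (f[0::2] + f[1::2]) / np.sqrt(2.0)   # smoothness (approximation) part
    d = (f[0::2] - f[1::2]) / np.sqrt(2.0)   # detail part
    return c, d

def haar_synthesis(c, d):
    f = np.empty(2 * len(c))
    f[0::2] = (c + d) / np.sqrt(2.0)
    f[1::2] = (c - d) / np.sqrt(2.0)
    return f

# A piecewise-smooth signal: a smooth oscillation plus one jump discontinuity.
t = np.linspace(0.0, 1.0, 256)
f = np.sin(2 * np.pi * t) + (t > 0.503)

c, d = haar_analysis(f)
assert np.allclose(haar_synthesis(c, d), f)     # perfect reconstruction
assert np.mean(np.abs(d) < 0.05) > 0.9          # almost all details are tiny...
assert np.abs(d).max() > 0.5                    # ...except near the jump
```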
An intuitive way of defining the smoothness of a signal is to use some
kind of generalized derivative and quantify roughness by the “jumps” of
the values of the function along an interval. Hölder regularity gives such
a measure: a signal, f, has Hölder regularity with exponent α = n + r,
with n ∈ N and 0 ≤ r < 1, if there exists a constant, C, such that
10 To unclutter notation, we will use a single generic index, λ, to index bases and their corresponding coefficients, and the symbol ϕ_λ for the bases, when appropriate, since the elements in D can be ordered as λ = λ(b, j, k) = 1, …, Λ = |D|. The symbol b indexes the ‘subband’, i.e. the subset of wavelets corresponding to a particular direction, in higher-dimensional wavelets; see Fig. 3.3, for example.
11 More precisely, the dual (analysis) wavelet, ϕ̃_λ, which, in the case of an orthonormal basis, is identical to ϕ_λ.
Figure 3.3: Wavelet coefficients of the ‘Cameraman’ image, c_λ, arranged in the Mallat ordering, λ ∈ M_J, for J = 4 levels of decomposition. (In the Mallat ordering, the coefficients of the multiscale decomposition are placed in a rectangular arrangement such that the upper left corner is occupied by the smoothness coefficients and the detail subbands are placed perimetrically around them, completing the square, starting from the largest scale and going outwards to the smallest scale.) The physical-space image of Fig. 2.10 is now represented/decomposed in a relatively small number of ‘primitives’ (building blocks), the wavelet and scaling bases. Wavelets act as multiscale edge detectors: the vertical, horizontal and diagonal elements of the image are clearly visible. We see that the physical-domain image is transformed into a space where far fewer bases are needed in order to describe the signal, as is evident from the ratio of zero (black) versus non-zero coefficients.
|f^{(n)}(x) − f^{(n)}(y)| ≤ C |x − y|^r, x, y ∈ R. In words, the difference in the
values of the n–th derivative in an interval, x − y, is smaller than a cer-
tain multiple of the interval in the domain of the function, raised to the
rth power. Functions with a large Hölder exponent will be smooth. Note
that the Hölder exponent can be non-integer, offering a fine-grained definition
of smoothness. The crucial link between Hölder regularity and wavelets can
be expressed as follows (Daubechies, [44]): if a signal is smooth, with local
Hölder regularity with exponent α, in a neighborhood (x_0 − ε, x_0 + ε) around
x_0 ∈ R, for some ε > 0, such that the neighborhood is covered by a set of
wavelets {ψ_{j,k}} with corresponding index-set I, (x_0 − ε, x_0 + ε) ⊂ supp(ψ_{j,k}),
where ‘supp’ denotes the support, then the largest wavelet coefficient at each
scale j will be smaller than a multiple of (2^{−j})^{α + 1/2}, for all (j, k) ∈ I. That
is to say, smooth signals lead to small values for the wavelet coefficients of
the expansion. The above helps us transfer our attention from the concept
of smoothness expressed as Hölder regularity in physical space to wavelets.
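The decay of the wavelet coefficients with scale for smooth signals can be observed directly. The sketch below (illustrative only; a toy recursive Haar analysis on a sinusoid, not code from the thesis) shows the largest detail coefficient shrinking as the scale gets finer:

```python
import numpy as np

def haar_details_by_scale(f, levels):
    """Recursive Haar analysis; returns detail coefficients per scale, fine to coarse."""
    details, c = [], f
    for _ in range(levels):
        details.append((c[0::2] - c[1::2]) / np.sqrt(2.0))
        c = (c[0::2] + c[1::2]) / np.sqrt(2.0)
    return details

t = np.linspace(0.0, 1.0, 512)
f = np.sin(2 * np.pi * t)                       # a globally smooth signal

details = haar_details_by_scale(f, 5)
peaks = [np.abs(d).max() for d in details]      # index 0 = finest scale

# The largest detail coefficient shrinks toward the finer scales, consistent
# with the (2^{-j})^{alpha + 1/2} bound for smooth signals.
assert all(peaks[i] < peaks[i + 1] for i in range(len(peaks) - 1))
```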
In terms of the above expansion, sparsity means that most coefficients
will be “small” and only a few of them12 will be significantly different than
zero13, resulting in very sparse representations of signals. In this sense, a
regularizer that enforces sparsity will therefore correspond to a ‘complexity’
prior 14 (Moulin and Liu, [129]) on the function f . Furthermore, in the stan-
12 These large coefficients will be typically located around discontinuities and small features.
13 This is in effect a definition of sparsity under a classical (‘frequentist’) interpretation of probability (relative frequency of occurrence over an ensemble of objects). Taking it literally, one could use an ℓ0, ‘counting’, norm, defined as the number of non-zero coefficients in the sequence, ‖c‖_0 = |{c_λ : c_λ ≠ 0}|, where | · | here denotes cardinality, in order to measure the sparsity of signals. However, the ℓ0 norm is unfortunately very sensitive to noise and its computational complexity is quite high, making it a less attractive option for a sparsity measure in practice. Therefore, other norms, with p > 0, will be used here.
14 The goal of simplicity leads to the striving for complexity control. That is, we invoke the principle of parsimony here: “simpler explanations are, other things being equal, generally better than more complex ones”. Furthermore, there is a natural connection between complexity of representation and entropy. This leads to the goal of efficiency via minimum entropy representations.
dard ‘signal plus noise’ model, decomposing data in an appropriate dictio-
nary will typically result in large coefficients modelling the signal and small
ones corresponding to noise. This is a reflection of the good approximation
properties of wavelets (Mallat, [121]).
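This large-coefficients/small-coefficients picture is what thresholding estimators exploit, and it underlies the iterative-thresholding approach of this chapter. A minimal sketch (illustrative; the coefficient sequence, noise level, and threshold value are ad hoc assumptions):

```python
import numpy as np

def soft_threshold(c, tau):
    """Soft-thresholding: shrink toward zero, zeroing coefficients below tau."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

rng = np.random.default_rng(4)
# Hypothetical coefficient sequence: a few large 'signal' coefficients
# on top of many small, noise-level ones.
clean = np.zeros(200)
clean[[3, 40, 77]] = [5.0, -4.0, 6.0]
noisy = clean + 0.1 * rng.normal(size=200)

denoised = soft_threshold(noisy, tau=0.4)
assert np.count_nonzero(denoised) <= 4                        # noise is zeroed out
assert (np.sign(denoised[[3, 40, 77]]) == [1, -1, 1]).all()   # signal survives
```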
Finally, there exists an important connection between wavelets and
smoothness characterization using Besov spaces. Besov spaces are smooth-
ness classes: a Besov space Bσp,q (R) is basically a space consisting of functions
with σ derivatives in Lp, with q providing some additional fine-tuning15 [45].
We say that an object belongs to a Besov space if it has a finite Besov norm.
Taken together,
B^σ_{p,q} = { f ∈ L_p(R) : ‖f‖_{B^σ_{p,q}} < ∞ },
where ‖·‖_{B^σ_{p,q}} is the Besov (semi-)norm16. The important fact for our purposes
is that wavelets form an unconditional basis for Besov spaces. This means
that the norm (“size”) of an object in Besov space can be equivalently com-
puted from a simple sequence norm (Chambolle et al., [31]), ‖f‖_{B^σ_{p,p}} ≍ ‖c‖_{s,p},
where
‖c‖_{s,p} = ( ∑_λ 2^{spj} |c_λ|^p )^{1/p} ,
where s = σ + 1/2− 1/p, j = j(λ) denotes the scale of the λ–th coefficient,
and ‘≍’ denotes norm equivalence. In other words, smoothness translates
to sparseness. This makes sense intuitively, because if there is a very high
degree of detail in a signal then we will need a large number of elements
from our dictionary of building blocks in order to accurately describe it. The
15 In particular, q < q′ ⇒ B^σ_{p,q} ⊂ B^σ_{p,q′}. Note also that the Hölder class is equivalent to C^σ = B^σ_{∞,∞} [54].
16 A convenient choice in practice, which we will adopt in this text, is to set q = p and drop this index from the equations.
above functional-analytic framework formalizes this concept.
The above norm is a special case of a weighted ℓp norm,
‖c‖_{ℓ_p,w} = ∑_λ w_λ |c_λ|^p ,
used in Daubechies et al. (2005) [45], with weights given by w_λ = 2^{spj}.
ℓp norms, in particular, have been widely used in applications for modelling
wavelet priors. More generally, a variety of relevant priors can be used under
a Bayesian framework (Moulin and Liu, [129]).
Based on this background knowledge, the main point of this subsection
can be stated as [45]:
Remark 9 If the objects to be reconstructed are expected to be mostly smooth,
with localized lower dimensional singularities, we can expect that their expan-
sions into wavelets will be sparse.
An example of such an object/signal is given in Fig. 3.4. We shall exploit the
property of wavelets to parsimoniously describe the above class of objects in
order to construct algorithms for sparse BSS.
Formulating the Sparse Objective
As discussed in the Introduction, without appropriate constraints blind
source separation is a very ill-posed problem. The key observation we want
to exploit and formalize here is that: (a) a wide class of signals are sparse
with respect to (i.e. can be sparsely represented in) appropriate dictionaries
[176] and (b) that mixing generates denser signals. An intuitive way to for-
mulate the optimization functional for sparse problems would then be one
that captures this fact explicitly. In particular, we want to model the notion
that the activities of the sources with respect to those signal dictionaries are
“rarely active”, i.e. non-zero [137]. Following Olshausen and Field [137], we
Figure 3.4: A signal exhibiting smooth areas and localized lower-dimensional singularities.
formulate the sparse problem in an energy minimization framework, instead
of working in a functional-analytic framework. Our model, however, assumes
sparse structure of the unknown sources in a dictionary, not (necessarily) of
the observations.
In order to formulate the sparse objective one may now consider di-
rectly using Besov priors in the wavelet domain (Leporini and Pesquet, [108];
Golden, [77]). The separation problem can then be formulated as one searching
for pairs of variables {(s_l, a_l)}_{l=1}^{L} minimizing the functional
E(S, A) = ‖X − AS‖²₂ + λ_S ∑_l ‖s_l‖_{B^σ_{p,p}} + λ_A ∑_l ‖a_l‖²₂ ,   (3.7)
where X̂ ≡ AS denotes the model’s reconstruction of the observations and
0 < λ_S, λ_A < ∞ are Lagrange multipliers, at our disposal (see below).
For spatio-temporal data, such as fMRI, we seek spatio-temporal compo-
nents, i.e. pairs of time-courses and spatial maps.
The three terms appearing in Eq. (3.7) can be explained as follows. The
first term in the above objective forces the solution to be such that X̂, the
image of a value from the model, S, mapped to the observation space by
A, is as close as possible to the observations, X, in an ℓ2 sense (e.g. in the
matrix 2–norm or the Frobenius norm, in a matrix formulation). Therefore,
this is a ‘fidelity’ term, which minimizes the residuals by forcing the columns
of the mixing matrix to span the input space, ensuring that “information is
preserved” [137]. It is the equivalent of the empirical χ² statistic for the
observations [116],
χ² = (X − AS)ᵀ Σ⁻¹ (X − AS) ,   (3.8)
where Σ is the data covariance.
The second and third terms are penalties on S and A, respectively. The
second term is a penalty on the “size” of the sources, sl, with respect to
the Besov norm Bσp,p. In doing this, we restrict the solution to belong to
certain classes of smoothness spaces, viz. Besov spaces. By writing the
wavelet expansion, s = ∑_{k∈Z} c_{−1,k} φ(t − k) + ∑_{j≥0} ∑_{k∈Z} c_{j,k} 2^{j/2} ψ(2^j t − k), and
using the equivalence between the norms ‖s‖_p + ‖s‖_{B^σ_{p,q}} and ‖c_{−1}‖_p + ‖c‖_{b^s_{p,q}},
in the Besov and wavelet-coefficient domains, respectively (see subsection 3.2.2),
this turns out to be a penalty on the coefficients of the decomposition of (the
source-part of) the solution with respect to a dictionary of functions17. That
is, sparsity is defined on an appropriate representation of the sources. The
second term is therefore a penalty that enforces the sparsity constraint.
To provide a little more flexibility than the above prior, we can use the
more general weighted ℓp norm proposed in Daubechies et al. [45],
‖c_l‖_{p,w} = ∑_λ w_{l,λ} |c_{l,λ}|^p , w_{l,λ} ∈ R, with 1 ≤ p < 2 .   (3.9)
An important special case is obtained if we let the exponent, p, be equal to
the smallest value in its range, i.e. let p be equal to 1. This gives the ℓ1
17 In a Bayesian formulation one can equivalently impose the prior p(c|β) ∝ β^{(2J−1)/α} exp(−β‖c‖^α_{b^s_{p,q}}), β > 0, α > 0, on the expansion coefficients, where J is the number of decomposition levels [108].
Figure 3.5: A family of ℓp norms for various values of p. The unit circles (‘balls’) of each norm are drawn here. These are curves of the form |x/a|^p + |y/b|^p = 1, a, b > 0, defining ‘superellipses’ in the rectangle −a ≤ x ≤ +a × −b ≤ y ≤ +b (Lamé curves); here a = b = 1. Curves for ℓ∞, ℓ4, ℓ2, ℓ1, ℓ1/2, ℓ1/4, and ℓ0 are shown. The ℓ1 norm “strikes a middle ground”, by being the sparsest one that is also convex.
prior, ‖c_l‖_1 = w_l ∑_λ |c_{l,λ}|, w_{l,λ} = w_l, which is related to the Besov norm
B^σ_{1,1}. This choice has the advantage of being the sparsest prior in this family
that maintains the convexity property for the penalty: see Fig. 3.5. The ℓ1
prior has also proven to be robust to noise [111].
We enforce the weighted ℓp sparsity prior on cl in combination with the
structural constraint on the sources
s_l = ∑_λ c_{l,λ} ϕ_λ , l = 1, …, L ,   (3.10)
where c_{l,λ} = 〈s_l, ϕ_λ〉 is the inner product of the source s_l with the dictionary
element ϕλ; these can be thought of as measures of “similarity” of these two
functions. In other words, we assume that the sources are synthesized as
linear combinations of our building blocks of choice, the wavelets ϕλ. We
essentially “project” the unknown signals into a Besov smoothness space (Choi
and Baraniuk, [35]).
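The structural constraint (3.10), and the recovery of the coefficients as inner products, can be sketched as follows (illustrative; a random orthonormal matrix stands in for an actual wavelet dictionary, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N, L = 64, 2
# Stand-in orthonormal dictionary: columns of Phi play the role of phi_lambda.
Phi, _ = np.linalg.qr(rng.normal(size=(N, N)))

# Sparse coefficient matrix C: only a few active atoms per source.
C = np.zeros((L, N))
for l in range(L):
    active = rng.choice(N, size=4, replace=False)
    C[l, active] = rng.normal(size=4)

S = C @ Phi.T                  # Eq. (3.10): s_l = sum_lambda c_{l,lambda} phi_lambda
coeffs = S @ Phi               # analysis: c_{l,lambda} = <s_l, phi_lambda>
assert np.allclose(coeffs, C)  # synthesis and analysis are consistent
```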
Finally, the third term is a “growth-limiting” penalty (Olshausen and
Field, [138]), which ensures that we avoid the unwanted effect of the mixing
vectors, al, growing without bound. Note that without this term we could
get a solution that maximizes the “likelihood” (fitting) of the observations
under the model18 without being a physically plausible one. Li et al. [111]
use an analogous constraint, namely that the vectors al be unit vectors:
their ℓ2 norms are rescaled to 1, with the sources rescaled accordingly.
Olshausen and Field make a similar adjustment in [138], rescaling the ℓ2
norms so that the output variance of each coefficient is held at a desired,
pre-fixed level.
Using the above, our optimization functional E can finally be written as:

    EΦ(C, A) = ‖X − A(ΦCᵀ)ᵀ‖²₂ + λC Σl Σλ wl,λ |cl,λ|^p + λA Σl ‖al‖²₂ ,
    S = (ΦCᵀ)ᵀ ,                                                        (3.11)
with 1 ≤ p < 2, where the linear operator Φ : cl ↦ sl represents the basis19
{ϕλ}. Putting the sparsity constraint on the activities of the sources leads
to a sparse representation of those sources in Φ, as will be demonstrated
later. To gain a little more intuition about the method, the optimization
functional can be viewed as a hierarchical "feedforward" network; this will
be shown in section 3.3.2, after the optimization algorithm is stated.
Remark 10 The energy function EΦ(C, A) prefers bases that allow sparse
representations. To see this, consider the penalty term on the activities
cλ, with p < 2 as above. For a fixed level of activity 'energy' (variance),
‖c‖²ℓ2 = Σλ c²λ (see Fig. 3.6), the value of the penalty is minimized if the
vector of coefficients has the fewest possible number of non-zero components.
18 The precise definition of the term 'likelihood' in the probabilistic context will be given in section 4.3.2.
19 Note that this multiplicative formulation is given only to provide the formal specification of the problem. In practice, the multiplication by Φ is never performed explicitly; wavelet analysis/synthesis is efficiently implemented by filter banks [121].
Figure 3.6: Three distributions of wavelet coefficients (Laplace, Gaussian, and uniform) with equal 'variance' (spread) but different 'kurtosis' (peakedness). We can also think of these curves as graphs representing the weights put on each coefficient according to its magnitude, or, conversely, as penalties. The highly-peaked one (solid curve) corresponds to a sparsity-inducing ℓ1 prior, as it assigns a much higher weight to small coefficients.
This can be also seen from Fig. 3.7, where the ℓ1 and ℓ2 ‘balls’ are plotted.
The amount of penalization is determined by the Lagrange multipliers λC
and λA, balancing these three different “forces” (the three terms of Eq. 3.11).
If these coefficients are large, then the last two terms will be small at the
minimizer, resulting in variables S and A that are smoother/smaller. On
the contrary, if these are small the algorithm will try to fit the data more
closely, at the expense of more "noise". Thus the parameters λC and λA offer
wide flexibility, but at the same time raise the issue of how best to set or
estimate them. In this thesis, we propose a method that empirically estimates
quantities related to λC and λA that are important for the operation of the
method; this will be made clearer in the sequel.
More generally, if SM : S → R is an appropriate sparsity function defined
on the space of solutions, S, penalizing non-sparse activities with respect to a
signal-generating “machine”,M, capturing our prior expectations about the
generative mechanism of the sought-for signals, we can give a more general
definition for SCA. Formally, and equivalently with definition 1 (given in
Figure 3.7: How the ℓ1 penalty leads to sparse representations, compared to an ℓ2 penalty, for a two-parameter problem. The problem is to find the optimal vector of coefficients in the (c1, c2)-space under the minimum ℓ2 and ℓ1 norms, represented by their corresponding balls, such that the linear constraint y = Ac is satisfied. The optimal solution under the ℓ1 norm, cs = (0, cs2), within the affine space of solutions of the linear system y = Ac, is sparse, since it zeroes one of the two components of the c-vector; the ℓ2 solution, cd = (cd1, cd2), is not. See also Fig. 2.7 and its corresponding explanation in the text.
subsection 3.2.2), we can now state the goal of SCA as:
Definition 2 (Generalized Dictionary-based Sparse Component Analysis)
Find the expansion coefficients, C, of the sources with respect to a machine,
M, and the mixing matrix, A, as the minimizers of the regularized energy
(objective) functional E ,
    (C⋆, A⋆) = argmin E ,  with  E = ‖X − AS‖² + λS SM(S) + λA ‖A‖² ,        (3.12)

such that C is as sparse as possible under the given sparsity function SM(·).
For the model discussed in this chapter, we just define SM(S) to be the
weighted ℓp norm on the wavelet coefficients of the sources, S ∈ S, as before20.
Optimization
The SCA functional, E , must be optimized with respect to (C,A) in order to
obtain an estimate of these values. In principle, one can now perform “direct”
optimization of this functional. Unfortunately, while the above formulation
provides a straightforward definition of the problem, as shown by Daubechies
et al. in [45], the variational equations derived directly from this functional,
expressed in the ϕλ basis,

    ∀l, ∀λ ∈ IΦ :  (AᵀA sl − Aᵀx)λ + (1/2) λC wλ p |cl,λ|^{p−1} sgn(cl,λ) = 0 ,        (3.13)

where (·)λ denotes the operation of picking the λ-th element of the wavelet
transform of the argument in the parentheses, cl,λ is the inner product ⟨sl, ϕλ⟩,
as before, and IΦ denotes the index-set of Φ, result in a coupled system of
nonlinear equations which is non-trivial to solve analytically21. Zibulevsky
and Pearlmutter [176] use a conjugate-gradient based approach to perform
optimization of Eq. (3.11); conjugate gradients is a classical "workhorse"
method for inverse problems with a large number of unknowns. However, as
discussed in their paper, gradient-based approaches result in slow learning22.
To circumvent these problems, we adapt a variational methodology for inverse
problems under sparsity constraints, recently developed by Daubechies,
Defrise and De Mol [45], to blind inverse problems, i.e. to problems where
the observation operator is unknown. The method, utilizing ‘surrogate’ func-
20 In Chapter 4 we show how the regularized energy functional E can be equivalently interpreted in the probabilistic framework, allowing various extensions.
21 By this we mean even the original (non-blind) regularization problem with ℓ1, or more generally ℓp, 1 ≤ p < 2, priors. The expression AᵀA sl couples all the equations together, and, since the equations are nonlinear, this makes the system difficult to work with.
22 Zibulevsky et al. propose alternative algorithms as well [177], one of which, clustering followed by separation, will be discussed in the applications.
tionals and leading to an ‘iterative thresholding’ algorithm, will be described
in more detail in section 3.3.1.
Remark 11 Iterative thresholding is essentially an algorithmic way to im-
pose and compute the sparseness constraint on the solution of a linear inverse
problem, which would otherwise be computationally much more difficult.
Remark 12 We note that blind inverse problems, such as blind separation
and sparse representation with a learned basis, lead to bilinear problems in
(S,A), and therefore the optimization functional E is not convex in general.
Nevertheless, the regularization idea can still be used.
Our solution will lead to an iterative algorithm that alternates between esti-
mating the sources and learning the mixing matrix:
• Sources, S: These are inferred using the Iterative Thresholding algo-
rithm, presented next, enforcing the sparse prior.
• Mixing matrix, A: This is learned using the corresponding estimation
  equations, obtained by solving ∂E(S, A)/∂A = 0.
Alternatively, one can use a two-stage, clustering–separation approach,
as in Zibulevsky et al. [177] and Li et al. [111]. We have experimented
with a simple spherical k-means algorithm [119], with encouraging
results23; these will be presented in section 4.6.
The next section will elaborate on the IT approach.
23Zibulevsky et al. use the fuzzy C-means algorithm.
3.3 Sparse Component Analysis by Iterative
Thresholding
3.3.1 Iterative Thresholding
Recently, Daubechies, Defrise, and De Mol [45], ‘DDD’ hereafter, introduced
an algorithm for linear inversion problems with sparsity constraints, the iter-
ative thresholding (IT) algorithm. The IT algorithm of DDD finds a unique
minimizer for functionals of the form
    I(s) = ‖As − x‖² + Σ_{λ∈IC} wλ |cλ|^p ,   where cλ = ⟨s, ϕλ⟩ ,        (3.14)
when the linear operator from the signals of interest to the data (i.e. the
observation operator), A, is known and the value of the exponent, p, of the
ℓp penalty is restricted to be greater than or equal to 1 (and less than 2).
We denote the index-set of the coefficients cλ by IC := {λ : cλ ∈ C}. This
notation allows for compact expression of structure in the set {cλ} by defining
appropriate predicates for the condition defining the set C.

Before we describe the final sparse component analysis algorithm, we first
review iterative thresholding. Our emphasis will be on a pictorial view
of the structure of the algorithm, rather than a rigorous functional-analytic
derivation, which can be found in [45].
To solve the optimization problem of Eq. (3.14), Daubechies et al. developed
a variational methodology in the deterministic framework. Their
approach to bypassing the coupling observed in the variational equations24
is to use surrogate functionals. In particular, using a technique from 'opti-
mization transfer' (see e.g. [85]), the optimization functional I is replaced
24 The variational equations are coupled because the observation operator is not diagonalizable in the ϕλ basis in general.
by a sequence of simpler surrogate functionals, {I(i)}. The algorithm then
makes use of three signal spaces (Fig. 3.9): the observation (data) space,
X, the solution space, S, and the wavelet space25, {ϕλ}, represented by the
corresponding wavelet expansion coefficients, C = {cλ}.
Remark 13 By making special assumptions/choices about the structure of
the solution space, S, such as it being a Besov smoothness space, one can
obtain/enforce the required properties for the solution, s, such as smoothness,
localization, and sparsity with respect to the dictionary, Φ.
The original functional is replaced by surrogate functionals, constructed
by adding to it a special term, ‖s − z‖² − ‖As − Az‖², chosen such that
certain terms cancel out26:

    I(s) ← I(s, z) = I(s) + ‖s − z‖² − ‖As − Az‖² ,        (3.15)
where z ∈ S is a "guess" value for the solution27. Provided that the operator
A is bounded, one can find a constant, C, such that ‖AᵀA‖ < C, so that
C IL − AᵀA is strictly positive28 and the added term is strictly convex in s
for any z. Then29 the surrogate I(s, z) is also strictly convex in s and has a
unique minimizer for any z; see Fig. 2.7. After some manipulation, the
surrogate functionals can

25 In fact, any suitable orthonormal basis can be used.
26 In particular, the terms in AᵀAs.
27 In theory, the IT algorithm works with any value for z. However, for faster convergence, it is better to use a "guess" value that is "close" to the solution, s⋆. In the absence of any meaningful (informed) first estimate, one can use, e.g., z = 0. For the subsequent iterations, z is set to the optimizer of the surrogate functional at the previous iteration.
28 If ‖A‖ < 1, we can set C = 1, without loss of generality.
29 In particular, the following theorem holds [174]:
Theorem 2 Let E(x) be an energy function with bounded Hessian ∂²E(x)/∂x∂x. Then we can always decompose it into the sum of a convex function and a concave function.
be written as

    I(s, z) = Σλ [ s²λ − 2 sλ (z + Aᵀ(x − Az))λ + wλ |sλ|^p ] + const. ,        (3.16)

where (·)λ and sλ are defined as in subsection 3.2.2.
Functional minimization then leads to an iterative thresholding algorithm
via a nonlinear Landweber-type iteration [101]. The nonlinearity is obtained
from the sparsity-promoting, non-quadratic penalty and results in a thresh-
olding operator on the values of the (wavelet) expansion coefficients of the
estimated signals in the orthonormal dictionary of our choice. It can be shown
(see30 [45]) that the minimizer of the original functional, I(s), is found by
the iteration s(i) 7→ s(i+1) given by
    s(i+1) = Σλ Sτ( ((IL − AᵀA) s(i) + Aᵀx)λ ) ϕλ ,        (3.17)

where the quantity being thresholded is the intermediate coefficient
c̄λ = ((IL − AᵀA) s(i) + Aᵀx)λ, and Sτ : c̄λ ↦ cλ is a thresholding operator
with threshold τ. This expression is obtained by functional minimization of
the surrogate functionals. For
the case of an ℓ1 prior, Sτ becomes the soft-thresholding function (Fig. 3.8),
    Sτ(c) =   c + τ ,   if c ≤ −τ ,
              0 ,       if |c| < τ ,
              c − τ ,   if c ≥ τ .        (3.18)
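A minimal implementation of the soft-thresholding operator of Eq. (3.18) (a sketch in NumPy; not the thesis code):

```python
import numpy as np

def soft_threshold(c, tau):
    """Soft-thresholding operator S_tau of Eq. (3.18), applied component-wise:
    coefficients in [-tau, tau] are set to zero; the rest are shrunk towards
    zero by tau."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)
```

This is the compact form c = sgn(c̄)(|c̄| − τ)·1{|c̄|>τ} that appears in the caption of Fig. 3.8.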
Remark 14 Note how the replacement of the functional I(s) with the func-
30 Note that we use a different notation from that of Daubechies et al. (2004) [45] here, in order to conform with the BSS nomenclature. The correspondence is: s ↔ f, x ↔ g, A ↔ K.
Figure 3.8: Soft-thresholding function, Sτ : c̄ ↦ c, with threshold τ, corresponding to ℓ1 penalization. Sτ has a simple geometric interpretation: "small" input coefficients, c̄λ, in the interval [−τ, τ], are set to zero, while all others are shifted by τ, according to the equation c = sgn(c̄)(|c̄| − τ)·1{|c̄|>τ}, where 1{|c̄|>τ} is the indicator function on the interval |c̄| > τ. Since, for a wide variety of natural signals, analysis in appropriate wavelet-type dictionaries results in a large proportion of small expansion coefficients, this process sparsifies their representation in wavelets even more.
tionals I(s, z) causes the estimating equation (3.17) to decouple as a sum
over individual λ. Also note that, using the iterative scheme, the IT algo-
rithm of DDD avoids inversion of the operator A. This is important for both
computational and numerical reasons.
The iterative thresholding algorithm of DDD can be broken down into
the algorithmic steps shown in Alg. 1. Their schematic representation can
be seen in Fig. 3.9.
The steps (2)–(6) of the above procedure can be seen as a single 'thresh-
olded Landweber' step that maps s(i) ↦ s(i+1) via the compound operator

    FIT = (Φ ∘ Sτ ∘ Φ⁻¹) ∘ L        (3.19)
(where '∘' denotes operator composition). Note that this operation is non-
linear, due to the nonlinearity of the thresholding operation [45], [31]. Equa-
Algorithm 1 Iterative Thresholding for Linear Inverse Problems (Daubechies, Defrise, De Mol)

0. Let the observed data be denoted by x ∈ X. Let i = 0 and pick an initial value, s(0) ∈ S, for the unknown signal s, where the upper index in parentheses denotes the iteration step. Then iterate the following steps:

1. The current estimate of the signal, s(i), is mapped into the observation space via the 'forward' linear operator, A, as x(i) = A s(i).

2. This leads to a residual ∆x(i) = x − x(i), which will be used as a "correction signal" to be fed back into the algorithm in order to adjust the current estimate of the sources.

3. Compute an intermediate value, s̄(i), via a Landweber-type operation

       L : s(i) ↦ s̄(i) = s(i) + ∆s(i) ,  where  ∆s(i) = Aᵀ∆x(i) = Aᵀ(x − A s(i)) .

   That is, the residual information is mapped back to the latent space.

4. This intermediate value is then decomposed in a basis {ϕλ} (e.g. a wavelet basis) via the corresponding wavelet analysis operator31, Φ⁻¹, as {c̄(i)λ}λ∈IC ∈ C, where it gets penalized for non-sparsity (because of the second term of the functional I) and soft-thresholded32 by the operator Sτ(·), applied component-wise (see Fig. 3.8). This results in a new wavelet sequence, {c(i)λ}λ∈IC:

       ∀λ : c(i)λ = Sτ( c̄(i)λ ) ,  where  c̄(i)λ = (Φ⁻¹ s̄(i))λ ,

   for all λ = 1, . . . , Λ = |IC|.

5. Finally, c(i) is mapped back to the unknown-signal space via the wavelet synthesis operator, Φ, giving

       s(i+1) = Φ c(i) .

   This is our new (current) estimate of the unknown signal.

6. The algorithm is then repeated from step (2) until convergence, or until a pre-defined maximum number of iterations is reached.
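The loop of Alg. 1 can be sketched as follows (a simplified illustration, not the thesis implementation: the operator A is assumed known with ‖A‖ < 1, the dictionary is stored as an explicit orthonormal matrix `Phi` so that Φ⁻¹ = Φᵀ, and the penalty is ℓ1, giving soft thresholding):

```python
import numpy as np

def iterative_thresholding(A, x, Phi, tau, n_iter=200):
    """Sketch of Alg. 1.
    A: known D x N observation operator with ||A|| < 1 (non-expansive).
    Phi: N x N orthonormal dictionary (columns phi_lambda), so Phi^{-1} = Phi^T.
    tau: threshold of the soft-thresholding operator S_tau (l1 penalty)."""
    s = np.zeros(A.shape[1])                       # step 0: initial estimate s^(0)
    for _ in range(n_iter):
        dx = x - A @ s                             # steps 1-2: forward map and residual
        s_bar = s + A.T @ dx                       # step 3: Landweber-type update
        c_bar = Phi.T @ s_bar                      # step 4: wavelet analysis
        c = np.sign(c_bar) * np.maximum(np.abs(c_bar) - tau, 0.0)  # soft threshold
        s = Phi @ c                                # step 5: wavelet synthesis
    return s
```

Note that the operator A is never inverted, in line with Remark 14.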
tion (3.17) then implements the operator FIT.
Daubechies, Defrise and De Mol prove that if A is non-expansive, which
we ensure here by restricting its norm in an appropriate way (shown next),
the iterative thresholding algorithm converges to the minimizer, s⋆, of the
original functional, I.
Interpretation as a Majorize-Minimize method. The step-by-step
analysis above was given for a better understanding of the structure of the
algorithm; the actual implementation may differ. It highlights, however, the
connection with the generative model that will be proposed in the next Chap-
ter. Indeed, the IT algorithm can be interpreted as a ‘majorize-minimize’
(MM) type algorithm (Hunter and Lange, [85]), an algorithm closely re-
lated to the expectation-maximization (EM) algorithm. Each step involves
the construction of a surrogate functional, I(s, z(i)), using principles such as
convexity and inequalities, that 'majorizes' an often complicated objective
function. This surrogate is then minimized in order to find a new point in
the domain of the objective, which in turn leads to the construction of a new
surrogate, etc. Our method of Ch. 4 derives a “soft” version of the iterative
thresholding algorithm of DDD from probabilistic principles by deriving a
variational EM–type algorithm. This enables its generalization such that crit-
ical parameters, such as the threshold value (see next), are learned/estimated
from the data.
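As an illustration only (the function names here are mine, not the thesis's), the MM scheme can be sketched generically: each iteration builds a surrogate that majorizes the objective at the current point and then minimizes it.

```python
def majorize_minimize(surrogate_minimizer, z0, n_iter=50):
    """Generic majorize-minimize skeleton. `surrogate_minimizer(z)` is assumed
    to return the minimizer of a surrogate that majorizes the objective and
    touches it at z; iterating drives the objective monotonically downwards."""
    z = z0
    for _ in range(n_iter):
        z = surrogate_minimizer(z)
    return z

# Toy instance: minimize f(x) = (x - 3)^2 using the surrogate
# g(x; z) = (x - 3)^2 + (x - z)^2 >= f(x), with equality at x = z;
# its minimizer is argmin_x g = (3 + z) / 2, so iterates converge to 3.
x_star = majorize_minimize(lambda z: (3.0 + z) / 2.0, z0=0.0)
```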
3.3.2 The SCA-IT Algorithm
The Sparse Component Analysis by Iterative Thresholding algorithm alter-
nates between reconstructing the unknown signals under the sparsity con-
straint, imposed iteratively by the IT algorithm, and learning the observation
(mixing) operator by solving the equation ∂E(S, A)/∂A = 0. The algorithm
Figure 3.9: An illustration of the Iterative Thresholding (IT) algorithm as a double mapping operation among signal spaces: the observation space, X, the solution space, S, and the wavelet space, C. The observed data is denoted by x ∈ X and the sought-for signal by s ∈ S. The numbered squares in the figure correspond to the algorithmic steps of Alg. 1. Starting from an initial value s(0), the IT algorithm performs a sequence of iterations s(i) ↦ s(i+1). The 'forward' mapping from the solution (unknown-signal) space, S, to the observation space, X, is represented by the linear operator A, and that from S to the wavelet (coefficient) space, C, i.e. the wavelet analysis, by Φ⁻¹. The iterative algorithm works by employing a Landweber-type operator, L : s ↦ s̄ (L is the composition of operations (1), (2), and (3)), acting on the current signal estimate, s(i) ∈ S, and adjusting for data misfit, in combination with a thresholding (shrinkage) operator, Sτ : c̄ ↦ c, acting on the coefficients of the wavelet expansion of the "intermediate" estimate s̄(i), denoted by c̄(i) = (c̄(i)λ). This latter operation embodies the sparsity prior. The optimization algorithm tries to minimize the residual, ‖∆x(i)‖ = ‖x − x(i)‖, where x(i) is a value generated from the model, while at the same time trying to find the sparsest signal s with respect to the basis D = {ϕλ} under an ℓp norm.
can be summarized as shown in Alg. 2.

Algorithm 2 Iterative Thresholding for Sparse Component Analysis (SCA-IT)

    Let i ← 0
    Initial estimates of S, A: let S ← 0, A ← I
    repeat
        Re-estimate the sources sl using the Iterative Thresholding algorithm (Alg. 1).
        Re-estimate the mixing matrix by A = (X Sᵀ) · (λA IL + S Sᵀ)⁻¹.
        Rescale A s.t. ‖al‖2 = 1, ∀l.
        if (∆E < ∆Etol or ∆A < ∆Atol) then
            return (S, A)
        end if
        Let i ← i + 1
    until i > maxniters

A crucial problem is the exact setting of the threshold, τ. An approach to
solving this problem is presented in Sect. 3.5.
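The closed-form A-update of Alg. 2 can be sketched as follows (a sketch under my own conventions, not the thesis code: X is D×N, S is L×N, and λA is the weight-decay parameter):

```python
import numpy as np

def update_mixing_matrix(X, S, lam_A=0.0):
    """Re-estimation step of Alg. 2: A = (X S^T)(lam_A I_L + S S^T)^{-1},
    followed by rescaling each mixing vector (column) a_l to unit l2 norm."""
    L = S.shape[0]
    A = (X @ S.T) @ np.linalg.inv(lam_A * np.eye(L) + S @ S.T)
    A /= np.linalg.norm(A, axis=0, keepdims=True)   # enforce ||a_l||_2 = 1
    return A
```

With λA = 0 and the true sources known, this reduces to the least-squares estimate of A, so the true (unit-norm) mixing matrix is recovered exactly.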
Hierarchical network representation
The model for SCA presented in subsection 3.2.2 can be also viewed as a
three-layer hierarchical “feedforward” network, according to a ‘signal flow’
representation. This is shown in Fig. 3.10. The three layers of the model
are the wavelet, source, and observation layers, respectively. The middle
and bottom layers taken as a pair, (sl, xd), interact via the connectionsAdl
, with the penalty term on the mixing matrix, A, modulating the
value of those connections in a ‘weight-decay’ fashion, with total connection
strength λA
∑
l
∑
dAdl2 (cf. the penalty term on A of Eq. (3.11)). The pair
of top and middle layers, (ϕλ, sl), interact via the connectionscl,λ,
with the penalty of the second term of Eq. (3.11), λC
∑
l
∑
λwl,λ|cl,λ|p, pe-
nalizing for non-sparsity, thus driving their values towards zero. All of the
above quantities also interact via the first, data-dependent term of Eq. (3.11),
Figure 3.10: Sparse Component Analysis model as a layered "feedforward" network. This graph emphasizes the generative relationships among the different signals in the model. In this representation, nodes represent signals (including noise "signals") and links represent functional dependence relations between signals. This network corresponds to the set of Eqns (3.8)–(3.11), along with the full model specification of that section. In particular, the nodes ϕ1, . . . , ϕΛ are dictionary elements, s1, . . . , sL are the unobserved sources, and x1, . . . , xD are the observations. Each set of connections in the network is annotated with its respective set of parameters, as shown. Observation noise EX with covariance Σ is assumed to be added at the sensors. The embedded figure shows the thresholding operation resulting from the ℓ1 prior on the wavelet coefficients of the sources, cl,λ. The network and the corresponding learning algorithm are designed such that the "activities" cl,λ are minimized under a sparse prior, leading to a sparse representation of the sources, {sl}Ll=1, in the dictionary {ϕλ}Λλ=1.
Σd Σn [ xd,n − Σl Adl Σλ cl,λ ϕλ,n ]² , which tries to fit the data as closely
as possible, given the constraints of the other two terms.
Effect of the sparse prior on the network. While the links between
the Φ and S layers in the network of Fig. 3.10 appear to suggest that the con-
nectivity pattern is full33, in reality, due to the sparse source model and the
33 The connectivity pattern between the Φ and S layers of the SCA network is captured by the matrix M^{Φ,S} = [M^{Φ,S}_{l,λ}], where M^{Φ,S}_{l,λ} ∈ {0, 1}, indexed by the index-set of the wavelet coefficient matrix C, IC = {1, . . . , L} × {1, . . . , Λ}. A full connectivity pattern means M^{Φ,S}_{l,λ} = 1, ∀(l, λ). This network representation is especially intuitive in applications such as sparse coding. Note that in this case the 'codes' are the al and the 'responses' are the sl,n.
resulting thresholding function, most bases will be effectively “switched off”
after learning, leading to very sparse connectivities. This can be seen by
observing the geometry of the soft-thresholding function of Fig. 3.8 in com-
bination with a typical histogram of cλ values, such as the one shown in
Fig. 4.5: a threshold τ will set to zero all values −τ < cλ < τ . The specifi-
cation of the model is such that the ‘activities’, cl,λ, are minimized, leading
to a sparse representation of the sources in D. This can be seen as an
application of the principle of parsimony.
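The "switching-off" effect can be illustrated numerically (a sketch; Laplacian-distributed coefficients are assumed here as a stand-in for the heavy-tailed histogram of Fig. 4.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_switched_off(c, tau):
    """Fraction of coefficients with |c_lambda| < tau, i.e. zeroed by the threshold."""
    return float(np.mean(np.abs(c) < tau))

# Heavy-tailed (Laplacian) coefficients: a threshold of one scale unit
# already zeroes about 1 - exp(-1) ~ 63% of them.
c = rng.laplace(loc=0.0, scale=1.0, size=100_000)
```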
We also note that, in common with other sparse learning models (see e.g.
Olshausen and Field [138], p. 3323), although the network generative equa-
tions, Eq. (3.11), are linear, the learning algorithm, which results from the
sparsity constraint, is nonlinear, leading to non-trivial interactions among
the bases. This is especially true for overcomplete (non-critically sampled)
bases34.
3.4 Experimental Results
The experimental results presented in this section are divided into three
classes. The first concerns the separation of natural images, for both arti-
ficial and natural mixing conditions. The second class considers the blind
separation of more sources than mixtures. Finally, the third deals with the
extraction of weak biosignals in noise. In particular, the decomposition of
spatio-temporal fMRI datasets is studied in these experiments. In the first
part, simulated activations drawn from two different source distributions and
two overlapping conditions are separated. In the second part, experiments
on real fMRI data are used to compare SCA-IT with ICA.
34 While the wavelet transform is known to largely decorrelate signals, some higher-order interactions remain.
3.4.1 Natural Image Separation
We apply the SCA-IT algorithm on two natural image datasets: the first is a
mixture of known images, the ‘Cows and Butterfly’ dataset35, and the second
is the dataset used by Adelson and Farid in [61], assessing the ability of blind
source separation algorithms in separating reflections from images. In both
datasets, we have a set of L = 2 unknown images to be estimated from a
set of D = 2 observed mixtures of them. We note that the second dataset
has proven to be a rather difficult task for many ICA algorithms. The usual
assumption of decomposing along one-dimensional independent signal spaces
has proven to be of marginal value here. The results, along with explanations,
are shown in Figs 3.11 to 3.23. Some comments on running SCA/IT on the
datasets are given next.
• For the 'Cows & Butterfly' dataset, we use additive noise generated
  as εxd = 0.1 std(xd) ε1, where ε1 ∼ N (0, 1), for the d-th sensor
  signal.
• In some cases where the operator A was known, a near-perfect recon-
struction was possible even in relatively high noise regimes. In this
case the task is to reconstruct the unknown signals and filter the noise.
Note that this is still an inverse problem, though not blind.
• The evolution of the solution was monitored by evaluating the objective
  function, J, at each iteration step, i; iterations were stopped when the
  change in J fell below a fixed limit (0.001).
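The stopping rule from the last point can be sketched as a generic driver (hypothetical names, not thesis code):

```python
def run_until_converged(step, objective, state, tol=1e-3, max_iters=1000):
    """Iterate `step` and stop when the change in the objective J between
    successive iterations falls below tol (0.001 in the experiments)."""
    J_prev = objective(state)
    for _ in range(max_iters):
        state = step(state)
        J_curr = objective(state)
        if abs(J_prev - J_curr) < tol:
            break
        J_prev = J_curr
    return state

# Toy usage: halving steps on J(x) = x^2 stop once J changes by less than 0.001.
x_final = run_until_converged(lambda x: 0.5 * x, lambda x: x * x, 1.0)
```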
As shown next, good separation was obtained from both datasets.
35These images can be found at http://www.cis.hut.fi/projects/ica/data/images/.
Artificial Mixtures
This experiment is used here mainly to demonstrate the robustness
of the SCA-IT algorithm to sources that are less sparse, and its potential
applicability to natural images. This can be seen by observing e.g. the
histograms of Fig. 3.18, which are far from being 'spiky', and the shape of
the scatterplot of Fig. 3.17.
We compare the SCA-IT algorithm with the noiseless IFA algorithm with
MoG source priors [6] using M = 4 Gaussian components36. We note that the
IFA algorithm should be suitable for this kind of task, since the multimodal
distributions exhibited in natural images can be modelled particularly well
with MoGs, as was shown in Fig. 2.11. (This fact was also exploited in [38],
using a mean-field Bayesian version of IFA.) We also note that Li et al. [111]
also use a sparse decomposition algorithm for separating images; however,
they only operate on edge-detected images, which, by their nature, are sparse.
Therefore, that task is (at least in theory) easier for a sparse algorithm.
The SCA-IT algorithm is parameterized by the threshold, τ. For this
experiment, τ was varied from τmin to τmax, and the optimal threshold, τ⋆,
was selected using a subjective, visual criterion, namely that of balancing
smoothness against crispness of the resulting images. A value of τ = 0.05
was found to give the best results.
Figure 3.11 shows the (unseen) latent images. These images were mixed
into two "observations" using the mixing matrix

    M = | 0.70  0.30 |
        | 0.55  0.45 | .
Figure 3.12 shows the IFA solution along with its scatterplot (Fig. 3.13)
and the estimated PDFs (Fig. 3.14). It can be seen that while the source
images are recovered, there is still some presence of noise.
36The noiseless version of IFA was selected as a competing algorithm because in thistask we want to evaluate the source modelling capability of SCA-IT. Noiseless IFA allowedus to effectively focus on that aspect only.
Figure 3.11: 'Cows & Butterfly' dataset: original sources, downscaled to 64 × 64 pixels.
Figures 3.15 to 3.18 show the solution from SCA-IT. In particular, Fig. 3.15
contains the estimated latent images after 750 iterations of the SCA/IT al-
gorithm. Fig. 3.16 shows their comparison with the known (in this case)
sources, in the form of scatterplots of the corresponding pixel intensities,
ŝ1 vs. s1 and ŝ2 vs. s2. In Fig. 3.17 we see the latent source space, i.e. the
scatterplot ŝ1 vs. ŝ2. Finally, Fig. 3.18 shows the empirical histograms of
the SCA/IT solution. The correlation coefficients between the original and
reconstructed sources were r1 = 0.996 and r2 = 0.998, showing a near-perfect
reconstruction.
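The reconstruction-quality score used above (the Pearson correlation between an original and a reconstructed source) can be computed as follows (a sketch; NumPy assumed):

```python
import numpy as np

def separation_quality(s_true, s_hat):
    """Pearson correlation coefficient between a true source and its
    reconstruction (the r1, r2 scores reported in the text)."""
    return float(np.corrcoef(s_true.ravel(), s_hat.ravel())[0, 1])
```

A score of 1 indicates perfect reconstruction up to an affine (scale and offset) ambiguity, which is inherent to BSS.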
Separating Reflections in Images
A demonstration of the SCA-IT algorithm for the separation of reflections
in an image is shown next. This is the ‘Terrace’ dataset, originally used by
Figure 3.12: IFA results: separated images of (a) 'Cows' and (b) 'Butterfly'. While the source images are recovered, there is some residual noise.

Figure 3.13: Scatterplot of the 'Cows & Butterfly' estimated images, after having been unmixed by IFA.
Adelson & Farid in [61]. A pair of images was acquired through a linear
polarizer in front of Renoir’s ‘On the Terrace’ painting framed behind glass
with a reflection of a mannequin. Note that the mixing is not artificial in
this case. The objective is to remove the reflection, so that the painting is
shown intact.
The observations are shown in Fig. 3.19 and the reconstructions in Fig. 3.21.
The threshold, τ , in this case was empirically set such that the resulting
components balance image reconstruction accuracy and smoothness: the fi-
nal reconstruction is a trade-off between noisy and over-blurred results.
Figure 3.14: Probability density functions of the sources as estimated from the IFA algorithm. M = 4 Gaussian components were used for each source model. The brown curve is the estimated mixture density.

Figure 3.15: SCA-IT results after convergence of the algorithm: separated images of 'Cows' and 'Butterfly'.
Figure 3.16: Scatterplots of the original and the SCA-IT unmixed images.
Figure 3.17: The 'Cows & Butterfly' estimated source space, {(ŝ1,n, ŝ2,n)}Nn=1, after convergence of the SCA-IT algorithm.
Figure 3.18: Empirical histograms of the 'Cows & Butterfly' sources after convergence of the SCA-IT algorithm.
Figure 3.19: ‘Terrace’ (reflections) dataset: observations.
In Figs 3.20 and 3.22 we can see their respective scatterplots. Finally, the
empirical histograms of the estimated sources are shown in Fig. 3.23. No-
tice how the structure in the data, appearing as two elongated clusters
along non-orthogonal directions in data space, emerges after the algorithm
has converged; in particular, compare Fig. 3.20 (before) with Fig. 3.22 (af-
ter). We finally note that the algorithm of Farid and Adelson was specially
constructed for this separation task, while our algorithm is a general-purpose
one.
3.4.2 Blind Separation of More Sources than Mixtures
The goal of this experiment is to reconstruct (‘unmix’) a set of observed sen-
Figure 3.20: ‘Terrace’ dataset: Scatterplot of observations, (x1,n, x2,n).
Figure 3.21: ‘Terrace’ dataset: Estimated (unmixed) source images.
Figure 3.22: 'Terrace' dataset: Scatterplot of estimated sources after 100 iterations of SCA/IT.
Figure 3.23: 'Terrace' dataset: Empirical histograms of estimated sources after 100 iterations of SCA/IT.
sor signals, xd, d = 1, . . . , D, into their constituent generators, the ‘sources’,
sl, l = 1, . . . , L. This is an instance of the blind inverse problem known as ‘the
cocktail party problem': we seek the "causes", {sl}Ll=1, that generated the
observations, but we do not know the generative mechanism in advance (hence
‘blind’). That is, we also have a system identification problem. In the par-
ticular case we are discussing, the signals are timeseries, xd = (xd,1, . . . , xd,T )
and sl = (sl,1, . . . , sl,T ), where the time-index runs from t = 1 to t = T .
We can think of the observed signals as generated from D “microphones”
placed at different positions, d = 1, . . . , D. If the number of unknown source
signals, L, is larger than the number of observed ones, D, the problem is
called ‘overcomplete’. It is of course a much more difficult problem, since the
number of unknowns is larger than the number of observations. To make the
problem well-posed, constraints must be imposed onto the learning system,
usually encoded in the form of priors.
In order to solve the inverse problem an appropriate mathematical model
of the generative mechanism must be posed, along with a model of the un-
known source signals37. We assume a linear mixing (superposition) model
37In the probabilistic framework the model is the assumed probability distribution of the source amplitudes, p(sl,t).
for the observed signals. That is, we assume, at each time-point t,

    xd,t = ∑l=1,...,L Adl sl,t + εd,t .

The term εd,t is a noise process. The ‘mixing’ (or ‘system’) matrix, A = [Adl], is also unknown. It is formed by all combinations
of indices d and l, i.e. for each ‘path’ l → d, and captures some essential
estimated by the algorithm. Note that, in the timeseries case, additional
modelling assumptions about the sources can be made; however no special
structure in the d indices is assumed. We assume, however, that the time-
series is sparsely representable in a wavelet basis. It turns out that, despite
its simplicity, the above formulation works surprisingly well for a wide variety
of problems. A representative result is shown next.
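As a concrete illustration of the generative model just described, the following toy sketch (not the thesis code) draws L = 3 sparse, Laplacian-distributed sources and mixes them into D = 2 noisy observations; the mixing matrix is the one used in the experiment below, and the noise level is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: L = 3 unknown sources, D = 2 observed mixtures (overcomplete).
L, T, D = 3, 1000, 2

# Sparse (Laplacian) source signals: most samples are close to zero.
S = rng.laplace(scale=0.1, size=(L, T))

# Mixing matrix with unit-norm columns (the 'mixing directions').
A = np.array([[1.0, 1 / np.sqrt(2),  1 / np.sqrt(2)],
              [0.0, 1 / np.sqrt(2), -1 / np.sqrt(2)]])

# Linear instantaneous mixing with additive sensor noise:
#   x_{d,t} = sum_l A_{dl} s_{l,t} + eps_{d,t}
noise = 0.01 * rng.standard_normal((D, T))
X = A @ S + noise

print(X.shape)  # (2, 1000)
```

The columns of A are the directions along which the data points concentrate in the scatterplots shown later.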
Experimental Results. We experimented with a set of three sound sources
(from Te-Won Lee et al. [107], downloaded from the authors’ Web site), shown in
Fig. 3.24, that were mixed into two observations using the mixing matrix

    A = [ 1    1/√2     1/√2
          0    1/√2    −1/√2 ] ,

as in [107], giving the mixed signals shown in Fig. 3.25. The scatterplot of
the two observations is shown in Fig. 3.26.
Figure 3.24: Overcomplete blind source separation: true sources, sl, l = 1, . . . , 3 (unknown).
Figure 3.25: Overcomplete blind source separation: observed sensor signals, xd, d = 1, 2.
Figure 3.26: Overcomplete blind source separation: scatterplot of sensor signals, (x1,n, x2,n).
Following Zibulevsky and Pearlmutter [176], and Li, Cichocki and Amari
[111], a first estimate of the mixing matrix, A, was obtained by clustering
in the transform domain38 using a spherical k-means algorithm39 [49]. In
particular, the spectrogram of the observations, Xl(ωk, tm), where ωk is the
k–th frequency bin and tm the m–th time frame, was computed40 and is shown in Fig. 3.27. The
scatterplot of the data points in spectrogram-space, Fig. 3.28, clearly reveals
much more structure than the corresponding one in the original observation
space. Note that the spectrogram was only used for the initialization of
the mixing matrix. The SCA-IT algorithm itself used the standard wavelet
transform as described in Alg. 2.
38Note that we have control over the data matrix, X, because it is observable. Therefore we can perform various exploratory data analyses on X and select the one that works best for the particular situation at hand.
39In a Bayesian formulation one could use e.g. mixtures of von Mises distributions [8], which are a good model for directional data.
40A Hamming window of length R = 256 was used for the short-time Fourier transform, with Nw = 512 points per slice and ‘hop’ distance Lhop = 128. The sampling rate of the signals was fs = 8000 Hz.
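The short-time Fourier transform of footnote 40 can be sketched with SciPy; the signal below is a random stand-in for one observed mixture, and the parameter names follow scipy.signal.stft, where the hop distance is nperseg − noverlap.

```python
import numpy as np
from scipy import signal

fs = 8000                        # sampling rate, as in footnote 40
rng = np.random.default_rng(1)
x = rng.standard_normal(25000)   # stand-in for one observed mixture x_d

# STFT with the parameters of footnote 40: Hamming window of length
# R = 256, Nw = 512 FFT points per slice, hop L_hop = 256 - 128 = 128.
f, t, Zxx = signal.stft(x, fs=fs, window='hamming',
                        nperseg=256, noverlap=256 - 128, nfft=512)

spec = np.abs(Zxx)       # magnitude spectrogram; columns are time slices
print(spec.shape[0])     # Nw/2 + 1 = 257 one-sided frequency bins
```

The clustering step described above is then run on the unit-normalized data points in this transform domain.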
Figure 3.27: Overcomplete blind source separation: spectrogram, Xl(ωk, tm), of the sensor signals.
Figure 3.28: Overcomplete blind source separation: scatterplot of sensor signals in the transform domain, (X1,n, X2,n), where n = n(ωk, tm) is the single index that results from flattening each two-dimensional spectrogram image into a long vector.
Clustering on the unit hypersphere SD−1 using the spherical k–means
algorithm results in the following direction angles for the mixing vectors:
θ1 = 34.9°, θ2 = 89.9°, θ3 = 144.1°. These were used to compute the initial
mixing matrix,

    Ainit = [ cos(θ1)  cos(θ2)  cos(θ3) ]  =  [ 0.82031  0.00083  −0.81029 ]
            [ sin(θ1)  sin(θ2)  sin(θ3) ]     [ 0.57192  1.00000   0.58603 ] .
Figure 3.29: Overcomplete blind source separation: estimated sources, sl, l = 1, . . . , 3, by the SCA/IT algorithm.
We also let the threshold vary such that the (subjective) criterion of
clarity of the separated signals was maximized. The “optimal” threshold
was set to τ⋆ = 10−3. The SCA-IT algorithm was then run for 1000+1000
iterations, first keeping the mixing matrix fixed to its initial value and then
unclamping it; the reconstructed sources are shown in Fig. 3.29. The
reconstruction accuracy as measured by the correlation between the true and
estimated sources was r11 = 0.860, r22 = 0.938, and r33 = 0.949, respectively.
For comparison, the corresponding results from Lee et al. [107] were r11 =
0.890, r22 = 0.927, r33 = 0.933. Girolami [76] reports results on the same
dataset that are better than Lee et al.’s, however41.
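The accuracy measure used here can be sketched as follows; the greedy matching of estimated to true sources (to resolve the usual sign and permutation ambiguities of BSS) is an illustrative simplification, not the evaluation code of the thesis.

```python
import numpy as np

def match_and_score(S_true, S_est):
    """Pair each true source with its best-matching estimate, up to the
    usual BSS sign and permutation ambiguities, and report |correlation|.
    (Greedy matching; a sketch, not an optimal assignment.)"""
    L = S_true.shape[0]
    # np.corrcoef stacks the two matrices; the off-diagonal block holds
    # the correlations between every (true, estimated) pair.
    C = np.abs(np.corrcoef(S_true, S_est)[:L, L:])
    order = C.argmax(axis=1)
    return order, C[np.arange(L), order]

# Toy check: estimates are sign-flipped, permuted, noisy copies of the truth.
rng = np.random.default_rng(2)
S = rng.laplace(size=(3, 5000))
S_hat = np.vstack([-S[2], S[0], S[1]]) + 0.05 * rng.standard_normal((3, 5000))
order, r = match_and_score(S, S_hat)
print(order)   # [1 2 0]: each true source is found among the estimates
```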
Despite the rather “crude” value for the threshold, the fact that the tem-
poral structure of the signals was not taken into account, and the fact that we
“ignored” sensor noise for this experiment, the results are good. We observe,
however, that the sparser the sources are (in the wavelet dictionary used),
the better the results of the algorithm. This can be seen from the scatterplot
of the observations, Fig. 3.26, where the source at angle 45° is very diffuse.
This source is estimated less well by the algorithm, because it cannot find the
corresponding separating direction as well as that of the other sources (recall
that estimation of the mixing matrix is performed in the original observation
domain by SCA-IT). In fact, the wavelet transform is not the best basis
for audio source separation. Zibulevsky et al. [177] used a short-time Fourier
transform as their basis, Φ, which according to their results proved to be the
most efficient representation for voice signals. Moreover, as noted by Li et
al. [111], if the signals overlap significantly (in the transform domain) perfect
reconstruction is not possible, even if we know the mixing operator exactly.
Experimenting with various bases is still an open research question.
3.4.3 Extracting Weak Biosignals in Noise
Many physical and biological phenomena can be better studied in a ‘hidden’
signal space that best reveals the internal structure of the data. In many cases
this space may be adaptively inferred from the data itself42 by searching for
a transformation from the observation space to that hidden space such that
the resulting (transformed) dataset has certain desired properties, captured
41Both Lee et al. [107] and Girolami [76] use a signal-to-noise ratio (SNR) measure to report their results.
42As opposed to using fixed transformations, such as wavelet analysis, for example.
by appropriate prior constraints. Often the data analysis problem can be
formulated as a decomposition problem. In the functional neuroimaging field,
for example, researchers have proposed the decomposition of spatio-temporal
functional MRI data into “independent” spatial components [125]. An fMRI
dataset is obtained by measuring the ‘haemodynamic response’ (change in
blood flow) of tiny volume elements (‘voxels’) of the brain43, denoted by v
and serving as space indices, to an external stimulus at a particular time,
t [104], [133]. Measurements are performed in a brain scanner specifically
designed for this purpose. The primary form of fMRI uses the blood oxygen
level-dependent (BOLD) contrast44 [134]. The aggregate of all voxels in
space, Ω = {v}, gives us a three-dimensional picture of the brain. This is a
particular “snapshot” at time t; repeating this process for a period of time,
discretized at 1, . . . , T , allows us to record the spatially-distributed dynamics
in the brain. The haemodynamic response is related to the neuronal activity
in the brain, and thus, indirectly, neuroscientists may capture certain aspects
of relevant brain processes. (This is an oversimplified description of the
process, however it is sufficient for our purposes.)
Mathematical Model of fMRI Source Separation. Mathematically,
the observations, X = {xt,v : t ∈ {t1, . . . , tT}, v ∈ Ω}, where Ω ⊂ E3 is a
volume in 3D space and {t1, . . . , tT} ⊂ R+ is a time interval sampled at ∆t
(called ‘repetition time’ and denoted by TR in functional neuroimaging), form
a three-dimensional scalar-valued45 time-varying field, (t, v) 7→ xt,v. It is of
interest to neuroscientists to decouple the time-evolution of voxels from the
43In a sense, this is equivalent to the Eulerian formulation in Physics, where a quantity is monitored inside a control volume fixed in space.
44Contrast agents are substances whose purpose is to alter the magnetic susceptibility of tissue or blood, leading to changes in MR signal.
45More precisely, the data are measured in k–space (Fourier space), so the measurements are actually complex; however, these are usually transformed to real space before further processing.
space-variation in the brain response. Therefore, the aim here is to factorize
this spatiotemporal field in terms of a factor capturing the spatial variation
times a factor capturing the dynamics. “Unfolding” the domain Ω into a
linear index set46, IX = {1, . . . , N}, where N is the total number of voxels,
N = |Ω|, and putting the observations at time t, xt, into the t–th row of
a T × N data matrix, X, the analysis of fMRI data into components can be
expressed as

    X ≈ A · S = (Sᵀ · Aᵀ)ᵀ ,    (3.20)

where X is T × N, A is T × L, and S is L × N (so that Sᵀ is N × L and Aᵀ is L × T).
A decomposition is sought such that the set of the column vectors of the “mixing”
matrix47 A, {al}, l = 1, . . . , L, contains the prototypical ‘time-courses’, {Al(t) :
t ∈ {t1, . . . , tT}}, l = 1, . . . , L, and the set of the rows of S contains the corresponding
‘spatial maps’, {Sl(v) : v ∈ Ω}, l = 1, . . . , L.
l=1. The spatial maps can be thought of
as an image of the response of the brain to the stimulus, at each voxel, v.
The observed time-course of voxel v, according to this model, is obtained by
modulating those prototypical timecourses, {al}, l = 1, . . . , L, by the weights sl,v:

    xv = ∑l=1,...,L sl,v al + εv ,    (3.21)
where the residuals, εv, can be thought of as “noise”. Under certain neuro-
physiological assumptions (see [71], [124], for example), a subset of those pro-
totypical timecourses should somehow “resemble” the stimulus timecourse.
46This unfolding operation is a bijection: we can always recover the 3–dimensional volume if we need to. In fact, after processing, the results are displayed in their original spatial form.
47In using the notation shown in Eq. (3.20), we adopt the ‘spatial’ formulation of component analysis (sCA) for the problem of decomposing fMRI timeseries. Strictly speaking, in traditional BSS terminology the matrix Sᵀ should be called the mixing matrix, since it is this matrix that contains the mixing coefficients, {sv,l}, l = 1, . . . , L, for the “regressors” {al}, l = 1, . . . , L, for each voxel, v (see Eq. (3.21) for the observed time-evolution of the voxel, xv). In fact, in the so-called ‘temporal component analysis’ (tCA) model, the dimensionalities are exactly reversed. We will keep the terminology used in the fMRI community here and use the term ‘mixing matrix’ for A.
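The shapes in Eq. (3.20) and the voxel-wise model of Eq. (3.21) can be illustrated on a tiny synthetic example (toy sizes and synthetic maps; not real fMRI data):

```python
import numpy as np

rng = np.random.default_rng(3)
T, L = 60, 2                      # scans and components
nx, ny = 8, 8                     # a tiny "slice"; N = |Omega| voxels
N = nx * ny

A = rng.standard_normal((T, L))   # columns: prototypical timecourses a_l
maps = np.zeros((L, nx, ny))      # spatial maps S_l(v) on the 2-D grid
maps[0, 1:4, 1:5] = 1.0           # one small "activated" rectangle per map
maps[1, 4:7, 3:7] = 1.0
S = maps.reshape(L, N)            # unfold Omega into the linear index set I_X

# Eq. (3.20): spatial formulation, X (T x N) ~= A (T x L) . S (L x N)
X = A @ S + 0.01 * rng.standard_normal((T, N))
print(X.shape)                    # (60, 64)

# Eq. (3.21): the timecourse of voxel v is sum_l s_{l,v} a_l (plus noise)
v = np.ravel_multi_index((2, 3), (nx, ny))
x_v = sum(S[l, v] * A[:, l] for l in range(L))
assert np.allclose(x_v, X[:, v], atol=0.1)

# The unfolding is a bijection: maps can always be re-folded for display.
assert np.array_equal(S.reshape(L, nx, ny), maps)
```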
Component analysis studies exploit this fact in practice to assess the success
of the decomposition, by testing whether some al resulting from the analysis
correlate well with the stimulus48 (and the resulting spatial maps show activated
regions that are neuroscientifically plausible, so that, for example,
if we have a visual stimulus, we expect (parts of) the visual cortex to activate;
complex interactions with other brain areas may also be present). The
hidden data space mentioned in the introductory paragraph is exactly the
S–space. Our task is then to estimate both {sl,v} and {al}.
Extracting useful neuroscientific information from fMRI datasets is far
from trivial, however. fMRI images are very noisy and the signals of interest,
the so-called ‘activations’, typically have very small signal power. To make
things worse, the changes we are looking for are similarly small in scale, on
the order of 3% (Lazar, [106] p. 54). Finally, the measurements are in a very
high dimensional space, making data analysis even more challenging.
The above issues have led to a long line of research on relevant methods.
Component analysis, especially ICA, has been successfully applied to 4D
spatiotemporal fMRI data in two different “flavors”: temporal ICA (tICA),
where each source is a timeseries, and spatial ICA (sICA), where each source
is an image [161]; see [23] for a nice overview. McKeown et al. [125] was
probably the first research paper to propose the decomposition of fMRI data
into independent spatial components. Since then, numerous studies have been
conducted under this decomposition framework in fMRI, with considerable
success. To address certain deficiencies of “classical” ICA, such as square-
only mixing and lack of dimensionality and noise estimation, Beckmann and
Smith [15] propose a probabilistic version of ICA (PICA). PICA extends
spatial ICA using a noise model in the form of a signal subspace identifica-
48More precisely, they are tested for correlation with the expected time-course, which is obtained by convolving the experimental paradigm with the haemodynamic response function.
tion method, utilizing singular value decomposition/probabilistic PCA; this
allows model order selection. The separation is obtained via the FastICA
algorithm. In addition, inference of the activated voxels is obtained, after
separation, using an (external) mixture of Gaussians model such that the
bulk of the voxels, which correspond to non-activation, is modelled by a sin-
gle ‘fat’ Gaussian, while the activated ones are modeled using the rest of the
Gaussian components. Calhoun et al. [24] studied the spatial and temporal
ICA decomposition of fMRI data-sets containing pairs of task-related wave-
forms. Four novel visual activation paradigms were designed for this study,
each consisting of two spatiotemporal components that were either spatially
dependent, temporally dependent, both spatially and temporally dependent,
or spatially and temporally uncorrelated, respectively. Spatial and tempo-
ral ICA was then used on these datasets. The general result was that each
ICA mode “failed” to separate the components that were not independent
according to its assumptions, i.e. spatially independent for sICA and tem-
porally independent for tICA, and “worked” otherwise. Inspired by this
work, Benharrosh et al. [17] reported results of applying ICA algorithms49
to fMRI from the point of view of relating the statistical and geometrical
characteristics of the data to the properties of the algorithm. In particular,
issues such as separatedness (overlapping sources), uncorrelatedness and
independence were discussed. Continuing on this theme, Daubechies et al.
[46] conducted a large-scale experimental and theoretical investigation of the
underlying principles of decomposing brain fMRI data using ICA. A new
visual-task block-design fMRI experiment, inspired by [24], using a super-
position of two spatiotemporal patterns was used. There were 17 runs for
each subject, each run resulting in 238 full brain volumes of 15 slices each.
This research led to the rather surprising result that, while ICA algorithms
are in theory built to seek independent components, their apparent success
is due to the sparsity of the sought-for sources, at least in the fMRI
decompositions. There is no biological reason one should strive for
independence in brain processes. Indeed, sparsity was identified as the main
contributing factor in the success of the method for fMRI. The paper therefore
concluded with the proposal of building models and algorithms that explicitly
optimize for sparsity.
49Using InfoMax as implemented in NISica and FastICA as implemented in Melodic.
While the present text is more ‘methods’-oriented, taking a machine
learning perspective on data analysis/BSS, this work was mainly inspired by
experiments of the neuroscience community in functional MRI [17], and it is
a continuation of the ideas suggested in Daubechies et al. [46].
Simulated Data
In this subsection we repeat the simulations of Daubechies et al. [46] us-
ing the SCA-IT model. This experiment is designed to assess the quality of
separation of various BSS algorithms in well-controlled situations. In par-
ticular, it is designed to help identify the regimes under which the original
BSS algorithm used in [46], that is ICA, may succeed or fail. As was shown
in [77] and [46], ICA surprisingly sometimes fails in situations that should,
by design, succeed, i.e. in some mixing situations where the components are
independent, or nearly independent, and, even more surprisingly perhaps,
manages to successfully separate mixtures which are far from independent.
The data were generated by simulating two spatial components and mix-
ing them into two “observations”, at times t1 and t2. Each component consisted
of an “activated” region (a rectangle) embedded in a noisy background.
We use the following notation: Sl, l = 1, 2, denotes the l–th component
(‘spatial map’) reconstructed by the algorithm, with domain Ω. Ωl ⊆ Ω is
the activated region, whose support appears as the lighter area. Bl are “background”
noise processes, whose supports are the complements in Ω with respect to
the activated regions, Bl = Ω \ Ωl. We will also find it useful to define B,
the complement in Ω with respect to the union of the activated regions,
B = Ω\ (Ωl∪Ωl′). The activated regions of the two components may overlap
in Ill′ = Ωl ∩ Ωl′ . Note that since the overlap, Ill′, differs among different
cases, |B| is not constant. Finally, Zl′ = Ωl′ \Ωl, where l′ = 2− l+ 1, is ‘the
other zone’, i.e. the “footprint” of the other activation in this source. The
sources were then defined as
    Sl(v) = χΩl(v) xl,v + [1 − χΩl(v)] yl,v ,    l = 1, 2 ,
where χA(v) is the characteristic function on A and where x1, x2, y1, y2 are
four independent random variables of which, as v ranges over Ω, the xl,v and
yl,v are independent realizations. Finally, the mixtures were generated by
X(t1, v) = 0.5S1(v) + 0.5S2(v)
X(t2, v) = 0.3S1(v) + 0.7S2(v).
Each activated spatial process had a cumulative distribution function (CDF)
of the form Φx(u) = 1/(1 + e^(2−u)) or Φx(u) = 1/(1 + e^(2(2−u))), depending on the example.
The background process had a CDF of the form Φy(u) = 1/(1 + e^(−1−u)). The
probability density functions of x and y are then ϕx(u) = Φ′x(u) and ϕy(u) =
Φ′y(u). The above functional forms were chosen so that, for the first form, the
ICA algorithms used provided “optimal detectability”, as the nonlinearity
used in the InfoMax ICA algorithm was 1/(1 + e^(−x)), adapted to heavy-tailed
components, while, for the second, there was a slight mismatch, as can be
expected in real applications. Finally, the non-linear function approximating
the negentropy in the FastICA algorithm was g(y) = y² and the “symmetric
approach” of the algorithm was used.
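This generative recipe can be sketched via inverse-CDF sampling of the logistic laws above; the 100 × 100 grid size assumed for Ω below is an illustration choice, not a value stated in this section.

```python
import numpy as np

rng = np.random.default_rng(4)

def logistic_ppf(p, loc, scale):
    """Inverse CDF of the logistic law Phi(u) = 1/(1 + exp(-(u - loc)/scale)).
    Phi_x(u) = 1/(1 + e^(2-u)) is logistic(loc=2, scale=1); the second form,
    1/(1 + e^(2(2-u))), is logistic(loc=2, scale=1/2); the background
    Phi_y(u) = 1/(1 + e^(-1-u)) is logistic(loc=-1, scale=1)."""
    return loc + scale * np.log(p / (1.0 - p))

shape = (100, 100)                    # assumed grid size for Omega

x1 = logistic_ppf(rng.uniform(size=shape), loc=2.0, scale=1.0)   # activation
y1 = logistic_ppf(rng.uniform(size=shape), loc=-1.0, scale=1.0)  # background

# Case 1 geometry: Omega_1 = {11,...,40} x {21,...,70} (1-based indices).
chi1 = np.zeros(shape)
chi1[10:40, 20:70] = 1.0              # characteristic function chi_{Omega_1}

# S_l(v) = chi_{Omega_l}(v) x_{l,v} + [1 - chi_{Omega_l}(v)] y_{l,v}
S1 = chi1 * x1 + (1.0 - chi1) * y1

# With a second component S2 built the same way, the mixtures would be
# X(t1) = 0.5 S1 + 0.5 S2 and X(t2) = 0.3 S1 + 0.7 S2.
```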
Each case is then defined by two parameters:
1. The overlap between activated regions, Ω1 and Ω2. This also defines
the spatial (in)dependence between S1 and S2, according to Golden [77]
and Daubechies et al. [46]; see below.
2. The functional form of the PDF (CDF) of the activations, Φ1,x(·) or
Φ2,x(·).
Geometrically, the objects to be defined for each case were then: Ωl, Zl′, Bl,
Ill′ , as defined above (note that l′ = 3 − l). Golden [77] and Daubechies et
al. [46] give the condition for spatial (in)dependence between S1 and S2 as:

    p12 = p1 p2  ⇔  |Ω1 ∩ Ω2| / |Ω| = (|Ω1| / |Ω|) · (|Ω2| / |Ω|) .    (3.22)
Now, the cases are described on p. 10418 (i.e. p. 4) of Daubechies et
al. [46]. In particular,
• Case 1: In this example, Ω1 = {11, . . . , 40} × {21, . . . , 70}, and Ω2 =
{31, . . . , 80} × {41, . . . , 80}. By Eq. 3.22, S1 and S2 are independent.
For the CDF, Φx(u), we choose Φ1,x(u) = 1/(1 + e^(2−u)).
• Case 2: All choices are identical to Case 1, except that the CDF, Φx,
is picked differently: Φx(u) = Φ2,x(u) = 1/(1 + e^(2(2−u))). The components
S1 and S2 are still independent.
• Case 3: A different variant of Case 1. The location of Ω2 is now shifted
to {43, . . . , 92} × {53, . . . , 92}. The new Ω2 and Ω1 do not intersect,
i.e., S1 and S2 are spatially separated, not independent. The PDFs of
S1 and S2 are unaffected.
• Case 4: The sets Ω1 and Ω2 are as in Case 3, but Φx is as in Case 2;
S1, S2 are not independent.
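The independence condition, Eq. (3.22), can be checked numerically for these geometries; the total domain size |Ω| is an assumption here (the case numbers are consistent with a 100 × 100 grid).

```python
def rect(rows, cols):
    """A rectangle {rows[0],...,rows[1]} x {cols[0],...,cols[1]} as a set of
    (row, col) index pairs."""
    return {(i, j) for i in range(rows[0], rows[1] + 1)
                   for j in range(cols[0], cols[1] + 1)}

n_omega = 100 * 100                    # assumed |Omega|
O1 = rect((11, 40), (21, 70))          # Omega_1 (all cases)
O2 = rect((31, 80), (41, 80))          # Omega_2 (Cases 1, 2)
O2_shifted = rect((43, 92), (53, 92))  # Omega_2 (Cases 3, 4)

def independent(A, B, n):
    """Eq. (3.22): |A ∩ B|/n = (|A|/n)(|B|/n), checked in exact integers."""
    return len(A & B) * n == len(A) * len(B)

print(independent(O1, O2, n_omega))          # True:  Cases 1, 2 independent
print(independent(O1, O2_shifted, n_omega))  # False: Cases 3, 4 not independent
```

With these sizes, |Ω1| = 1500, |Ω2| = 2000 and |Ω1 ∩ Ω2| = 300, so the condition 300/10000 = 0.15 · 0.2 holds exactly for Cases 1 and 2.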
Regarding the above four cases, we note that Cases 1 and 2 should be sepa-
rable by ICA, while Cases 3 and 4 need not be.
A source separation algorithm is deemed successful in separating the components
in this task if there is no remaining “shadow” of ‘the other zone’. This
is reflected in the histograms as a line-up of the corresponding areas. In
particular, if Zl′ in the source Sl cannot be differentiated from B, then the
algorithm was successful. For overlapping sources (that is, overlapping Ω1,
Ω2, Cases 1 and 2, where the overlap is Ill′ ≠ ∅), we must test for the additional
requirement that Ill′ is not discernible from Ωl.
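In this section the criterion is judged visually from the histograms; a hypothetical quantitative proxy (not used in the thesis) would be a two-sample Kolmogorov–Smirnov test between the ‘other zone’ and background samples:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)

def shadow_remains(other_zone, background, alpha=0.01):
    """Two-sample KS test as a proxy for the visual histogram criterion:
    if the samples from Z_l' and from B are statistically distinguishable,
    a 'shadow' of the other component remains in this source."""
    return ks_2samp(other_zone, background).pvalue < alpha

# Successful separation: both regions follow the same background law.
bg = rng.logistic(loc=-1.0, size=2000)
zone_clean = rng.logistic(loc=-1.0, size=1000)
# Failed separation: the other zone carries residual activation energy.
zone_shadow = rng.logistic(loc=-1.0, size=1000) \
    + 0.5 * rng.logistic(loc=2.0, size=1000)

print(shadow_remains(zone_shadow, bg))   # True: a shadow is detected
```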
The results of applying ICA to the data set, taken from [46], are shown
in Fig. 3.30. Both Infomax ICA and FastICA fail to separate Case 2, while
they manage to separate the non-independent Case 4. On the contrary, as
can be seen from Fig. 3.31, the SCA-IT model successfully separates all four
cases. The threshold τ was empirically set to 0.1 for all four cases for this
experiment and we ran 500 iterations of the algorithm.
Real Brain Data
We next apply the SCA-IT algorithm to a particular real-world fMRI ex-
periment. This is an fMRI timeseries used by Friston, Jezzard, and Turner
in [71]. It was chosen because it is small enough to illustrate the SCA/IT
model in an actual neuroimaging analysis situation without requiring special
programming techniques for handling very large datasets50. Results from
more complex fMRI data and comparisons with the state-of-the-art will be
presented in Chapter 5.
For this data set, brain scans were obtained every 3s from a single subject
using a 4.0–T scanner, resulting in 5mm-thick image slices of size 64 × 64
voxels through the coronal plane. Sixty scans were acquired. The stimula-
50Furthermore, it allows us to compare results with a stochastic (Monte Carlo) solution in the future.
Figure 3.30: Simulated fMRI data. Unmixing the 4 mixtures of rectangular components described in the text. (Left) PDFs of the original 2 components, and of the components as identified by InfoMax and by FastICA. Color coding: the whole component (blue), the active region (red), the ‘background’ (green); in the ICA outputs, the purple PDF corresponds to the zone associated to ‘the other component’; the background PDF is then for the area outside Ω1 and Ω2. Separation is completely successful only when the purple and green PDFs line up, i.e., in Cases 1, 3 and 4, but not in Case 2. (Right) False-color rendition of unmixed components for Cases 2 and 4, as obtained by InfoMax, FastICA, and a more sophisticated ICA algorithm that learns PDF distributions [6], [151]. InfoMax and FastICA do better in Case 4 (separated components) than in Case 2 (independent components); for the pdf-learner it is the converse. (Figure from ref. [46].)
Figure 3.31: Results of the SCA-IT model applied on the simulated data of subsection 3.4.3, for Cases 1–4. Each separated component is shown with its corresponding set of histograms for the areas mentioned in the text to its right. The color scheme is the following: component (Sl) blue, activated area (Ωl) red, background (Bl) green, ‘other zone’ (Zl′) magenta, and intersection (Ill′) yellow. The green vertical line has abscissa equal to the background mean, and the red one abscissa equal to the activation mean. The goal of the separation is to align the histograms of the background and ‘other zone’ so as to be indistinguishable, without ‘ghosting’ residuals; the same for the activation and intersection regions.
tion, provided by light-emitting goggles at 16Hz, followed a block design,
alternating between ‘off’ and ‘on’ states. In particular, it was off for the first
10 scans (30s), on for the next block of 10, off for the third, etc. See ref. [71]
for more neuroscience-specific details. The original data were interpolated
to 128× 128 voxels in [71], so we followed the same procedure, bringing the
data to the same resolution by simple bicubic interpolation, in order to be
able to compare results.
The results from a Statistical Parametric Mapping (SPM) analysis of
the data from Friston, Jezzard, and Turner [71] are shown first. Figure 3.32
shows the extracted spatial map and Fig. 3.33 the corresponding timecourse.
Due to the nature of this method, which essentially uses correlation in order
to extract activated voxels, large “blobs” of voxels spanning large brain areas
appear to be activated; we see a typical result of that method in the spatial
map of Fig. 3.32. The thresholded spatial map (Fig. 3.34) gives a clearer
picture, showing regionally specific activations. A p–value of 0.05 was used,
resulting in a threshold of 3.97. Figure 3.34 shows the resulting ‘excursion
set’. We will use this semi-supervised result as our ‘gold standard’ to
assess the quality of the separation of SCA for this dataset.
We compared this result with the result from an Independent Component
Analysis (ICA) of the same data. ICA is widely used as one of the major
unsupervised learning methods in fMRI analysis and is now regarded as the
standard for ‘model-free’ analyses. As in Daubechies et al. [46], for each of
the separated components we computed the correlation coefficient, r, between
the associated timecourse, al(t), and the ‘expected timecourse’, that is the
time-paradigm, ϕ(t), convolved with the haemodynamic response function,
h(t). The component with the highest value of r was identified as the CTR
component map, C(v).
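The CTR-selection step can be sketched as follows; the gamma-shaped haemodynamic response function below is a crude stand-in (the actual HRF model is not specified in this section), and the component matrix is synthetic.

```python
import numpy as np

T, TR = 60, 3.0                          # 60 scans, one every 3 s
# Block-design paradigm phi(t): alternating off/on blocks of 10 scans.
phi = np.tile(np.r_[np.zeros(10), np.ones(10)], 3)

# A crude gamma-like HRF (an assumption made here for illustration).
t = np.arange(0, 30, TR)
h = (t / 6.0) ** 2 * np.exp(-t / 6.0)
expected = np.convolve(phi, h)[:T]       # expected CTR timecourse, h (x) phi

def pick_ctr(A, expected):
    """A: T x L matrix of component timecourses a_l; return the index of
    the component with the highest |r| against the expected timecourse."""
    r = [abs(np.corrcoef(A[:, l], expected)[0, 1]) for l in range(A.shape[1])]
    return int(np.argmax(r)), max(r)

# Toy check with synthetic components, one of them task-related.
rng = np.random.default_rng(6)
A = rng.standard_normal((T, 4))
A[:, 2] = expected + 0.3 * rng.standard_normal(T)   # plant a CTR component
l_star, r_star = pick_ctr(A, expected)
print(l_star)
```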
The analogous ICA decomposition, however, did not produce a single
spatial map that exhibited all the “features” (voxel clusters) shown in the
Figure 3.32: Spatial map from SPM (Statistical Parametric Mapping) for the real brain dataset (coronal plane). (From Friston et al. [71].)
Figure 3.33: Timecourse from SPM (dots). The stimulus (‘time-paradigm’, ϕ(t)) is the box-like, continuous curve (blue). (From ref. [71].)
Figure 3.34: Thresholded spatial map from SPM using threshold u = 3.97 (p–value 0.05). (From ref. [71].)
SPM spatial map. That is, although the ICA CTR map s1 (i.e. the “best”
component) is quite good in terms of correlation coefficient (r = 0.916) and
seems to capture most of the spatial pattern, comparing it with the excur-
sion set of Fig. 3.34 it is evident that it lacks certain details. In order to
capture the missing spatial information, we selected additional IC compo-
nents, provided that their corresponding timecourses correlated most with
the stimulus (above a certain standard threshold, r > 0.3). (It turned out
that only one additional component was eventually required. Its correlation
coefficient was r = 0.896.) Figures 3.35 and 3.36 show the resulting spatial
maps, s1, s2, and associated timecourses, a1, a2, from ICA. The component
maps were ranked according to their similarity to the SPM map, taken here
as the “ground truth”. The missing details of the spatial map s1 indeed
appear in the second spatial map, s2. In other words, and comparing the
ICA maps with the reference thresholded SPM map of Fig. 3.34, we see that
activation information is spread between two components in this case.
Figure 3.35: Spatial maps s1, s2 from ICA.
Figure 3.36: Timecourses a1, a2 from ICA, corresponding to the spatial maps s1, s2 of Fig. 3.35. Each panel shows the IC timecourse, the experimental design, and the expected CTR timecourse.
Sparse Component Analysis
We ran SCA-IT on the dataset in order to detect ‘Consistently Task Related’
(CTR) (“predictable”) components. Figure 3.37 shows the CTR spatial map,
s1, extracted by SCA. The corresponding timecourse of the component, a1,
shown in Fig. 3.38, exhibits very good correspondence with the expected timecourse.
The correlation coefficient was rSC1 = 0.940. This component can be
interpreted as one that is due to a single physiological process. It matches
the result from Friston et al. [71] very well. Note, however, that SCA is
completely unsupervised. That is, as in ICA, the prototypical timecourse is
estimated from the data itself. In the CTR map produced by SCA, however
(Fig. 3.37), the pattern of activation appears in a single component, as it
should. Recalling that the reason for the decomposition was to decouple
spatial variation from the temporal evolution of the brain activations in a
parsimonious manner in the first place, these results show that SCA performs
qualitatively better than ICA in extracting those “prototypical timecourses”
and their corresponding CTR maps from the data.
Following Daubechies et al. [46] again, in order to quantitatively com-
pare the spatial maps of the above decompositions, we computed the receiver
operating characteristic (ROC) curves between the (thresholded) SPM map,
taken as a benchmark (‘gold standard’), and the SCA and ICA maps, re-
spectively51. The aim is to quantify the similarity of the spatial patterns
produced by SCA/ICA and SPM. ROC curves capture the trade-off between
‘sensitivity’ and ‘specificity’ of an algorithm as a graph between the false pos-
itive rate, FP (or ‘false alarms’), taken as the abscissa, X, and true positive
rate, TP (or ‘hits’), taken as the ordinate, Y . If we threshold an extracted
spatial map at a certain threshold, γ, we get a binary image in which voxels
51 It is standard practice, in the use of component analysis for fMRI, to threshold the CTR component obtained. If a binary ground truth of activation is available, one may assess the quality of the results by means of ROC curves. Typically, however, such “ground truth” maps are not available [46].
Figure 3.37: Spatial map s1 from SCA. The component whose corresponding timecourse correlated most with the stimulus/expected timecourse was identified as the CTR component.
Figure 3.38: Timecourse a1 from SCA, corresponding to the spatial map s1 (thick curve). The time-paradigm, ϕ(t), is the blue curve, and the expected timecourse, h(t) ⊗ ϕ(t), is the green one.
above the threshold, sl,v ≥ γ, are assigned a label 1, meaning ‘activated’, and
those below it are assigned a label 0, signifying ‘non-activated’ voxels. As
we set the threshold lower and lower, more and more activated voxels will be
included in the ‘activated’ set, but so will false positives. The ROC curve is
constructed here by varying the threshold, γ, of the CTR component maps
between their minimum and maximum values. Then the pair (FP, TP ) be-
comes a γ-parameterized curve, (X(γ), Y (γ)), between the points (0, 0) and
(1, 1). A random (chance-level) result would produce a diagonal linear
segment between these two points, while a perfect result would be a Γ–
shaped curve through the points (0, 0)–(0, 1)–(1, 1). In reality, ROC curves
fall between these two extremes in the ROC space.
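The threshold sweep described above can be sketched in a few lines. The snippet below is an illustrative Python version (our implementation was written in Matlab/Octave; the function names here are ours, not from the thesis): the threshold γ is swept from the map's maximum down to its minimum, and at each value the binary labelling is compared against a binary 'gold standard' map.

```python
import numpy as np

def roc_curve(component_map, truth, n_thresholds=256):
    """Sweep the threshold gamma over the component map; at each gamma,
    voxels with value >= gamma are labelled 'activated' (1). Returns the
    gamma-parameterized (FP, TP) curve from (0, 0) to (1, 1)."""
    gammas = np.linspace(component_map.max(), component_map.min(), n_thresholds)
    truth = truth.astype(bool)
    fp = np.empty(n_thresholds)
    tp = np.empty(n_thresholds)
    for i, g in enumerate(gammas):
        labelled = component_map >= g
        tp[i] = (labelled & truth).sum() / truth.sum()       # 'hits'
        fp[i] = (labelled & ~truth).sum() / (~truth).sum()   # 'false alarms'
    return fp, tp

def auc(fp, tp):
    """'ROC power': area under the (FP, TP) curve, by the trapezoidal rule."""
    return float(np.sum((fp[1:] - fp[:-1]) * (tp[1:] + tp[:-1]) / 2.0))
```

A map identical to the binary benchmark gives the Γ-shaped curve with AUC = 1, while an uninformative map stays near the diagonal with AUC ≈ 0.5.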
In the reference paper of Friston et al. [71], only a 36×60 window (‘region
of interest’, ROI) was used. We performed the analysis into components on
the whole images, but we selected the same ROI after the decomposition for
comparison purposes. Because we did not have the origin of their window
available, we estimated it using a simple correlation matching scheme. The
result is shown in figure 3.39. The peak of the correlation indicated a shift
of origin to coordinates (xo, yo) = (24, 38). Visual inspection showed a near
perfect match.
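A correlation-matching scheme of this kind can be sketched as a brute-force search: slide the reference window over the image and keep the offset with the highest correlation coefficient. The following Python sketch is our own illustration (not the thesis code, and not optimized):

```python
import numpy as np

def match_window(image, window):
    """Slide `window` over `image` and return the origin (x0, y0), in
    (column, row) order, that maximizes the correlation coefficient
    between the window and the corresponding image patch."""
    wh, ww = window.shape
    ih, iw = image.shape
    w = window - window.mean()
    best, origin = -np.inf, (0, 0)
    for y in range(ih - wh + 1):
        for x in range(iw - ww + 1):
            patch = image[y:y + wh, x:x + ww]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (w * w).sum())
            if denom == 0:
                continue  # flat patch: correlation undefined, skip
            r = (p * w).sum() / denom
            if r > best:
                best, origin = r, (x, y)
    return origin, best
```

In practice one would compute this with an FFT-based normalized cross-correlation, but the exhaustive version above makes the “peak of the correlation image” explicit.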
The result of the ROC analysis, shown in figure 3.40, shows a considerable
improvement of the SCA map over the ICA maps. In particular, the ‘ROC
power’, as measured by the ‘Area Under the Curve’ (AUC), was A′SC = 0.861
for the SCA CTR component, and A′IC1 = 0.801 and A′IC2 = 0.721 for the
best and second-best ICA components, respectively. Finally, we follow
Skudlarski [157] and choose an acceptable false positive probability of 0.1,
shown as the vertical line in Fig. 3.40.
We found the selection of the wavelet basis not to be very critical for the
results, as long as we retained a certain level of smoothness. The Daubechies’
most-compactly-supported wavelet D8 was used in the example shown here.
Figure 3.39: Correlation matching between the window used by Friston et al. [71] and the full dataset. The peak of the correlation “image” clearly appears near the center of the rectangle, as a dark “dot”.
For the fMRI data-set discussed here, the run took approximately 10 min of
CPU time for 2000 iterations on an Intel ‘Centrino Duo’™ processor, using
a program written in the Matlab™ language (running the GNU Octave free
software implementation under the Linux OS)52.
Since the defining characteristic of our SCA method is its emphasis on
sparsity, and recalling the analysis in the paper of Daubechies et al. [46], we
conclude that this result is due to imposing a more efficient sparsity prior
on the spatial maps.
We finally note that, as with any mathematical model applied to ‘real-
world’ data, the semantic interpretation of the results of SCA decomposition
can only be performed according to knowledge in the particular application
domain.
52 This time is for decomposing the data into L = T = 60 components. While every effort was made to produce a fully vectorized program, no other optimization was attempted. We expect that using a compiled language would reduce the execution time considerably.
Figure 3.40: Receiver Operating Characteristic (ROC) curves between the SCA-IT derived CTR map and SPM (red curve), and between the ICA-derived maps and SPM (green curve: best ICA component, s1; blue curve: second-best ICA component, s2). The diagonal line represents the line of chance accuracy. The further above this line the ROC curve is, the better the algorithm under investigation performs. This can be quantified using the ‘area under the curve’, AUC. The vertical line separates the acceptable region of the curve (to the left of the vertical line) from the unacceptable one. In the neuroscience community, the acceptable probability of false positives is in general very low, resulting in a strict threshold. The upper bound of 0.1 for FP in this text comes from Skudlarski [157].
3.5 Setting the Threshold
An important modelling decision is the setting of the threshold, τ , in the
algorithm. This value could be set empirically as in subsection 3.4.1 to
obtain a balance between sharpness and noise [45].
We note here that if we pick the value of the threshold in the vicinity
of the “optimal” one then the algorithm converges to a good answer. If the
threshold is set too high, however, we may get overblurred results and if it
is set too low we get noisy ones. One can in theory re-run the experiments
with different threshold values and pick the value that optimizes the ROC
power, for example. This approach is not very practical, however: since
there is no analytic expression relating these two quantities, one has to
iterate through the whole process many times until the optimum value for τ
is found, which is computationally very demanding. More importantly, in
exploratory analyses one may not have the ground truth in the first place.
Another option is to estimate the threshold from the data as well. We use
the following heuristic. First observe that, due to the bilinear nature of the
optimization functional of Eq. (3.11) with respect to (A, S), we may exchange
scale factors between these two quantities without the energy, E, being
changed, provided that we rescale the other quantity accordingly. We
exploit this indeterminacy by making the mixing vectors, al, unit vectors,
el(θl), rescaling their 2–norms to unity, ‖al‖2 = 1. In effect, we let al evolve
on the surface of a unit hypersphere, SD−1, with only their direction angles,
θl, as free parameters. Li et al. use a similar constraint in [111]. We also set
λA equal to 1. (Since the purpose of the penalty on A (third term of
Eq. (3.11)) is to keep it from growing too big, and since we have already
rescaled al to unit norm, this does not really affect the solution.) The above
makes E[· · · ; λC, λA] a function of the Lagrange multiplier λC only.
Now, the obvious but important observation is that the actual threshold
value should depend on the relative magnitude of the wavelet coefficients
to be thresholded, with respect to the ones to be retained. Recalling the
empirical definition of sparsity given in Section 3.2.2 and Footnote 13, that
“most” coefficients should be “small” and only “a few of them” should be
“significantly different than zero” (the former being the ones that should be
filtered out and the latter the ones to be retained), one should give a concrete
formula implementing this concept. For an ℓ1 norm prior, an estimate of the
‘dispersion’, δcl, of the wavelet coefficients of the l–th source, cl,λ, around
their (zero) mean is
    δcl = (1/Λ) ∑_{λ=1}^{Λ} c²_{l,λ} ,    (3.23)
based on a first estimate of the values53 of cl,λ. In doing this, we implicitly
assumed that the cl,λ are equivalently drawn from a Laplace distribution with
scale parameter 1/√δcl
(Zibulevsky et al., [176]).
We also make an estimate of the observation noise level, lν, as the variance
of a white Gaussian process, using a technique inspired by the ‘universal
thresholding’ technique of Donoho and Johnstone [53]: we decompose the
observations, X, in wavelet space and assume that the finest-scale
coefficients, {x_{t,λJ}}, λJ = 1, . . . , ΛJ, are due to noise. A robust estimate
of the square root of the noise level is given by the median absolute
deviation (MAD) of the wavelet coefficients at the finest scale (Donoho, [54]),

    σν,MAD = K mad(x_{λJ}) ,    (3.24)

53 This estimate can be obtained from a previous decomposition, for example. For the class of signals we are interested in in this application, a first ICA decomposition may give a very good starting point. Note that we are not so much interested in a precise estimate of the actual values of the wavelet coefficients of the sources here, but rather in getting an estimate of the ‘summary statistic’, here the scale parameter, δcl.
where K = 1/Φ−1(3/4) ≈ 1.4826 and Φ−1(·) is the inverse of the cumulative
distribution function for the standard normal distribution. The median ab-
solute deviation (MAD) robust statistic, a measure of dispersion, is defined
as
mad(xi) = mediani(|xi −mediani(xi)|) . (3.25)
(Another option could be to use a technique proposed by Beckmann and
Smith for their probabilistic ICA algorithm [15], and consider the smallest
eigenvalues of the data matrix as corresponding to noise.) The above estimate
is used for the first ‘epoch’54 of training. A better estimate is obtained in
the second phase of the algorithm by employing the empirical statistic of
Eq. (3.8), as

    σ²ν = (1 / ((T − L)N)) Tr[(X − AS)ᵀ(X − AS)] ,    (3.26)
using the estimates of (A,S) from the previous phase, where the denominator
is the ‘effective number’ of parameters [147].
Using the above, we finally estimate the threshold, τl, using the equation55

    τl = 2 lν / √δcl .    (3.27)
A schematic representation of the above steps is shown in Fig. 3.41.
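The estimates of Eqs. (3.23), (3.24), and (3.27) can be collected into a short sketch. The Python below is our own illustration (the actual implementation was in Matlab/Octave); it assumes the reading of Eq. (3.27) in which lν is the noise *variance*, i.e. the square of the MAD-based estimate of the noise standard deviation, and in which the threshold scales inversely with √δcl:

```python
import numpy as np

K = 1.4826  # K = 1 / Phi^{-1}(3/4), Eq. (3.24)

def mad(x):
    """Median absolute deviation, Eq. (3.25): median_i(|x_i - median_i(x_i)|)."""
    return np.median(np.abs(x - np.median(x)))

def dispersion(c_l):
    """Dispersion delta_cl of the l-th source's wavelet coefficients
    around their (zero) mean, Eq. (3.23)."""
    return float(np.mean(np.asarray(c_l) ** 2))

def noise_level(finest_scale_coeffs):
    """Noise level l_nu (a variance): square of the robust MAD estimate
    of the noise standard deviation, Eq. (3.24)."""
    return (K * mad(finest_scale_coeffs)) ** 2

def threshold(c_l, finest_scale_coeffs):
    """tau_l = 2 * l_nu / sqrt(delta_cl), under our reading of Eq. (3.27)."""
    return 2.0 * noise_level(finest_scale_coeffs) / np.sqrt(dispersion(c_l))
```

In the two-epoch scheme of Fig. 3.41, `finest_scale_coeffs` would come from the wavelet transform of X in the first epoch, while in the second epoch the noise level would instead come from the residual-based estimate of Eq. (3.26).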
The above procedure works quite well in practice. An alternative proce-
54 An epoch is defined here as a batch of iterations during which the control parameters, such as the threshold, τ, are held constant and only the unknown signals and model parameters are updated.
55 This equation corresponds to viewing the terms of the optimization functional, Eq. (3.11), in terms of the equivalent stochastic quantities described above, making use of the χ² interpretation of Eq. (3.8) for the observations, with Σ = σ²X ID, and multiplying through by 2σ²X. Note that the value of the optimizer does not change if we do this, since argminx cE(x) = argminx E(x), ∀c ∈ ℝ⋆. The above results in λcl = 2σ²X wl, with wl := 1/√δcl.
Figure 3.41: Schematic representation of the Iterative Thresholding algorithm for blind signal separation as implemented in our system. The red box represents the IT algorithm proper (Alg. 2), parameterized by the threshold, τ. Arrows represent flow of information between the various variables in the model. During the initialization step, i = 0, the data, X, along with an initial value of S are used in order to obtain a first estimate of the threshold using the procedure described in the text (dashed arrow and right-hand side blue and green arrows, respectively). The algorithm is run for a first epoch using this estimate, after which we obtain a new estimate of (S, A). This is used to obtain a better estimate of the threshold, which we use for the final epoch run.
dure could be one that finds a “universal” (i.e. typical) scale parameter δcl
for this class of problems, as was done in [150], for example. In Chapter 4,
however, we propose a more principled, hierarchical Bayesian approach to
estimating the threshold. In particular, we propose to extend the determin-
istic IT method and use a variational approach for blind inverse problems in
a probabilistic, generative modelling framework. This will allow us to learn
under uncertainty and estimate important parameters of the model from the
data itself. We take the approach of mean-field variational bounding to the
likelihood and derive a variational, EM-type algorithm for MAP estimation.
Under this formulation, a “soft” version of iterative thresholding emerges as
a module of an extended set of update equations that include estimators for
the controlling ‘hyperparameters’ of the model.
Chapter 4
Learning to Solve Sparse Blind
Inverse Problems using
Bayesian Iterative Thresholding
4.1 Introduction
In the previous Chapter, we formulated the optimization functional for
SCA, Eq. (3.11), from an energy minimization viewpoint (Olshausen and
Field, [137]). This objective function explicitly optimizes for maximal spar-
sity, capturing the physical constraint of minimum entropy decomposition
(Foldiak, [67]; Zemel, [175]; Olshausen and Field, [137]; Harpur and Prager,
[80]; Field, [64]; Barlow, [10]; Barlow et al., [13]; Atick and Redlich, [4], [5]).
The variational functional, E , is parameterized by two Lagrange multipliers,
λC and λA, for the two penalty terms corresponding to the expansion coef-
ficients of the sources and to the mixing matrix, respectively. Setting these
correctly can be quite tricky, often leading one to resort to ad-hoc methods.
Another delicate issue, directly related to the above, is the choice of the
threshold value, τ , in the iterative thresholding algorithm, used during the
optimization of the objective.
In order to address these issues, one source of inspiration might be anal-
ogous ideas developed in the field of Bayesian statistics. An example of this
methodological approach is the interpretation of ‘early stopping’ in neural
networks training as Bayesian regularization by MacKay [117]. Olshausen
and Field [138], Lewicki and Sejnowski [109], Girolami [76], Li, Cichocki,
and Amari [111], and Teh et al. [163] also propose sparse and overcomplete
representations in a probabilistic framework stemming from the pioneering
work of Olshausen and Field [137]. The setting of blind inverse problems is a
natural candidate for the application of the Bayesian probabilistic methodol-
ogy, allowing us to incorporate ‘soft’ constraints in a natural manner, through
Bayesian priors (Tarantola, [162]; Idier, [92]; Calvetti and Somersalo, [25]).
In the Bayesian framework, one can flexibly encode weak assumptions about
the object of interest, such as smoothness, positivity, etc. (Sivia, [156]). Li
and Speed [110] note that the idea of a ‘well-defined statistical model’ is the
statistical counterpart of the notion of well-posedness from a probabilistic
perspective. A well-defined statistical model should have properties such as
identifiability of the unknowns, existence of an unbiased estimate, and stabil-
ity of the estimates in the statistical sense, e.g. variance, robustness, etc. By
putting probabilistic constraints on the unknowns, the inherent ill-posedness
of many inverse problems can be overcome. As an example of a parallel
between the classical and statistical view of regularization, the equivalent of
Tikhonov regularization is the method of ridge regression in statistics. In
general, any variational regularization scheme of the form of Eq. (3.1) is for-
mally equivalent to Bayesian inversion (Borchers, [20]), with the second term
corresponding to a prior distribution on the unknowns. The interpretation
of the solutions, however, is different. Instead of a single “best estimate”
of the classical solution (the minimizer, x⋆), the Bayesian approach infers a
probability distribution over all possible solutions, x, given the data, i.e.
a ‘posterior’, P (x|y). This is the Bayesian solution to the statistical inverse
problem [25]. Only then may one choose the most probable solution under
the model, for example, as the “representative” of the ensemble of all possible
solutions. Note that in this text we maintain the posterior distribution over
the latent variables throughout inference, however. The Bayesian approach
to inverse problems is elaborated in section 4.2.
Other related work
In the statistics community, several authors have proposed statistical ver-
sions of classical deterministic regularization methods. O’Sullivan [60] gives
a statistical perspective on ill-posed inverse problems and suggests the use of
the mean-square error as a tool for assessing the performance characteristics
of an inversion algorithm. He bases his method on extensions of the Backus-
Gilbert averaging kernel method, used in geophysics. He also proposes some
experimental design criteria based on these ideas. Fitzpatrick [65] gives a
Bayesian-analysis perspective on inverse problems and relates ‘Bayesian
maximum-likelihood’ estimation to Tikhonov regularisation. He applies
the expectation-maximisation algorithm to the problem of setting
regularisation levels, and develops a framework for infinite-dimensional Bayesian
analysis on Banach spaces. Li and Speed [110] give a statistical counterpart
of the notion of well-posedness in the idea of a well-defined statistical model
(which should possess identifiability of the unknowns, existence of an un-
biased estimate, or something close to this, and reasonable stability of the
estimates in the statistical sense, i.e. variance, robustness, etc.), and develop
a deconvolution method for positive spikes, the synchronisation-minimisation
algorithm, based on the Kullback-Leibler divergence. Calvetti and Somersalo
[25] especially emphasize the relationship between inverse problems and sta-
tistical inference under the Bayesian framework. This is actually a two-way
interaction, in that interesting inverse problems pose significant computa-
tional challenges, leading to novel Bayesian analysis, on the one hand, and
Bayesian analysis provide “priorconditoners” for stabilizing the ill-posedness
of inverse problems, on the other. Finally, Ghahramani studied the solution
of inverse kinematic problems using the EM algorithm [72].
Proposed Method
In this Chapter we revisit sparse blind source separation and cast it into
a stochastic framework, in which the unknown source signals are modelled
as realizations of stochastic processes. We view blind inverse problems
from a statistical data modeling perspective (Attias, [6]), and the search for
their solution as a statistical inference problem (Abramovich and Silverman,
[1]). We adopt the Bayesian framework (Sivia, [156]) as the foundation for
modelling and inference and for characterizing uncertainty in the parameters
of the model.
The wavelet-based SCA model presented in the previous Chapter can also
be interpreted within the probabilistic framework. Based on the functional E, we present here the derivation of the corresponding probabilistic model. This
will enable us to build a hierarchical model and employ Bayesian inference in
order to estimate the above mentioned and other model parameters from the
data itself, instead of setting them by hand1. It will also allow us to estimate
error bounds and assess the uncertainty in the model. While the model of
the previous Chapter provided only point estimates of the unknown signals,
the Bayesian model proposed in this Chapter incorporates uncertainty in the
data and the model into the estimation algorithm.
1Other approaches include the L–curve method, the discrepancy principle, and cross-validation.
We propose a generative probabilistic model for blind inverse problems
under sparsity constraints based on a variational lower-bound maximization
approach. While our method is more generally applicable, we discuss in
detail the particular inverse problems of
• Separating signals and datasets into their constituent components. This
is the multivariate inverse problem of blind source separation, also
known as the ‘cocktail party’ problem.
• Extraction of weak biosignals in noise, with applications to biosignals
such as fMRI and MEG. Our main interest/operational application of
this work is especially in the exploratory analysis of neuroscientific data
by decomposing functional MRI images into components.
The method, however, is not tied to source separation and can be applied to
a variety of linear blind inverse problems under sparsity constraints.
For the inversion/solution of the above problems we propose the method
of Bayesian Iterative Thresholding (BIT). This algorithm exploits the spar-
sity of latent signals to perform blind source separation and signal reconstruc-
tion. The key to our approach is the development of a generative probabilistic
model that captures these properties using Bayesian priors, and which will
be solved using Bayesian inversion ideas. The algorithm will be derived from
lower-bounding the data likelihood, and will result in a variational EM-type
algorithm that alternates between the estimation of the sources and the mix-
ing matrix.
4.2 Bayesian Inversion
Computation of inversion in both the traditional and Bayesian approaches2
usually leads to solving a system of equations. For example, in the deconvo-
lution problem, discretizing the Fredholm integral equation of the first kind
(which is the mathematical model describing the physical process) leads to a
system of linear equations, which we want to invert in order to estimate the
object of interest. In this text we consider the linear measurement model
y = Ax + ε , (4.1)
where the ‘forward’ linear operator A maps the unknowns3, x, to the obser-
vations, y. The observation operator links the latent and observation spaces,
X and Y , respectively. In the Bayesian approach to inverse problems, we
model the signals x and y as random variables. Note that in the two
interpretations of the model, deterministic and probabilistic, the interpretation
of the term ε is not the same conceptually: in the Bayesian approach ε is
considered additive random noise, while in the classical method ε is an “er-
ror” term that represents some kind of distance between the data, y, and
the image under A of the “true”, fixed value of the unknown x, y = Ax. In
the Bayesian point of view, randomness reflects our incomplete information
about the system and the measurements.
The Bayesian way of encoding our knowledge or expectations about the
random quantities x and y is by assigning probability distributions to them.
In particular, we specify the prior model, p(x|θ;Θ), and the forward/noise
2 As a point of nomenclature, we note that some authors (see e.g. Carreira-Perpiñán [29]) note that the terms ‘inverse problem’ and ‘mapping inversion’ are not exactly equivalent in classical regularization. Under the Bayesian inference framework, however, the difference becomes less important, since both the latent variables and the parameters of the mapping take priors and are treated on an equal footing as part of the estimation.
3 The unknowns are usually referred to as “the model” in inverse problems theory and denoted by m.
model p(y|x, θ;Θ), where θ are the model parameters (including the linear
operator A) and Θ are the so-called hyperparameters, higher-level parame-
ters which control the distributions of other parameters. The prior model
captures our degree of belief about a particular value for the unknowns being
“correct” based only on our prior knowledge about the system. The forward
model represents the likelihood, that is the probability of observing y given
a realization of the random unknowns and parameters, p(y|x, θ;Θ). It cap-
tures our belief regarding the degree of accuracy with which the model can
explain the data (“If we knew the unknowns and parameters exactly, what
would the distribution of the data be?”). Given the measurement model, it
is a measure of the “data misfit”. Under Eq. (4.1), it is functionally equal to
pε(Ax− y).
The likelihood describes the measurement part of the model. The prior
contains all information and prior knowledge we have about the unknowns.
Bayesian statistics provide a simple yet extremely powerful way to combine
these two pieces of information together, the Bayes’ rule:
    p(x|y, θ; Θ) = p(x|θ; Θ) p(y|x, θ; Θ) / p(y|θ; Θ) .    (4.2)
In words, the posterior probability of the unknowns, after having measured
the data, is computed by “modulating” our prior model by the likelihood, and
scaled by the denominator. Bayes’ rule is essentially a recipe for updating our
knowledge about a system by observing some measurables, linked together
by a mathematical model. It can also be seen as the mathematical process
of “reversing” the conditional probability of the data given the parameters,
to give the conditional probability of the parameters given the data. The
quantity in the denominator, the evidence for the data, apart from being a
normalization factor ensuring that the left-hand side is a proper probability
density, reflects the fact that, by computing it, we have in effect “integrated
Figure 4.1: The Bayesian solution to an inverse problem is a posterior distri-bution. Left: the classical solution corresponds to Pr(x|d) = δ(x−x⋆); Right:Bayesian solution, Pr(x|d), where d is the data. The Bayesian solution takesinto account uncertainties in the model and the measurements.
out” the randomness in the unknowns, x:
    p(y|θ; Θ) = ∫ dx p(y|x, θ; Θ) p(x|θ; Θ) .    (4.3)
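As a small numerical illustration of Eqs. (4.2)–(4.3) (our own toy example, not part of the model treated in this thesis), consider the scalar linear-Gaussian case, where the posterior is available in closed form and can be checked against a direct grid evaluation of Bayes’ rule:

```python
import numpy as np

def posterior_closed_form(y, a=2.0, s0=1.0, sn=0.5):
    """Posterior mean and variance for x ~ N(0, s0^2), y = a*x + eps,
    eps ~ N(0, sn^2): a conjugate (Gaussian) special case of Eq. (4.2)."""
    var = 1.0 / (1.0 / s0**2 + a**2 / sn**2)
    mean = var * a * y / sn**2
    return mean, var

def posterior_on_grid(y, a=2.0, s0=1.0, sn=0.5, n=20001, lim=8.0):
    """Bayes' rule evaluated numerically: prior times likelihood, divided
    by the evidence, which 'integrates out' x as in Eq. (4.3)."""
    x = np.linspace(-lim, lim, n)
    dx = x[1] - x[0]
    prior = np.exp(-0.5 * (x / s0) ** 2)
    likelihood = np.exp(-0.5 * ((y - a * x) / sn) ** 2)
    evidence = np.sum(prior * likelihood) * dx   # Eq. (4.3), up to shared constants
    posterior = prior * likelihood / evidence    # Eq. (4.2)
    mean = np.sum(x * posterior) * dx
    return posterior, mean
```

Note that the Gaussian normalizing constants cancel between numerator and denominator, so they are omitted; the evidence computed on the grid does the normalization.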
We now summarize the main mathematical objects of classical and Bayesian
inversion in table 4.1.

    Approach             Given                                      Find
    Classical inversion  y, ID(x; y, θ) and IP(x, θ)                x⋆
    Bayesian inversion   y, P(y|x, θ, Θ), P(x|θ, Θ) and P(θ|Θ)      P(x|y)

Table 4.1: Summary of the main mathematical objects of classical and Bayesian inversion.

In this table, ID(x; y, θ) and IP(x, θ) in the classical approach are the
distortion measure and penalty, respectively, which may
also contain parameters, θ. To reiterate, the Bayesian solution to the sta-
tistical inverse problem is the posterior density. A schematic interpretation
of the posterior as a distribution with width that is a function of the noise
precision and the prior width is shown in Fig. 4.1. The exact functional form
of the posterior for the problem treated here will be derived later in this text.
4.3 Sparse Component Analysis Generative
Model
4.3.1 Graphical Models
We now introduce a graphical modelling formalism for generative models [7].
This formalism allows the modular specification of probabilistic models in
an intuitive, visual way. The probabilistic relationships among the various
elements of the model can be represented as a directed acyclic graph (DAG).
In particular, this graph represents the structural relationships among the
observed data, y (gray node), latent variables, x, and model parameters, θ,
as well as the controlling hyperparameters4 of their distributions, Θ.
Graphs such as the above can be endowed with probabilistic semantics
such that they reflect the probabilistic relationships in a generative model in a
one-to-one fashion. This is done as follows: each directed arc denotes a
probabilistic dependence relation between the nodes at its tail and head, named
‘parent’ and ‘child’, respectively. For example x −→ y means “y depends
on x”. This dependence relation is encoded in the conditional probability
p(y|x, · · · ), where ‘· · · ’ here denotes all other parent nodes of y. Combining
all these local relations, for every variable involved, results in a complete
graphical representation of a probabilistic model. We note here that there are
three fundamental structures/building blocks in such a DAG, which follow
from, and are consistent with, the above locality property: convergent, di-
vergent, and chain5. Of these three, we will mostly need the convergent and
the chain structures for the model discussed in this Chapter.
The DAG representation for a general latent variable model is shown in
4 From the point of view of hierarchical Bayesian modelling, the difference between a node in θ and one in Θ is that the former is assigned a prior probability density. Ultimately, it is the modeller’s choice which variables belong to which set.
5See ISPRS Notes and Appendix 2.1.
Figure 4.2: A directed acyclic graph (DAG) representation for generative latent variable models, showing the structural relationships among the observed variables, y, latent variables, x, model parameters, θ, and hyperparameters, Θ.
Fig. 4.2. We will denote the ‘inner graph’, comprising the nodes x, y, θ,
by G, and the augmented graph G ∪ Θ, containing the ‘root’6, Θ, as well,
by Ḡ. Such probabilistic graphical models can be learnt via the variational
4.3.2 Construction of a Hierarchical Graphical Model
for Blind Inverse Problems under Sparsity Con-
straints
In this section we show how the regularization approach described in the
previous Chapter can be reformulated within a Bayesian probabilistic frame-
work. Conceptually, there are four steps in the derivation of the Bayesian
iterative thresholding approach:
1. Interpret the energy functional E as the cost function of a probabilistic
model,
2. Introduce a latent random variable, S, for the unknown signals,
3. Lower bound the likelihood of data and model parameters,
6A root node is a node without any parents.
4. Derive a learning algorithm that maximizes this bound, iteratively in-
ferring the unknown signals and learning the model parameters.
Based on the energy functional, E , we introduce our Bayesian generative
model for blind inverse problems under sparsity constraints as a latent vari-
able model here. This requires the interpretation of constraints as prior
probability distributions. The corresponding learning algorithm will be de-
rived from a lower-bound maximization approach7, by bounding the data
likelihood from below. The estimation equations for each of the variables in
the model and the computation of the lower bound itself will be presented
in the next section.
The key to our approach is the specification of a hierarchical latent vari-
able model for blind inverse problems, such as blind source separation. We
start from the generic form of the standard linear data decomposition of a
dataset, {xd}, d = 1, . . . , D, into a set of latent signals, {sl}, l = 1, . . . , L,
which is written in matrix form as

    X ≈ AS ,
while the graph corresponding to this problem, in its generative form, is
shown in Fig. 4.3a and is denoted by G ≡ G0. Focusing only on the ‘complete
data’, (X, S), the initial graph is S −→ X. In order to make this model well-
posed, appropriate constraints should now be imposed. The initial component
analysis graphical model will undergo a series of structural transformations
(Buntine, [22]) that will produce the final hierarchical SCA model. The
changes to the standard component analysis model are best described using
the graphical formalism introduced in subsection 4.3.1. These are shown in
Fig. 4.3b–e and will be discussed next.
7 Olshausen also uses a lower bound formulation in [136], by introducing a variational posterior, Q(ai), over the coefficients of the sparse representation, ai (i.e. the corresponding “sources”). However, he does not consider the ai as latent variables and proposes a gradient optimization approach instead. Moreover, his model is not hierarchical, but
Figure 4.3: Structural transformations of the standard component analysis graphical model, G0, shown in (a). (We only focus on the essential subgraph that is relevant to the discussion here and, in a sense, forms the “spine” of the model.) If a relation between two variables is deterministic, we group the couple within a dotted box and denote the node at the tip of the corresponding edge, i.e. the ‘child’ node, with a double circle. That node can be deleted from the graph, giving the equivalent graphs in the second row. This transformation is denoted by a dotted red arrow. Adding nodes to a graph is denoted by a continuous-line red arrow: operation (1) adds a wavelet node, C; operation (3) adds a source noise covariance node, v.
In terms of generative modelling, the connection between smoothness
spaces and wavelets, introduced in subsection 3.2.2, suggests that one can
use a ‘parametric free-form’ solution (Sivia, [156]) based on wavelets in order
to synthesize the unknown signals8. In particular, the foundation of our
approach is the expansion of the latent signals in a wavelet basis, Φ, as
\[ S \doteq \bar{S} \stackrel{\text{def}}{=} \big( \Phi C^{\mathsf T} \big)^{\mathsf T} . \tag{4.4} \]
This step introduces the structural constraint of equation (3.10) in the graph
G. As in the deterministic method, this captures the constraints of spar-
sity and (generalized) smoothness. The wavelet coefficient matrix, C, is a
parameter in θ.
In terms of the graphical model of Fig. 4.3, the above substitution is
equivalent to replacing node S by the pair of nodes (C, S̄) via operation (1),
where the two nodes are deterministically related by equation (4.4). After
this expansion, the graph becomes⁹ θ −→ X, i.e. C −→ X, and is shown in
Fig. 4.3b or, equivalently, in Fig. 4.3c.
At this stage, the optimization functional related to the graph of Fig. 4.3b,
when combined with the corresponding penalty term of Eq. (3.9) on C and
the growth-limiting term on A, corresponds exactly to the energy function
E introduced in equation (3.11).
Bayesian Interpretation
Under a probabilistic framework, the functional E can be equivalently in-
terpreted as a maximum a-posteriori (MAP) cost function. Following a line
of reasoning similar to that of Harpur and Prager [80] and Olshausen [136],
two-layer, and he uses a maximum likelihood framework.
⁸ This is also reminiscent of the ‘functional data analysis’ idea [144].
⁹ Again note that θ = {C, A}, but we mainly emphasize C here.
from a generative point of view one can interpret the error term, EX, in the
above energy formulation as an implicit observation noise.
Data model and likelihood. In particular, the data is modelled as a
linear mixture of underlying sources with additive random sensor noise, E_X,
via the observation equation
\[ X = A S + E_X . \tag{4.5} \]
The sensor noise is assumed to be a matrix-variate normal (MVN) with
zero mean and row covariance matrix ΣX: EX ∼ N (0, (ΣX, IN)). Based
on the above, the likelihood for parameters and sources under the model
(i.e. the probability of the data conditioned on the sources, given the model
parameters) is
\[ P(X \mid S, A, \Sigma_X) = \frac{1}{\det(2\pi \Sigma_X)^{N/2}} \exp\!\Big\{ -\tfrac{1}{2} \operatorname{tr}\!\big[ (X - AS)^{\mathsf T} \Sigma_X^{-1} (X - AS) \big] \Big\} . \tag{4.6} \]
The ‘fidelity’ term of the energy-based method, ‖X − AS‖², then becomes
the negative log-likelihood function of the observations given S and A, up to
a prefactor 1/(2σ_X²), where Σ_X = σ_X² I_D is the noise covariance. The
correspondence is
\[ P(X \mid S, A, \Sigma_X) = \frac{1}{Z_X}\, e^{-E_X} , \]
where
\[ E_X = \frac{1}{2\sigma_X^2} \| X - AS \|^2 \qquad \text{and} \qquad Z_X = \big( 2\pi\sigma_X^2 \big)^{ND/2} . \]
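As a quick numerical check of this correspondence (a sketch with toy sizes D, L, N and an isotropic noise level of our choosing; not part of the thesis), the column-wise Gaussian negative log-likelihood of Eq. (4.6) can be compared against E_X + log Z_X:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 3, 2, 50                      # sensors, sources, samples (toy sizes)
A = rng.standard_normal((D, L))
S = rng.standard_normal((L, N))
sigma2 = 0.1                            # Sigma_X = sigma2 * I_D (isotropic)
X = A @ S + np.sqrt(sigma2) * rng.standard_normal((D, N))

# Negative log-likelihood from the general form of Eq. (4.6), column by column
Sigma = sigma2 * np.eye(D)
Sigma_inv = np.linalg.inv(Sigma)
R = X - A @ S                           # residual matrix
neg_ll = 0.0
for n in range(N):
    r = R[:, n]
    neg_ll += 0.5 * r @ Sigma_inv @ r + 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))

# Energy-based correspondence: -log P = E_X + log Z_X
E_X = np.sum(R**2) / (2 * sigma2)
log_Z_X = (N * D / 2) * np.log(2 * np.pi * sigma2)
assert np.isclose(neg_ll, E_X + log_Z_X)
```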
Priors. Under the Bayesian framework, both parameters C and A will be
assigned a prior distribution. In anticipation of the generative formulation,
the penalty terms are written as log–priors in the MAP objective. Using the
expansion of equation (4.4), the optimization problem can then be written
as [176]
\[ \max_{(C,A)} \; \mathcal{J}(C, A) , \tag{4.7} \]
with
\[ \mathcal{J} \stackrel{\text{def}}{=} \log P(C, A \mid X) = -\frac{1}{2\sigma_X^2} \Big\| X - A \big( \Phi C^{\mathsf T} \big)^{\mathsf T} \Big\|^2 + \sum_{l,\lambda} \log\!\Big( e^{-\beta_l S_D(c_{l,\lambda})} \Big) + \log\!\Big( e^{-\frac{1}{2} a \|A\|^2} \Big) + \text{const.} , \]
where S_D(c_{l,λ}) is the sparsity function on c_{l,λ} from equation (3.9), with
c_{l,λ} = (Φ⁻¹s_l)_λ, and the third term is a quadratic prior on the mixing matrix
as per equation (3.11). In the second and third terms, we made explicit the
as per equation (3.11). In the second and third terms, we made explicit the
dependence on the width (‘scale’) hyperparameters, βl and a, respectively.
The last term is a constant coming from log–normalizing factors that do not
depend on either C or A, and can be dropped from the equation. We finally
note that the ‘hyperparameters’ βl and a of the source and mixing models are
implicitly incorporated in the Lagrange multipliers λC and λA if we define
the correspondences¹⁰ 2σ_X²β_l → λ_C and σ_X²a → λ_A. As discussed in
subsection 3.2.2, these hyperparameters have a profound impact on the value of
the optimization functional. Therefore, here we propose a more principled
method in order to estimate them directly from the data.
The last form of our optimization functional formally coincides with the
sparse optimization problem of Zibulevsky and Pearlmutter [176], Eq. (3.9),
p. 869. Zibulevsky and Pearlmutter formulated the same probabilistic model
using an analogous reasoning. However, the optimization there was per-
formed directly on the J (C,A)–surface, under the constraint ‖al‖ ≤ 1,
l = 1, . . . , L, using gradient descent. Our solution will be different, lead-
ing to a Bayesian extension of the iterative thresholding algorithm for linear
¹⁰ The quantity σ_X/√β is often called the ‘Tikhonov factor’.
inverse problems of Daubechies et al. [45], as will be explained next.
Learning. In MAP estimation one wants to maximize the joint probability
of data and parameters¹¹,
\[ f(\theta) \stackrel{\text{def}}{=} p(X, \theta) . \tag{4.8} \]
One could now optimize this objective directly with respect to the vector
of parameters, θ, via a gradient-based algorithm, for example, to get the
optimum value
\[ \theta_{\text{MAP}} = \arg\max_{\theta} \big\{ \log p(X \mid \theta) + \log p(\theta) \big\} . \]
Since C is a parameter, one can obtain an estimate of the wavelet coefficients of
the sources this way. The sources can then be estimated via Eq. (4.4). This
is the approach followed by Zibulevsky12 [176]. Here, however, we adopt a
Bayesian approach by marginalizing out the unknown source signals. That is,
we consider the unknown signals to be random latent variables. This requires
the specification of a new, generative model for SCA, discussed next.
Hierarchical SCA latent variable model
A ‘hidden’ variable, h (where h is the source node, S, in this case), can
be introduced in the graph G, as shown in operation (3) of Fig. 4.3. The
deterministic node, S̄, introduced previously, becomes a parent of S and,
as will be shown later¹³, in the Bayesian framework, represents a kind of
mean “message” sent to S by S̄. The node S̄ can be deleted from the final
graph, Fig. 4.3e, but can be retained in the equations, if needed. (The latent
¹¹ The quantity f is equivalent to L for the data likelihood. It is going to be replaced by the joint probability over data, latent variables, and parameters, {X, h, θ}, next.
¹² Zibulevsky et al. [177] also propose the use of a uniform prior on the mixing matrix, A, so the corresponding log-prior can be dropped.
¹³ See equations (4.24) and (4.25).
variable S also depends on a covariance node, v, which is also introduced in
G as a parent of S. The reasoning behind its inclusion will be discussed in
subsection 4.3.3.)
The random variable h is assigned a probability density, p(h). The joint
probability of data and parameters from equation (4.8) can now be rewritten
as
\[ f = \int_{h} p(\underbrace{X, h, \theta}_{G})\, dh , \]
integrating out the latent variables (Minka, [127]). It is worth noting that
here we have a density in the form of a mixture of models, p(X, θ|h), weighted
by the probability of h. This way, the uncertainty in h is “folded into the
estimation process”. This is a natural formulation for component analysis
models, where the data are “explained” by a set of hidden variables that
form an “abstraction” of the data according to some desired criterion, such
as minimum variance, dependence, or sparsity; cf. the graph of Fig. 4.3a.
This gives the likelihood for the observations, L(θ), in the form
\[ p(X \mid \theta) = \int dS\; p(X \mid S, \theta)\, p(S \mid \theta) , \]
where the likelihood for parameters and sources is given in Eq. (4.6) for the
SCA model. The joint distribution of data and model parameters is then
\[ P(X, \theta) = \int dS\; P(X, S, \theta) = \int dS\; P(S \mid \theta)\, P(X \mid S, \theta)\, P(\theta) , \]
where, again, we have integrated over all possible values of the latent variable
S. The factorization of the joint probability of {X, S, θ} inside the last
integral follows from the Markov properties of the graphical model, G, shown
in Fig. 4.2.
The final DAG for the latent-variable sparse component analysis model is
shown in Fig. 4.3d, or, equivalently, in Fig. 4.3e. This figure shows wavelet-
based SCA as a hierarchical graphical model. The “backbone” of this model
is the top-down Markov chain C −→ S −→ X, capturing the generative pro-
cess from the wavelet coefficients to the sources to the observations. In the
Bayesian graphical modelling framework, the whole subgraph that ends in a
node is actually the prior for that node. For the SCA model this means that
we regard the unknown source signals, S, as being generated by a wavelet
synthesis procedure, whose building blocks are elements ϕλ ∈ Φ, and this
procedure is part of its prior.
We can now summarize the generative process for the data as follows.
A set of basis functions, Φ = {φ_λ}_{λ=1}^{Λ}, is used to synthesize L unobserved
source signals, {s_l}_{l=1}^{L}, with corresponding wavelet coefficients {c_l}_{l=1}^{L}, where
c_l = (c_{l,1}, . . . , c_{l,Λ}), as
\[ S \approx \big( \Phi C^{\mathsf T} \big)^{\mathsf T} . \]
These sources in turn are mixed through a mixing matrix A = [A_{dl}] to
generate the observation signals {x_d}_{d=1}^{D}, with x_d = (x_{d,1}, . . . , x_{d,N}):
\[ X \approx A S . \]
In signal space, the mappings among the various signals are¹⁴
\[ \Phi \xrightarrow{\;C\;} S \xrightarrow{\;A\;} X . \]
Note that, due to the presence of noise, these are stochastic dependencies (At-
tias, [6], [7]); see next. The collection θ = {C, A} comprises the parameter
set of our model and will be learned from the data. We call the corresponding
model dictionary-based sparse component analysis (Db-SCA).
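A minimal sketch of this generative process (toy sizes, and an orthonormal random matrix standing in for the wavelet dictionary Φ; all names and values here are illustrative, not the thesis's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Lam, L, D = 16, 16, 2, 3   # signal length, atoms, sources, sensors (toy)

# Orthonormal stand-in for the wavelet synthesis matrix Phi (columns = atoms)
Phi, _ = np.linalg.qr(rng.standard_normal((N, Lam)))

# Sparse coefficient matrix C (L x Lam): only a few non-zero entries
C = np.zeros((L, Lam))
C[0, 1], C[1, 4] = 3.0, -2.0

v, sigma2 = 0.01, 0.05        # source ('system') and sensor noise variances

S = (Phi @ C.T).T + np.sqrt(v) * rng.standard_normal((L, N))   # sources
A = rng.standard_normal((D, L))                                # mixing matrix
X = A @ S + np.sqrt(sigma2) * rng.standard_normal((D, N))      # observations
```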
¹⁴ All signals are in ℝᴺ, where N is the number of data points, but the wavelet bases φ_λ have compact support.
The DAG, however, only shows the structural relationships among vari-
ables in the model. The model specification will be completed and made
more precise in subsection 4.3.3, by the introduction of specific probability
distributions for the various random variables in the model. Probabilistic
(“soft”) constraints are encoded in the form of prior distributions on the
various random variables in the model, capturing our domain knowledge or
modelling decisions. A central one for this model is that we aim at modelling
and analyzing signals that are sparse in some appropriate basis, or frame.
This will be captured by using an appropriate form for the prior on the
expansion coefficients of the latent signals, p(cl,λ|αl, βl), in subsection 4.3.3.
4.3.3 Latent Signal Model: Wavelet-based Functional
Prior
A signal model in the probabilistic framework is composed of a prior density,
combined with any functional relationships among the variables. A key point
of our approach is that the sources are regarded as latent random variables,
depending on the wavelet bases in a non-deterministic way.
In the probabilistic model we will let the “states”, S, (adopting Field’s ter-
minology) be noisy by introducing a state noise variable, ES, on the sources
and model the sources as
\[ S = \big( \Phi C^{\mathsf T} \big)^{\mathsf T} + E_S , \tag{4.9} \]
where Φ is an N×Λ matrix that stores the basis functions, φ_λ, in its columns.
The relation between C and S, which was deterministic in Eq. (4.4) and in
Zibulevsky et al. [176], becomes stochastic here. This formulation serves
several purposes:
• It models ‘system’ noise that may be present in inherently noisy source
signals, such as biosignals and brain processes15. In particular it may
be valuable in the separation/extraction of ‘activations’ investigated by
MEG, EEG or fMRI, for example.
• It models approximation errors and residuals from the wavelet synthesis
model; this is especially useful for the case of transforms that do not
offer perfect reconstruction or in truncated wavelet transforms.
• The use of the functional prior with Gaussian source noise simplifies
inference as well. In particular, the use of a rich dictionary of functions,
such as a wavelet dictionary, enables us to transfer the complexity of
modelling the sources in the upper layer of the network of Fig. 3.10.
Moreover, it enables us to compute the data likelihood analytically:
the node S decouples X from C, as will be explained next. The sources
conditioned on their parents are Gaussian, therefore it is easy to com-
pute the posterior sufficient statistics of the sources, 〈S〉Q and⟨SST
⟩
Q,
needed for learning the model parameters (see Eqns (4.24) and (4.26)).
Thus, it enables us to use the ‘complete data’ concept in the VEM
algorithm, making inference tractable.
The practical result of including a state noise model in the prior is that the
algorithm infers a set of denoised (filtered) sources, by attributing a portion
of system noise to the corresponding variable.
By judiciously choosing the properties of the wavelet family/basis (those
that best match our prior notion of importance), certain features of the
signals can be emphasized. We note that there is no
¹⁵ There are many sources of ‘noise’ in such signals, such as the inherent stochasticity of neural sources and the fact that what we actually measure are some macroscopic physical quantities and not the actual neural activity. This, combined with the fact that neural signals are very weak compared with other sources, makes signal separation in neural studies quite a hard task.
universally “best” dictionary as such: wavelet families can be tailored to suit
specific needs and trade-offs, such as compact support versus smoothness16.
Nevertheless, in many cases, families have been found to give very good
performance for a variety of problems. In any case, our framework is general
and can accommodate any family.
The latent signal model for the l–th source of the functionally-constrained
SCA model can be summarized as:
\[
l: \quad
\begin{cases}
\; s_l = \Phi c_l + \varepsilon_{s_l}, \qquad \Phi = \big[ \varphi_1, \ldots, \varphi_\lambda, \ldots, \varphi_\Lambda \big], \\
\; P(c_l \mid \alpha_l, \beta_l), \\
\; P(\varepsilon_{s_l} \mid v_l), \\
\; (\alpha_l, \beta_l), \; v_l,
\end{cases}
\tag{4.10}
\]
where the corresponding rows in equation (4.10) are: the functional model
and the wavelet synthesis matrix, the sparse prior on the expansion coef-
ficients, the ‘state’ noise model, and the hyperparameters controlling the
sparsity prior and state noise.
The source noise, ES, is assumed to be Gaussian with diagonal covariance
matrix V = diag(v_l). The system noise model is therefore
\[ l: \quad P(\varepsilon_{s_l} \mid v_l) = \mathcal{N}(0,\, v_l I_N) . \tag{4.11} \]
Finally, the density P (cl|αl, βl) captures the sparsity constraint, and will
be elaborated next.
¹⁶ Mallat gives a vivid analogy between the use of wavelet bases/dictionaries and the way humans filter out irrelevant information as “noise” in noisy environments [121]. In a sense, this is exactly what source separation is about.
Modelling Sparsity: Heavy-tailed Distributions
The use of wavelet expansions results in parsimonious representations of a
broad class of signals. Sparsity is reflected in the characteristic shape of
the empirical histograms of wavelet coefficients, which exhibit distributions
highly peaked at zero. Indeed, many types of real-world signals exhibit this
behavior—see subsection 3.2.2 for an analysis of which types of signals are
expected to have a very sparse representation in wavelet bases. This is also
the key for the successful use of wavelets in signal compression. One can ar-
gue that the underlying physical reason is indeed ‘shortest description’ [175],
[96]. In addition to theoretical analyses (see e.g. Donoho [50]), numerous
experimental studies have provided additional evidence for sparsity (Simoncelli,
[155]; Moulin and Liu, [129]; Jalobeanu, [96]).
In the SCA model, the sparsity constraint for the sources is encoded in the
model of their wavelet coefficients, p(cl,λ|θcl), in combination with our choice
of wavelet basis, Φ. In the probabilistic framework, the weighted ℓp prior
used in the energy functional, E , corresponds to the generalized exponential
distribution,
\[ P(c_{l,\lambda} \mid \alpha_l, \beta_l) = \frac{\alpha_l\, \beta_l^{1/\alpha_l}}{2\, \Gamma(1/\alpha_l)} \exp\!\big( -\beta_l\, |c_{l,\lambda}|^{\alpha_l} \big) , \qquad \lambda \in I_{c_l} . \tag{4.12} \]
The use of a supergaussian prior, which puts the bulk of the probability
mass around zero, is a means to control the complexity of the representation.
By also letting the controlling hyperparameters, (αl,βl), be estimated from
the data, the generalized exponential model has proven a versatile tool for
modelling a wide class of signals.
Figure 4.5 demonstrates the excellent fit of the generalized exponential
density to the empirical distribution17 of a wavelet coefficient sequence of a
¹⁷ The bin-width for the histogram can be estimated using e.g. the Freedman–
Figure 4.4: An image exhibiting smooth areas and structural features. Satellite image of Amiens, France, downscaled to 256 × 256 pixels. © IGN.
natural image, a satellite image of Amiens, France (downscaled to 1024×1024
pixels), shown in Fig. 4.4.
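As an aside, the generalized exponential density of Eq. (4.12) and the Freedman–Diaconis bin-width rule of footnote 17 are both straightforward to implement; a sketch (the function names are ours, and the hyperparameter values are those reported for Fig. 4.5):

```python
import numpy as np
from math import gamma

def ge_pdf(c, alpha, beta):
    """Generalized exponential density of Eq. (4.12)."""
    return (alpha * beta**(1.0 / alpha) / (2.0 * gamma(1.0 / alpha))
            * np.exp(-beta * np.abs(c)**alpha))

def fd_bin_width(x):
    """Freedman-Diaconis rule: bin size = 2 * iqr(x) * N^(-1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * len(x)**(-1.0 / 3.0)

# Sanity check: with the hyperparameters of Fig. 4.5 the density integrates to one
alpha, beta = 0.753, 13.870
dc = 1e-4
c = np.arange(-10.0, 10.0, dc) + dc / 2          # midpoint grid
mass = ge_pdf(c, alpha, beta).sum() * dc
assert abs(mass - 1.0) < 1e-2
```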
4.4 Inference and Learning by Variational Lower
Bound Maximization
In the iterative thresholding algorithm of Daubechies et al. [45], a sequence
of surrogate functionals was constructed such that the problem of optimizing
the original functional, I[s], is replaced by the easier problem of iteratively
optimizing the sequence {I[s; zⁱ]}ᵢ of simpler functionals, where zⁱ is the
minimizer of the surrogate functional at the previous iteration,
zⁱ ← s⋆ = argmin_s I[s; zⁱ⁻¹].
Our method derives an iterative thresholding algorithm for blind inverse
problems in the probabilistic framework. This was driven by the need to
take uncertainty into account and to estimate critical quantities and param-
eters of the model, such as the value of the threshold and the observation
Diaconis formula [68], bin size = 2 · iqr(x) · N^{−1/3}, where iqr(·) is the interquartile range of the data sample, x = {x_n}, and N is the number of data points in x. Note, however, that the histogram is used only for display purposes.
[Figure 4.5 appears here: empirical histogram of c_k overlaid with the fitted density GE(c; µ = 0.0, α⋆ = 0.75319, β⋆ = 13.870).]
Figure 4.5: Heavy-tailed distributions for wavelet coefficients. Generalized exponential densities with learned hyperparameters can have an excellent fit to experimental data for a wide range of problems. This graph shows one such fit (with α = 0.753, β = 13.870) to the sparse wavelet decomposition, {c_λ}, of a natural image, at a particular scale (the satellite image of Amiens of Fig. 4.4, at scale j = 7, using the ‘Symmlet-8’ basis).
noise variance, from the data. An additional goal was to adapt to the spar-
sity characteristics of the unknown signals, by learning the strength of the
penalties (shape and width hyperparameters of the Bayesian priors). Now,
in the probabilistic framework, an analogous variational approach can be
used for intractable or computationally demanding models. The main idea
is to approximate/replace the data likelihood by a simpler form that is more
tractable. As shown in section 4.4.1, this new functional is a lower bound
to the likelihood, and plays a role analogous to the surrogate functionals
of DDD. As discussed in subsection 4.2, the Bayesian answer to an inverse
problem question is a posterior probability density. This object is not always
easily computable, though. This section derives a tractable approximation
to the posterior that, at the same time, minimizes the distance between the
data likelihood and its lower bound. We envisage applying our model to
large datasets, such as biosignals. The problem becomes variational due to
the probabilistic assumptions about the structure of interactions of the data
points in the variational posterior. This will be discussed in more detail in
subsection 4.4.1.
4.4.1 Variational Lower Bounding
Consider a model with hidden variables x, observed variables y, and model
parameters θ, parameterized by a set of hyperparameters Θ, whose general
structure is as in Fig. 4.2. Our aim is to maximize the function f(θ) =
p(y, θ), i.e. the joint distribution of data and parameters.
Now, for any valid probability distribution, Q(x), on the latent variables,
x, the functional
\[ \mathcal{F}\big[ \theta, Q(x) \big] = \int dx\; Q(x) \log p(x, y, \theta) - \int dx\; Q(x) \log Q(x) \tag{4.13} \]
forms a lower bound to f(θ). To see this, we take the joint distribution again
and write it as
\[ \log p(y, \theta) = \log\!\left( \int dx\; p(x, y, \theta) \right) = \log\!\left( \int dx\; Q(x)\, \frac{p(x, y, \theta)}{Q(x)} \right) . \]
To obtain the lower bound, we apply Jensen’s inequality18 and move the
logarithm inside the integral:
\[ \log p(y, \theta) \;\ge\; \int dx\; Q(x) \log\!\left( \frac{p(x, y, \theta)}{Q(x)} \right) = \mathcal{F}(\theta, Q) . \]
The difference between the lower bound, F , and the joint probability, f , is
the Kullback-Leibler divergence between the trial distribution Q(x) and the
posterior P (x|y, θ), KL(Q||P ) ≥ 0, which is a measure of the “distance”
¹⁸ Jensen’s inequality is φ(E[X]) ≤ E[φ(X)], where φ is a convex function and X is an integrable random variable.
between those two distributions:
\[ \mathrm{KL}\big[ Q(x) \,\|\, P(x \mid y, \theta) \big] = \int dx\; Q(x) \log\!\left( \frac{Q(x)}{P(x \mid y, \theta)} \right) . \]
Indeed, by writing F(θ, Q) as
\[ \begin{aligned} \mathcal{F}(\theta, Q) &= \int dx\; Q(x) \log\!\left( \frac{p(x, y, \theta)}{Q(x)} \right) = \int dx\; Q(x) \log\!\left( \frac{p(x \mid y, \theta)\, p(y, \theta)}{Q(x)} \right) \\ &= -\int dx\; Q(x) \log\!\left( \frac{Q(x)}{p(x \mid y, \theta)} \right) + \log p(y, \theta) \\ &= -\mathrm{KL}(Q \,\|\, P) + \log p(y, \theta) , \end{aligned} \]
we finally get
\[ \log p(y, \theta) = \mathrm{KL}\big[ Q(x) \,\|\, P(x \mid y, \theta) \big] + \mathcal{F}(\theta, Q) \;\ge\; \mathcal{F}(\theta, Q) . \]
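This decomposition can be verified numerically on a toy model with a three-state latent variable (an illustration of the identity only, unrelated to the SCA model; the joint probabilities are made up):

```python
import numpy as np

# Toy model: discrete latent x in {0,1,2}; p_joint[i] = p(x=i, y_obs, theta)
p_joint = np.array([0.2, 0.05, 0.15])
log_p = np.log(p_joint.sum())              # log p(y, theta)

Q = np.array([0.5, 0.2, 0.3])              # any valid trial distribution Q(x)
posterior = p_joint / p_joint.sum()        # P(x | y, theta)

F = np.sum(Q * np.log(p_joint)) - np.sum(Q * np.log(Q))   # lower bound, Eq. (4.13)
KL = np.sum(Q * np.log(Q / posterior))                    # KL(Q || P)

assert np.isclose(log_p, F + KL)           # the decomposition holds exactly
assert F <= log_p                          # F is a lower bound (KL >= 0)

# The bound is tight when Q equals the true posterior, Eq. (4.14):
F_opt = np.sum(posterior * np.log(p_joint / posterior))
assert np.isclose(F_opt, log_p)
```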
We want to find the distribution Q(x) that makes this bound as tight
as possible. By functionally differentiating F[θ, Q(x)] with respect to Q(x),
under the normalization constraint ∫ dx Q(x) = 1,
\[ \frac{\partial}{\partial Q(x)} \left\{ \mathcal{F}\big[\theta, Q(x)\big] + \lambda \left( \int dx\; Q(x) - 1 \right) \right\} = 0 , \]
we find that the optimal variational distribution is obtained if
\[ Q(x) = P(x \mid y, \theta) , \tag{4.14} \]
i.e. if it is set to the true posterior distribution over the latent variables. The
Kullback-Leibler divergence KL(Q||P ) is also zero when Q = P and positive
otherwise. Therefore the functional F[θ, Q(x)] is a strict lower bound to
log p(y, θ). Thus, maximizing F[θ, Q(x)] with respect to the variational
distribution Q(x) is equivalent to minimizing the Kullback–Leibler divergence
KL[Q(x)‖P(x|y, θ)] between Q and the true posterior.
Note that we only derived the general form of the lower bound, F , here,
in order to make the connection with the variational approach of DDD (and
introduce the latent variable S). The derivation of the NFE for the SCA
model will be done in section 4.4.3.
The variational algorithm for MAP estimation therefore uses the lower
bound F , also called the (negative) ‘variational free energy’ (NFE), as opti-
mization functional,
\[ \mathcal{F}\big[ Q_x(x), y, \theta, \Theta \big] = \Big\langle \log P( \underbrace{x, y, \theta}_{G} \mid \Theta) \Big\rangle_{Q} + H\big[ Q_x(x) \big] , \tag{4.15} \]
where the density P is parameterized (conditioned) in terms of the hyperpa-
rameters, Θ. In the above equation, the first term is the average ‘variational
energy’ and the second term is the entropy of the variational posterior, Q.
The average, 〈·〉Q, is over the latent variables, x, with respect to their pos-
terior, Q. In the left-hand side, we have made explicit the dependence on all
relevant variables.
The first term, in turn, decomposes into two terms: the (negative) ‘energy’
of the global configuration of the system of observed and latent variables,
x,y, comprising the ‘complete data’, which is parameterized by θ and Θ
and defined to be
\[ E_{\theta,\Theta}(x, y) \stackrel{\text{def}}{=} -\log P(x, y \mid \theta, \Theta) , \]
and the log-prior of the parameters, log P(θ|Θ).
Note that unlike the optimization function of the EM algorithm for maximum
likelihood estimation (the so-called ‘Q-function’), the variational MAP
F-function depends on all the elements, {x, y, θ}, of the graph (model) G
(see Fig. 4.2), allowing the incorporation of priors on the parameters. This
is essential for our model, because we want to impose the sparsity constraint
on the representation of the latent signals in a signal dictionary.
The negative free energy, F , serves two purposes: (a) It allows us to
derive the estimation equations for the unknown quantities in the model,
and (b) It allows us to monitor the convergence of the algorithm.
Due to the Markovian properties of the graphical model G, which trans-
late to additivity in log-space, the NFE is decomposed into terms involving
individual contributions from each element of G. The expression of F for
SCA is computed in Sect. 4.4.3.
Inference and Learning
Variational lower-bounding of p(y, θ) by F[θ, Q(x)] leads to an estimation
algorithm that iteratively finds a good bound, i.e. minimizes the
Kullback–Leibler divergence KL[Q(x)‖P(x|y, θ)], and subsequently optimizes
this bound with respect to the model parameters.
Our algorithm is structured around two nested phases: an inner phase
involving the estimation of the latent variables and model parameters and an
outer phase of the estimation of hyperparameters. These are repeated for a
number of ‘epochs’19 until convergence. After convergence, they are followed
by a source reconstruction stage.
The computations of the inner phase can be summed up as: “Integrate
over the latent variables and optimize over the parameters”. This is a varia-
tional version of the Expectation-Maximization algorithm (Neal and Hinton,
¹⁹ An epoch is defined as a batch of iterations during which the hyperparameters are held constant and only the parameters and latent variables are updated. After each epoch, a hyperparameter update is performed and the internal loop is run again until (local) convergence; see Alg. 3.
[132]; Minka, [127]). We estimate the model parameters, θ, (learning) in the
variational M-step as
\[ \theta = \arg\max_{\theta} \int dx\; Q_x(x) \big[ \log P_{\Theta}(x, y, \theta) \big] . \tag{4.16} \]
The variational posterior over latent variables (inference) is obtained in the
variational E–step by the functional optimization:
\[ \frac{\partial}{\partial Q_x(x)} \mathcal{F}\big[ Q_x(x); y, \theta, \Theta \big] = 0 \;\;\Rightarrow\;\; Q_x(x) = P\big( x \mid y, \theta, \Theta \big) , \tag{4.17} \]
using the results obtained at the previous iteration for the rest of the model.
In practice, this step is equivalent to computing the sufficient statistics of
the distribution Qx(x).
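The "integrate over the latent variables, optimize over the parameters" alternation can be illustrated on the simplest possible latent-variable problem, a Gaussian mean with missing observations (a toy stand-in for the E/M alternation, not the SCA updates themselves):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy EM: estimate the mean mu of N(mu, 1) when some observations are latent.
y = rng.normal(5.0, 1.0, size=100)
observed = np.ones(100, dtype=bool)
observed[::4] = False                       # every 4th value treated as missing

mu = 0.0
for _ in range(100):
    # E-step: the posterior mean of each missing value is the current mu
    filled = np.where(observed, y, mu)
    # M-step: maximize the expected complete-data log-likelihood over mu
    mu_new = filled.mean()
    converged = abs(mu_new - mu) < 1e-12
    mu = mu_new
    if converged:
        break

# The EM fixed point is the mean of the observed entries
assert np.isclose(mu, y[observed].mean())
```

Each iteration increases the lower bound on the observed-data likelihood, exactly as the variational F does for the SCA model.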
The assignment in equation (4.17) leads to the variational distribution
that makes the lower bound tight. However, using the exact posterior as Q
does not buy us much if it leads to intractable computations. The bound
in equation (4.13) is valid for any variational distribution Qx(x). To be of
practical use, we must choose a distribution that will allow us to evaluate the
lower bound (4.13) and the estimation equations (4.16) and (4.17). There is
a trade-off to be made, therefore, between tractability and quality of approx-
imation. Subsection 4.4.1 discusses the issue of inference in large datasets.
For the SCA model in particular, the node X comprises the observations
and the node S the latent variables. The set of model parameters, θ, is
{C, A}, and the set of hyperparameters, Θ, is {(α, β), v, a, σ_X²}. The union
of these node-sets gives the graphical model G. Each of these variables will
be estimated in subsection 4.4.2. The free energy of the system then becomes
\[ \mathcal{F}\big[ Q_x(x), y, \theta, \Theta \big] = \mathcal{F}\big[ Q(S), X;\; C, A;\; (\alpha, \beta), v, a, \sigma_X^2 \big] . \tag{4.18} \]
Its actual expression will be derived in subsection 4.4.3.
Large datasets: Structured variational mean-field approximation
The iterative thresholding framework, as described in Daubechies et al. [45],
is completely general within the setting of sparse linear inverse problems: it
only requires that objects s ∈ S are mapped to objects x ∈ X via a bounded
operator, A, under the sparsity constraint. These assumptions reflect a wide
class of real-world signals.
BSS is inherently a multidimensional problem. Data are usually regarded
as D observations from D corresponding sensors, {x_d}_{d=1}^{D}. However, there is
no restriction in thinking of them as one observation, X = (x_1, . . . , x_d, . . . , x_D),
from a single “sensor”. Analogously, one can consider the L ‘spatial maps’
(sources), {s_l(v_n)}_{l=1}^{L}, where v_n denotes a voxel here, as a collection of L
separate functions in ℝᴺ (in the discrete case), or as one function S(v_n) in
ℝ^{L×N}. Depending on our point of view, therefore, the problem can either be
seen as a single, 1×1 mapping S ↦ X, or as L×D smaller ones. Recalling the
observation equation in matrix form, X = AS + E_X, the first view implies
that we can search for the spatial components S = S(v_n) as a single object,
the minimiser, S⋆, of the functional J.
Based on the probabilistic semantics of the SCA graphical model (the ‘ex-
plaining away’ phenomenon: see figure 4.3a), the source posterior is coupled
over data points, vn. (See Fig. on p. 68 of Saul’s paper and pp. 47–49 of
Beal.) This creates the problem, however, that we must compute statistics of
(D ×N)–dimensional random variables (the sources) and estimate and ma-
nipulate (DN ×DN)–dimensional matrices (the observation operator). For
many real-world problems, such as the analysis of functional MRI (where
D ∼ O(10²) and N ∼ O(10⁵)), this is still computationally very demanding.
A central assumption that we will make here, therefore, is that the posterior
over S factorizes over data points n (but not over latent dimensions²⁰, l):
\[ Q(S) = \prod_{n} Q(s_n) . \tag{4.19} \]
This is akin to a mean-field approximation in statistical physics, where the
probability over ‘sites’ in a system of “particles” is approximated by a factor-
ized distribution by ignoring fluctuations in local fields and replacing them
with their mean value. The equivalent “sites” here are the source data points,
sn. This assumption will allow us to compute covariance matrices of size
D×D, instead of DN×DN . This formulation leads to a variational learning
algorithm, as discussed in subsection 4.4.1. The estimation equations for all
variables in the model are presented next.
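The computational saving of the factorization can be made concrete by counting covariance entries (using sizes of fMRI order, as in the text; the factor-of-N saving is exact):

```python
# Covariance storage for the coupled posterior over all voxels versus the
# factorized posterior of Eq. (4.19); L latent dimensions, N data points.
L, N = 100, 100_000                      # fMRI-order sizes (illustrative)
full_entries = (L * N) ** 2              # one (L*N) x (L*N) covariance matrix
mf_entries = N * L * L                   # N blocks, each of size L x L
assert full_entries // mf_entries == N   # mean field saves a factor of N

# In float64 bytes: ~800 TB for the coupled posterior versus ~8 GB factorized
full_bytes, mf_bytes = 8 * full_entries, 8 * mf_entries
assert full_bytes == 8 * 10**14 and mf_bytes == 8 * 10**9
```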
4.4.2 Estimation Equations
In this section we show that the optimum reconstruction algorithm takes
the form of generalized iterative thresholding. Inference and learning in
the SCA model involves four estimation problems: estimation of the sources
themselves and estimation of the source model, mixing model, and noise
model parameters. The variables in the SCA model may be also partitioned
into groups related to the above estimation problems:
• The ‘source’ model, G_S = {S, v} ∪ G_W, where G_W = {Φ, C, (α, β)} is
the wavelet submodel,

• The mixing model, G_M = {A, a},

• The sensor noise model, G_N = {Σ_X}, and

• The data, G_D = {X}.

²⁰ A fully factorized posterior, also called a ‘naïve mean-field’ approximation, although computationally easier, misses the correlations in the source posterior, often giving subpar results.
These subgraphs are such that GS∪GM∪GN∪GD is again the whole graphical
model, G.
The full set of update equations for the BIT-SCA model is shown next.
In the following, all averages, 〈·〉, are with respect to the variational posterior
distribution over S, Q(S | X, θ, Θ⋆), since this is the only latent variable in
the model (all other quantities are estimated at their most probable values).
We take an ‘empirical Bayes’ approach and estimate the hyperparameters
of the priors from the data.
We will present the estimation equations in groups corresponding to the
source, mixing, and noise submodels, G_S, G_M, and G_N, respectively.
Bayesian Iterative Thresholding
Remark 15 Bayesian iterative thresholding forms the core of our blind in-
version algorithm. It imposes the sparsity constraint and incorporates the
sparsity prior in the computation of the posterior quantities.
It operates within the source submodel, GS.
Inference: Sources. The update equations for the sources compute the
posterior sufficient statistics for the node S. For the general case of a
multivariate jointly Gaussian pair of random variables, (x, y), under the
graphical model x → y, with y observed and x latent, the posterior mean
vector is
\[ \mu_{x|y} = \Sigma_{xx} \Big[ \Sigma_{xx}^{-1}\, \mu_x + A^{\mathsf T} \big( A\, \Sigma_{xx} A^{\mathsf T} + \Sigma_{y|x} \big)^{-1} \big( y - A \mu_x \big) \Big] , \tag{4.20} \]
which can be re-written as
\[ \mu_{x|y} = \Sigma_{xx} \Big[ \big( \Sigma_{xx}^{-1} - A^{\mathsf T} \Sigma^{-1} A \big) \mu_x + A^{\mathsf T} \Sigma^{-1} y \Big] , \tag{4.21} \]
in order to better highlight the contributions from the parent and children
nodes of node x, and the posterior covariance matrix is
\[ \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xx} \big( A^{\mathsf T} \Sigma_{yy}^{-1} A \big) \Sigma_{xx} . \tag{4.22} \]
In these equations, Σyy denotes the total noise in the data, which is the sum
of the sensor noise and the ‘system’ noise (i.e. source noise), propagated
through the observation operator, A:
\[ \Sigma \equiv \Sigma_{yy} = \underbrace{\Sigma_{y|x}}_{\text{sensor noise}} + \underbrace{A\, \Sigma_{xx} A^{\mathsf T}}_{\text{system noise}} , \tag{4.23} \]
in order to emphasize the contributions from the parent and children nodes.
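Equations (4.20)–(4.23) are the standard linear-Gaussian conditioning formulas; a sketch verifying them against the equivalent information (precision) form on random toy matrices (all sizes and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
dx, dy = 3, 4
mu_x = rng.standard_normal(dx)
Sxx = 0.5 * np.eye(dx)                       # prior covariance Sigma_xx
A = rng.standard_normal((dy, dx))
Snn = 0.2 * np.eye(dy)                       # observation noise Sigma_{y|x}
y = rng.standard_normal(dy)

# Moment-form posterior of x given y = A x + e, as in Eqs (4.20), (4.22), (4.23)
S_tot = A @ Sxx @ A.T + Snn                  # total covariance, Eq. (4.23)
mu_post = Sxx @ (np.linalg.solve(Sxx, mu_x)
                 + A.T @ np.linalg.solve(S_tot, y - A @ mu_x))
cov_post = Sxx - Sxx @ (A.T @ np.linalg.solve(S_tot, A)) @ Sxx

# Cross-check against the information (precision) form of the same posterior
P = np.linalg.inv(Sxx) + A.T @ np.linalg.inv(Snn) @ A
mu_info = np.linalg.solve(P, np.linalg.inv(Sxx) @ mu_x
                          + A.T @ np.linalg.inv(Snn) @ y)
assert np.allclose(mu_post, mu_info)
assert np.allclose(cov_post, np.linalg.inv(P))
```

The agreement of the two forms is the matrix-inversion (Woodbury) identity at work.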
Applying these to the SCA model, the source means are computed as
\[ \widehat{S} \stackrel{\text{def}}{=} \langle S \rangle = V \Big[ V^{-1} \bar{S} + A^{\mathsf T} \big( A V A^{\mathsf T} + \Sigma_X \big)^{-1} \big( X - A \bar{S} \big) \Big] , \tag{4.24} \]
where the l-th row, s̄_l, of S̄ is given by
\[ l: \quad \bar{s}_l = \Phi c_l \tag{4.25} \]
for l = 1, . . . , L, and can be interpreted as the first-order (mean) ‘message’²¹
from the parents {φ_λ}, weighted by the coefficients c_{l,λ}. The quantity
X − A S̄ is the residual, r, in observation space. The variance of the source
noise, V = diag(v_l), is computed using Eq. (4.29).
Equation (4.21) is to be contrasted with the equivalent deterministic
quantity, (I − AᵀA)s⁽ⁱ⁾ + Aᵀx, which is the argument of the thresholding
operator, S_τ(·), in the estimation equation (3.17). We can see that in order
²¹ For an interpretation of EM-type algorithms as message passing, see Eckford [56] and Dauwels et al. [47].
to get the original IT, the covariance matrices V and Σ must be implicitly
set to be the unit matrices22 IL and ID, respectively23. We can also compare
Eq. (4.24) with the equivalent equation in the classical gradient descent solu-
tion, f (i+1) = η(i)HTg +(I− η(i)HTH
)f (i): the ad-hoc “learning rate”, η(i),
which is, in general, tricky to set, can be explained probabilistically as the
ratio vl/σ2. This is automatically set in this model.
Finally, the second-order posterior sufficient statistic is

\langle SS^\top\rangle = \langle S\rangle\langle S\rangle^\top + N\,C_{SS} ,   (4.26)

where the covariance matrix, C_{SS}, is

C_{SS} = V - V\left(A^\top\Sigma_X^{-1}A\right)V .   (4.27)
Remark 16 Note that the posterior covariance matrix, CSS, is given by
the prior covariance matrix, V, after we subtract the second term of the
equation (4.27): this means that observing the data, X, makes the poste-
rior distribution Q(S) narrower, since we obtain more information about the
system.
Source noise. We estimate the precision hyperparameter (inverse variance)
of the l-th source noise, 1/v_l, using the following statistics of the Gamma
distribution^{24}, Ga(b_{v_l}, c_{v_l}):

c^\star_{v_l} = \tfrac{1}{2}N ,\qquad
b^\star_{v_l} = \left(\tfrac{1}{2}\left[\mathrm{tr}\left(C_{s_l s_l}\right) + \left(\langle s_l\rangle - c_l\Phi\right)\left(\langle s_l\rangle - c_l\Phi\right)^\top\right]\right)^{-1} ,   (4.28)

where N is the cardinality of the set \{s_{l,1},\dots,s_{l,N}\}, which forms
the children of the node v_l, and the quantity in square brackets in the
second equation is the second-order posterior statistic of the residual
information in the node s_l. The source noise variances, v = (v_1,\dots,v_L),
are then computed as

v^\star_l = \left(b^\star_{v_l} c^\star_{v_l}\right)^{-1} ,\qquad l = 1,\dots,L .   (4.29)

^{22} Note, however, that the quantity estimated in Eq. (4.24) is the posterior mean of the random variable S. On the contrary, the sources in the original IT are deterministic quantities, and the source noise "variances" are v_{s_l} \to 0.
^{23} \Sigma = \cdots, E = \cdots.
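In scalar form, the update pair (4.28)–(4.29) collapses to a one-liner: v^\star_l is the average unexplained energy per sample. A hedged sketch (the function name and scalar interface are ours, not the thesis code):

```python
def update_source_noise(trace_C, resid_sq, N):
    """Source-noise update of Eqs (4.28)-(4.29).
    trace_C  : tr(C_{s_l s_l}), total posterior variance of source l
    resid_sq : ||<s_l> - c_l Phi||^2, squared residual of the wavelet fit
    N        : number of samples of source l."""
    c_star = 0.5 * N                             # Gamma shape statistic
    b_star = 1.0 / (0.5 * (trace_C + resid_sq))  # Gamma scale statistic
    return 1.0 / (b_star * c_star)               # v*_l, Eq. (4.29)
```

Note that the result equals (tr C + ||residual||^2)/N, i.e. the mean residual energy, augmented by the posterior variance.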
Wavelet coefficients: Bayesian shrinkage. We now show how the wavelet
coefficients are updated by fusing information from the sparse prior and the
data, i.e. the parent and children nodes of C, respectively. This leads to a
Bayesian derivation of a 'wavelet shrinkage' rule. The resulting update rule
contains soft thresholding as a special case.

The update equation for C is obtained by maximizing F with respect to C.
We have:

C = \arg\max_C \sum_l -\frac{1}{2v_l}\left[-2\langle s_l\rangle\left(c_l\Phi\right)^\top + c_l c_l^\top\right] + \sum_l \log p(c_l|\alpha_l,\beta_l) ,   (4.30)

and

\bar{c}_l = \Phi^{-1} s_l ,\qquad l = 1,\dots,L .   (4.31)
^{24} The Gamma distribution is suitable for modelling scale parameters (positive quantities) and is conjugate to a Gaussian likelihood with unknown mean. For a Gamma-distributed random variable x \sim Ga(b,c) = \frac{1}{\Gamma(c)b^c}\,x^{c-1}e^{-x/b}, its sufficient statistics are \langle x\rangle and \langle\log x\rangle. Here we consider it at the uninformative limit, \bar{c}\to 1, 1/\bar{b}\to 0, where the bars denote information coming from the prior of x. Therefore its parameters are updated using information from the data only.
An important special case is obtained if we use the Laplace prior,

\log p_l(c_l|\beta_l) \propto -\beta_l\|c_l\|_1 = -\beta_l\sum_\lambda |c_{l,\lambda}| ,   (4.32)

corresponding to the ℓ1 prior of Daubechies et al. (where p = 1 is the
smallest allowable value of the exponent of the constraint in the algorithm
of DDD). The update equation for the wavelet coefficients then becomes the
soft-thresholding function S_\tau : \bar{c}\mapsto c in this case (see Fig. 4.6):

c_{l,\lambda} = \mathrm{sgn}(\bar{c}_{l,\lambda})\left(|\bar{c}_{l,\lambda}| - \tau_l\right)\,\mathbb{I}_{|\bar{c}_{l,\lambda}|>\tau^\star_l} ,\qquad \lambda = 1,\dots,\Lambda ,   (4.33)

applied component-wise. In the above equation, \mathbb{I}_{|x|>x^\star} is the
indicator function^{25}, defined as \mathbb{I}(x) = 1 if |x| > x^\star and
\mathbb{I}(x) = 0 otherwise (see Fig. 3.8 for a representative plot). This
operation sets all values of \bar{c}_\lambda in [-\tau^\star, \tau^\star] to 0,
where the optimal value for the threshold, \tau^\star, is computed below.
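For concreteness, the rule of Eq. (4.33) can be written in a few lines of code (an illustrative sketch; the helper name is ours):

```python
import math

def soft_threshold(c_bar, tau):
    """Soft-thresholding rule of Eq. (4.33): coefficients inside
    [-tau, tau] are set exactly to zero; the rest are shrunk by tau
    towards zero, preserving their sign (the consistency property
    discussed in the footnote)."""
    if abs(c_bar) <= tau:
        return 0.0
    return math.copysign(abs(c_bar) - tau, c_bar)
```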
However, a more flexible sparse prior will be used in this model: a
generalization of the above, one that adapts its shape according to the
sparsity of the inferred signal (see Fig. 4.5). In particular, we use a
generalized exponential on c_l, defined as

p_l(c_{l,\lambda}) = \frac{1}{Z(\alpha_l,\beta_l)}\exp\left(-\beta_l|c_{l,\lambda}|^{\alpha_l}\right) ,\qquad \forall c_{l,\lambda}\in c_l ,   (4.34)

where Z(\alpha_l,\beta_l) = \frac{2\Gamma(1/\alpha_l)}{\alpha_l\beta_l^{1/\alpha_l}} is a normalizing constant. This
quantity will play a role later, during the estimation of the shape and scale

^{25} Another reason the indicator function is used in the thresholding function (see Fig. 3.8) is consistency: if an input coefficient \bar{c}^0_\lambda \in [-\tau,\tau] were simply mapped to c^0_\lambda = \mathrm{sgn}(\bar{c}^0_\lambda)(|\bar{c}^0_\lambda| - \tau), then, due to the shift by \tau, negative coefficients would become positive and vice versa. The only way to maintain both conditions of Eq. (4.33) for -\tau \le \bar{c}^0_\lambda \le \tau is c^0_\lambda = 0.
Figure 4.6: Thresholding (shrinkage) functions \bar{c}\mapsto c (output, thresholded coefficient, vs input wavelet coefficient). Blue curve: hard threshold with parameter \tau, h_\tau(\bar{c}); red curve: soft threshold, s_\tau(\bar{c}); green curve: a smooth approximation to the soft threshold, s_{\tau,\delta}(\bar{c}) = \bar{c} - v\beta\,\bar{c}/\sqrt{\bar{c}^2+\delta}, with parameter \tau = \tau(\beta,v), where \bar{c}/\sqrt{\bar{c}^2+\delta}\to\mathrm{sgn}(\bar{c}) as \delta\to 0.
parameters, \alpha_l, \beta_l, from the data. The update equation for the wavelet
coefficients under a generalized exponential prior then becomes

c_{l,\lambda} = \bar{c}_{l,\lambda} - v_l\left(\alpha_l\beta_l|\bar{c}_{l,\lambda}|^{\alpha_l-1}\mathrm{sgn}(\bar{c}_{l,\lambda})\right)\mathbb{I}_{|\bar{c}_{l,\lambda}|>\tau^\star_l} .   (4.35)

Noting that the factor -\alpha_l\beta_l|\bar{c}_{l,\lambda}|^{\alpha_l-1}\mathrm{sgn}(\bar{c}_{l,\lambda}) in the above shrinkage rule is
the 'score function', \psi(c), defined as the logarithmic derivative of
the prior distribution on c_{l,\lambda}, \psi(c) = \frac{d\log p(c)}{dc}, we can rewrite it as

c_{l,\lambda} = \bar{c}_{l,\lambda} + v_l\,\psi_l(\bar{c}_{l,\lambda})\,\mathbb{I}_{|\bar{c}_{l,\lambda}|>\tau^\star_l} ,\qquad \text{where } \bar{c}_{l,\lambda} = \left(\Phi^\top\langle s_l\rangle^\top\right)_\lambda .
We can then give the following interpretation:

Remark 17 The new value of the wavelet coefficient, c_{l,\lambda}, is computed by
adjusting the information coming from the child of the node c_{l,\lambda}, i.e. the
wavelet analysis of the source s_l, by the sparse prior, weighted by the
variance of the system (source) noise. For high-precision sources we have
v_l \to 0, and the input wavelet coefficients pass through the
'non-linearity', \psi(\cdot), almost unchanged; see also the estimation
equations for the threshold, \tau_l, in subsection 4.4.2, used in the above
shrinkage rule, and Fig. 4.6: the shape of the shrinkage curve tends to the
line c = \bar{c} as \tau_l \to 0. By contrast, in high-noise regimes the
second term has a significant impact; the value of the threshold will also be
larger, filtering out all but the largest wavelet coefficients as noise.
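A sketch of the generalized shrinkage rule of Eq. (4.35), written via the score function (our illustration, not the thesis code; for \alpha = 1 it reduces to soft thresholding with \tau = v\beta, matching Eq. (4.46)):

```python
import math

def ge_shrink(c_bar, alpha, beta, v, tau):
    """Shrinkage under the generalized-exponential prior, Eq. (4.35):
    c = c_bar + v * psi(c_bar) outside the dead zone [-tau, tau],
    with score psi(c) = d log p(c)/dc = -alpha*beta*|c|**(alpha-1)*sgn(c)."""
    if abs(c_bar) <= tau:
        return 0.0
    psi = -alpha * beta * abs(c_bar) ** (alpha - 1.0) * math.copysign(1.0, c_bar)
    return c_bar + v * psi
```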
The hyperparameters, (αl, βl), will be estimated in subsection 4.4.2, ‘Adap-
tive Sparsity’, and the value of the threshold, τ ⋆l , will be computed in sub-
section 4.4.2, ‘Thresholds’, for both cases26.
Learning
In blind inverse problems, we must also estimate the observation operator, A.
In addition, in order to perform automatic denoising, the parameter of the
observation noise, ΣX, must be estimated. The rest of the equations are the
precision of the mixing matrix, a, and the updates for the hyperparameters
of the wavelet prior, (α,β). The latter are discussed in subsection 4.4.2.
Under the Bayesian framework, constraints (priors) on the parameters
can naturally be imposed. These enter the corresponding update equations
so as to “steer” the estimated values of the parameters towards preferred
regions of the parameter space.
^{26} The case of the Laplace prior is simply a special case with the shape parameter, \alpha, set to 1.
Mixing model. The update equation for the parameter, A, of the mixing
model, G_A, is given by the implicit equation

A = \left(X\langle S\rangle^\top - \Sigma_X\,a\,A\right)\left\langle SS^\top\right\rangle^{-1} ,   (4.36)

which is obtained by differentiating F with respect to A.

For isotropic observation noise, \Sigma_X = \sigma^2_X I_D, the update
equation can be solved explicitly, to give:

A = \frac{1}{\sigma^2_X}\left(X\langle S\rangle^\top\right)\left(a I_L + \frac{1}{\sigma^2_X}\left\langle SS^\top\right\rangle\right)^{-1} .   (4.37)

In the above equation, the term a I_L is the information coming from the
parent, a, of A. The hyperparameter a corresponds to the Lagrange multiplier
\lambda_A, modulating the corresponding penalty term of the energy
functional, E, if \lambda_A = \sigma^2_X a. Its estimate is given next.
The mixing-matrix precision hyperparameter, a, again assuming a
Gamma-distributed variable, a \sim Ga(a^\star_a, 1/b^\star_a), is estimated as

a = a^\star_a / b^\star_a ,   (4.38)

where

a^\star_a = \tfrac{1}{2}DL \qquad\text{and}\qquad b^\star_a = \tfrac{1}{2}\sum_{d=1}^{D}\sum_{l=1}^{L} (A_{dl})^2 .   (4.39)
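The pair of updates (4.37)–(4.39) can be sketched as follows (an illustrative NumPy fragment under the isotropic-noise assumption; the names are ours, not the thesis implementation):

```python
import numpy as np

def update_mixing(X, S_mean, SSt, sigma2_x, a):
    """Mixing-matrix update for isotropic noise, Eq. (4.37):
    A = (1/sigma2) X <S>^T (a I + (1/sigma2) <S S^T>)^{-1}."""
    L = SSt.shape[0]
    return (X @ S_mean.T / sigma2_x) @ np.linalg.inv(a * np.eye(L) + SSt / sigma2_x)

def update_mixing_precision(A):
    """Gamma update for the mixing-matrix precision, Eqs (4.38)-(4.39)."""
    D, L = A.shape
    a_star = 0.5 * D * L
    b_star = 0.5 * np.sum(A ** 2)
    return a_star / b_star
```

With a = 0 the update collapses to the least-squares solution X\langle S\rangle^\top\langle SS^\top\rangle^{-1}, so in the noiseless case the true mixing matrix is recovered exactly.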
Sensor noise model. The variable \Sigma_X of the sensor noise model,
G_\Sigma, i.e. the sensor noise covariance, is updated next. Optimization of
the negative free energy with respect to \Sigma_X is equivalent to estimating
the observation noise covariance at its maximum-likelihood value^{27}:

\Sigma_X = \frac{1}{N}\left\langle(X - AS)(X - AS)^\top\right\rangle ,   (4.40)

where N is the number of data points^{28}. Another option is to use the
effective sample size, N_{\mathrm{eff}} = N - p, where p is the number of
parameters estimated from the data.

For isotropic observation noise, \Sigma_X = \sigma^2_X I_D, we obtain

\sigma^2_X = \frac{1}{D}\mathrm{Tr}(\Sigma_X) ,   (4.41)
where D is the number of sensors.
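A sketch of the corresponding update (our illustration; it expands the expectation in Eq. (4.40) using the posterior statistics \langle S\rangle and \langle SS^\top\rangle):

```python
import numpy as np

def update_sensor_noise(X, A, S_mean, SSt, N):
    """ML update of the sensor-noise covariance, Eqs (4.40)-(4.41):
    <(X-AS)(X-AS)^T> = XX^T - X<S>^T A^T - A<S>X^T + A<SS^T>A^T."""
    E = X @ X.T - X @ S_mean.T @ A.T - A @ S_mean @ X.T + A @ SSt @ A.T
    Sigma_X = E / N
    sigma2_x = np.trace(Sigma_X) / X.shape[0]   # isotropic case, Eq. (4.41)
    return Sigma_X, sigma2_x
```

When the data are explained exactly (X = AS and no posterior uncertainty), the estimated covariance vanishes, as it should.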
Adaptive sparsity: estimating the shape and width of the generalized
exponentials

In the original IT algorithm of DDD, the exponent, p, of the ℓp penalty on
the wavelet coefficients, and the corresponding Lagrange multiplier,
\lambda_C, were chosen a priori. The hyperparameters, (\alpha_l, \beta_l), of
the wavelet coefficient model can be learned from the data as well, by
maximizing the variational objective, F, with respect to these
hyperparameters.

For the Laplace prior distribution (\alpha^\star_l \equiv 1) on the l-th
source coefficients, \{c_{l,\lambda}\}_\lambda, which corresponds to an ℓ1
norm, an estimate of \beta_l based on the current values of the wavelet
coefficients is

\beta^\star_l = \frac{\sqrt{2}}{\sqrt{\frac{1}{\Lambda}\sum_\lambda (c_{l,\lambda})^2}} ,\qquad\text{where } \Lambda = |\{c_{l,\lambda}\}_\lambda| ,\ \text{for all } l .   (4.42)

^{27} Strictly speaking, \Sigma_X is a hyperparameter positioned at one of the roots of the graph, according to our model specification. To be a parameter proper, it should also be assigned a prior, such as an inverse Gamma distribution, \Sigma_X \sim IGa(b_{\Sigma_X}, c_{\Sigma_X}). Following Ghahramani and Beal [73], we consider it at the limit b_{\Sigma_X}\to\infty and c_{\Sigma_X}\to 0 here. In practice we recover some of the "subjective information" provided by the prior by setting a minimum allowed variance for the noise, \sigma^2_{X,\min}, estimated using a modified 'universal' threshold (see Donoho [54]). This works quite well in practice. The model could be extended to cover the informative-prior case.
^{28} This is the total number of observations per sensor. For example, in the case of images, N = N_x \cdot N_y \cdots.
Although choosing the exponent p (equivalent to our \alpha) greater than or
equal to 1, as in Daubechies et al. [45] and in SCA-IT, leads to a convex
problem, we found that it is not flexible enough. Moulin and Liu studied
'Bayesian robustness' (i.e. the effect of misspecifying the prior) with
respect to sparsity and wavelets [129], using the prior
\pi(\theta) = a\exp(-|b\theta|^\nu) and concluding that "the shape parameter,
\nu, is typically between 0.5 and 1". Our experiments support their finding.
Leporini and Pesquet (2001) also consider the corresponding exponent, r > 0.
The choice \alpha < 1 has also been dealt with in Kreutz-Delgado et al.
[100]. We do not put any restriction on the value of \alpha, other than
\alpha > 0, and let it be inferred by the algorithm^{29}.
For the generalized exponential prior, which corresponds to an ℓp norm, we
follow Everson and Roberts [58] (the 'flexible nonlinearity' approach) and
estimate the hyperparameters \alpha and \beta at their most probable values.
The relevant part of the NFE to be optimized is

F[\cdots;\alpha,\beta] = \sum_l\left[-\beta_l\sum_\lambda|c_{l,\lambda}|^{\alpha_l} + \Lambda\underbrace{\left(\log\alpha_l + \frac{1}{\alpha_l}\log\beta_l - \log\Gamma(1/\alpha_l) - \log 2\right)}_{-\log Z_l}\right] .

The two-dimensional parameter space (\alpha,\beta) (for each latent
dimension, l) is then optimized as follows. Setting the derivative with
respect to \beta_l to zero gives a one-parameter family of curves

\beta_l = \beta_l(\alpha_l) = \frac{1}{\alpha_l\,\frac{1}{\Lambda}\sum_\lambda|c_{l,\lambda}|^{\alpha_l}} .   (4.43)

^{29} In practice, for convergence/convexity reasons, we employ a two-phase scheme: we let \alpha be 1 during the first stages of learning, and let it be freely adapted and refined at the later stages, when the algorithm has reached a stable regime.
We can then solve the one-dimensional optimization problem dF/d\alpha = 0 by
taking the total derivative

\frac{dF}{d\alpha} = \frac{\partial F}{\partial\alpha} + \frac{\partial F}{\partial\beta}\frac{d\beta}{d\alpha} = 0 ,

which reduces to the first term alone if we search along the curve
\frac{\partial F}{\partial\beta} = 0. Then,

\frac{\partial F}{\partial\alpha_l} = \frac{1}{\alpha_l} - \frac{1}{\alpha_l^2}\log\beta_l + \frac{1}{\alpha_l^2}\Psi(1/\alpha_l) - \beta_l\frac{1}{\Lambda}\sum_\lambda|c_{l,\lambda}|^{\alpha_l}\log|c_{l,\lambda}| = 0 ,   (4.44)

where \Psi(\cdot) is the digamma function. The above one-dimensional problem
can be solved using any standard nonlinear optimizer. This procedure gives
the optimal values \{(\alpha^\star_l, \beta^\star_l)\}_{l=1}^{L}.
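The profile-and-line-search procedure above can be sketched as follows. This is an illustrative stand-in only: it replaces the nonlinear root solve of Eq. (4.44) with a simple grid search over \alpha, with \beta eliminated via the curve of Eq. (4.43); all names are ours:

```python
import math

def profiled_objective(alpha, coeffs):
    """Per-coefficient NFE contribution with beta set on the curve of
    Eq. (4.43), so that dF/dbeta = 0 holds by construction; this equals
    the average log-likelihood of the generalized exponential."""
    m = sum(abs(c) ** alpha for c in coeffs) / len(coeffs)
    beta = 1.0 / (alpha * m)                       # Eq. (4.43)
    # on the profile curve, -beta * mean(|c|^alpha) = -1/alpha
    return (-1.0 / alpha + math.log(alpha) + math.log(beta) / alpha
            - math.lgamma(1.0 / alpha) - math.log(2.0))

def fit_ge_prior(coeffs, lo=0.2, hi=3.0, steps=2000):
    """Maximize the profiled objective over alpha by grid search, then
    recover the matching scale beta from Eq. (4.43)."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    alpha = max(grid, key=lambda a: profiled_objective(a, coeffs))
    m = sum(abs(c) ** alpha for c in coeffs) / len(coeffs)
    return alpha, 1.0 / (alpha * m)
```

On (near-ideal) Laplace-distributed coefficients the fit recovers a shape close to \alpha = 1, as expected.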
We will also need to compute the variance of the wavelet coefficients,
\sigma^2_{c_{l,\lambda}}. From the generalized exponential density,
c_{l,\lambda} \sim GE(\alpha_l,\beta_l) = \frac{1}{Z}e^{-|\beta_l c_{l,\lambda}|^{\alpha_l}},
with \frac{1}{Z} = \frac{\beta_l\alpha_l}{2\Gamma(1/\alpha_l)}, \forall\lambda\in\{1,\dots,\Lambda\}, we get

\sigma^2_{c_{l,\lambda}} = \left(\frac{B_l}{\beta_l}\right)^{2} ,\qquad\text{where } B_l = \left(\frac{\Gamma(3/\alpha_l)}{\Gamma(1/\alpha_l)}\right)^{1/2} .   (4.45)
Thresholds

A key feature of the model is that it automatically sets the thresholds,
\tau_l, so as to balance signal reconstruction accuracy and sparseness. The
final reconstruction is an optimal, under the model, trade-off between
noiseless/smooth and over-blurred results.
For the Laplace prior, the optimal value of the threshold, \tau^\star_l, in
the soft-thresholding rule of equation (4.33) is computed as

\tau^\star_l = v^\star_l\,\beta^\star_l ,\qquad l = 1,\dots,L ,   (4.46)

where the optimal values of the hyperparameters \beta_l and v_l (starred) are
also estimated by the model. These are computed in subsections 4.4.2 and
4.4.2, respectively.

For the generalized exponential prior, the threshold value of the shrinkage
function S_\tau : \bar{c}\mapsto c of equation (4.35) does not have a simple
expression. In this case, the 'cutoff values'
\tau^\star_l = \tau_l(\alpha^\star_l, \beta^\star_l, v^\star_l), defined as
the values of \bar{c} for which c is zero, must be computed as the roots of
the one-dimensional curve x - v\alpha\beta\,\mathrm{sgn}(x)|x|^{\alpha-1} = 0,
\beta \ge 0, where x \equiv \bar{c}. Since this is symmetric around 0, it
suffices to solve for x \ge 0. Then \mathrm{sgn}(x) = 1 and
|x|^{\alpha-1} = x^{\alpha-1}, so the final equation becomes
x - v\alpha\beta x^{\alpha-1} = 0. This is a non-linear equation in x, and we
solve it using a numerical method^{30}.
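A sketch of such a solve by bisection (our illustration, standing in for the fsolve call mentioned in the footnote; it assumes 0 < \alpha < 2 and v\alpha\beta > 0, in which case the positive root is unique and also available in closed form as (v\alpha\beta)^{1/(2-\alpha)}, used below as a check):

```python
def threshold_cutoff(alpha, beta, v, x_hi=1e6, iters=200):
    """Bisection for the cutoff tau* solving x - v*alpha*beta*x**(alpha-1) = 0
    on x > 0.  Assumes the bracket [1e-12, x_hi] straddles the root, i.e.
    f(1e-12) < 0 < f(x_hi), which holds for typical (alpha, beta, v)."""
    f = lambda x: x - v * alpha * beta * x ** (alpha - 1.0)
    lo, hi = 1e-12, x_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```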
4.4.3 Computing the SCA free energy

From the definition of the negative free energy, F, Eq. (4.15), and the
directed graphical model structure (Figs 4.2, 4.3), the negative free energy
decomposes into terms related to the observations, latent variables and model
parameters, respectively, plus an additional term that is the entropy of the
latent variables with respect to their variational posterior, Q(S). We have:

F = F_X + F_S + F_C + F_A + F_H   (4.47)

^{30} For example using the function fsolve of GNU Octave.
where the individual contributions are computed as
F_X = \sum_n\left[-\tfrac{1}{2}D\log(2\pi) - \tfrac{1}{2}D\log(\sigma^2_X) - \tfrac{1}{2}\mathrm{Tr}\left[\tfrac{1}{\sigma^2_X}\left(x_n x_n^\top - 2\,x_n\langle s_n^\top\rangle A^\top + A\langle s_n s_n^\top\rangle A^\top\right)\right]\right] ,   (4.48)

F_S = \sum_{l=1}^{L}\left[-\tfrac{1}{2}N\log(2\pi) - \tfrac{1}{2}N\log(v^\star_l) - \tfrac{1}{2v^\star_l}\left(\left(N\sigma^2_{S_{l,l}} + \mu_{s_l}\mu_{s_l}^\top\right) - 2\langle s_l\rangle\bar{s}_l^\top + \bar{s}_l\bar{s}_l^\top\right)\right] ,   (4.49)

F_C = \sum_{l=1}^{L}\left[\Lambda\left(-\log 2 - \log\Gamma(1/\alpha^\star_l) + \log\alpha^\star_l + \tfrac{1}{\alpha^\star_l}\log\beta^\star_l\right) - \beta^\star_l\sum_\lambda|c_{l,\lambda}|^{\alpha^\star_l}\right] ,   (4.50)

F_A = -\tfrac{1}{2}DL\log(2\pi) + \tfrac{1}{2}DL\log(a^\star) - \tfrac{1}{2}a^\star\sum_d\sum_l A_{dl}^2 ,   (4.51)

F_H = \tfrac{1}{2}N\left(L\log(2\pi) + L + \log|\det\Sigma_{SS}|\right) .   (4.52)
4.4.4 Algorithm

The pseudo-code for the variational mean-field Bayesian iterative
thresholding algorithm is shown below (Algorithm 3):

Algorithm 3 BIT (MF)
1:  Whiten the observations, X
2:  Let \bar{\delta F} = 10^{-7}, \delta F_{min} = 10^{-7}, t_{max,inner} = 1000, t_{max} = 10000, epoch ← 1
3:  Initialize: let S ← S^{(0)}, θ ← θ^{(0)}, Θ ← Θ^{(0)}
4:  while δF > δF_{min} and t_{sum} < t_{max} do
5:    t ← 1
6:    while δF > \bar{\delta F} and t < t_{max,inner} and t_{sum} < t_{max} do
7:      Calculate \bar{S} from Eq. 4.25
8:      Infer source sufficient statistics from Eqns 4.24 and 4.26
9:      Re-estimate the mixing matrix, A, from Eq. 4.37
10:     Calculate \bar{C} from Eq. 5.40
11:     Learn wavelet coefficients, C, from Eq. 4.30
12:     Calculate the negative free energy, F, from Eqns 4.47 to 4.52
13:     Let δF = |(F − F_{old})/F_{old}|
14:     t ← t + 1
15:   end while
16:   Let t_{sum} = t_{sum} + t
17:   Let epoch ← epoch + 1
18:   Anneal: \bar{\delta F} = 0.1 \bar{\delta F}
19:   Estimate wavelet prior hyperparameters, (α, β), from Eqns 4.43 to 5.42
20:   Estimate the sensor noise variance, Σ_X, from Eq. 4.41
21:   Estimate the system noise precision, v^\star, from Eq. 4.29
22:   Compute threshold, τ, by numerically solving x − vαβx^{α−1} = 0 and τ^\star = τ^\star(α^\star, β^\star, v^\star)
23:   Estimate the precision of the mixing matrix, a^\star, from Eqns 4.38 and 4.39
24: end while
4.5 Discussion
4.5.1 Comparison of ordinary and Bayesian iterative
thresholding
First, we note that the original version of IT, Eq. (3.17), does not regard
the sources as latent variables, but instead operates directly on the unknown
signal, s, and its wavelet transform, s. The output of the algorithm is a point
estimate of the unknown signals. Moreover, the smoothing on s is performed
(by the shrinkage rule) using prior information from the ℓp penalty only. In
the Bayesian version, some of the “roughness” in s is attributed to system
noise, and the rest to the complexity of the wavelet representation (via the
wavelet coefficients) itself.
Equation (3.17) was derived by exchanging the original optimization
functional, I, for a sequence of surrogate variational functionals, \bar{I}.
In our version, the data likelihood is lower-bounded by the negative free
energy. Due to the strict lower bound and the convergence properties of the
variational EM algorithm, the BIT algorithm is guaranteed to converge. The
original IT also has guaranteed convergence properties; our algorithm,
moreover, is guaranteed to increase the NFE at each iteration.
The estimation equations (4.24)–(4.35) and (4.40), (4.41) generalize the
equations of the deterministic iterative thresholding algorithm by including
the precisions of the sources and sensors, 1/v_l and 1/\sigma^2_X,
respectively. The deterministic version of this algorithm does not take
uncertainty into account, and does not compute expectations of variables over
their corresponding posteriors: the equivalent of Eqns (4.24) and (4.26) in
standard IT is a 'point estimate' of the unknown signal, s, at the current
iteration step^{31}. The update equations (4.24) and (4.26), on the other
hand, provide a "softer" estimate of s, by including an estimate of the
uncertainty in the sensors and sources.
Remark 18 An interpretation of Eq. 4.24 (see also Fig. 3.10) is the follow-
ing. Each node sl combines two sources of information: one from its parent
nodes, ϕλ, representing our prior model/assumptions for the unknown sig-
nals, weighted by cl,λ and captured in the ‘message’ s, and one from its
^{31} Essentially, the original IT algorithm iterates the current value of the unknown signal via the fixed-point equation s^{(n+1)} = F(s^{(n)}), where the argument to the thresholded Landweber operator, F, is the previous value of the sought-for (unknown) signal; see Eqns 3.17 and 3.19. It does not "fuse" information from other parts of the model.
children nodes, x_d, representing the measured data, weighted by A_{d,l} and
captured in the residuals, r. It then weighs those messages according to
their precisions, 1/v_l and 1/\sigma^2_X, respectively, in order to fuse
these two sources of information together and update its "state", captured in
its sufficient statistics.
Second, in this text we are interested in blind inverse problems; therefore,
we compute an estimate of the observation operator, A, itself. The Bayesian
approach allows us to treat the mapping from unknowns to observations (i.e.
the mixing matrix) as an uncertain quantity as well, and put a prior on it. It
also allows us to compute its width hyperparameter directly from the data.
We also estimate the parameters of the observation noise model, ΣX.
Third, the original variational functional E is parameterised by two La-
grange multipliers, λC and λA, for the two penalty terms corresponding to
C and A, respectively: setting these correctly is crucial for the successful
estimation of the unknowns. Importantly, we also estimate the hyperparam-
eters of the priors, (αl, βl), and the hyperparameter v of the ‘system noise’
model, which determines the strength (variance) of the noise on the sources.
The scale parameters βl and a determine the amount of penalization of the
two priors, and they are the probabilistic quantities corresponding to the La-
grange multipliers of the original functional, I. These in turn determine the
thresholds, \tau_l = \tau_l(\alpha_l, \beta_l, v_l), which are of the utmost
importance for the precise reconstruction of the unknown signals.
4.6 Experimental Results
4.6.1 BIT-SCA: Simulation Results
This subsection illustrates blind source separation using the BIT-SCA
algorithm on the ThreeSources dataset of [147], [148], for a typical run. Two
fragments of music (512 samples) sampled at f_s = 11.3 kHz were mixed with a
fragment of an EEG signal and corrupted with random Gaussian noise with power
10% of the signal power, to give three observed signals. This is intended as
a comprehensive investigation of the behaviour of the algorithm with respect
to noise, sparsity, etc.
The algorithm was initialized using a sphering transformation. In par-
ticular, the wavelet coefficients of the sphered observations were computed
and the initial value of the threshold, τ , was set using the ‘universal thresh-
olding’ method of Donoho [54]. The initial value of the sensor noise variance
was computed from the median absolute deviation of the coefficients at the
smallest scale (usually regarded as noise in the classical approach), and the
source noise variance was set to a fraction (e.g. 1/10) of the sensor noise
variance. The Daubechies D8 wavelet was used for this test; we found that it
provides a good balance between local support and smoothness.
We applied an adaptive convergence-monitoring scheme. In particular, an
"annealing" scheme was used that controlled the convergence tolerance of the
algorithm, defined as the difference in the value of the negative free
energy, F, between two iterations: at initialization the tolerance was set to
a small value (10^{-4}, say) by default, and it became stricter by a power of
10 at each subsequent epoch. This allowed freer exploration during the first
stages of a run and more restricted exploration later, as we converged to the
solution. Neither the number nor the length of the epochs was set in advance;
instead, they were adaptively reset at every run by this scheme. During the
first epoch, we fix
the hyperparameter \alpha_l to 1. The algorithm converged after 1277
iterations, in less than 10 s on a 2 GHz machine^{32}.
The reconstruction of the sources under 10% additive random noise is
shown in figure 4.7. The results show that BIT-SCA recovered the original
sources with high accuracy. Figure 4.8 shows the scatterplot of the original
vs the reconstructed sources, \{(s_{l,n}, \hat{s}_{l',n})\}_{n=1}^{N}, where l, l' = 1,\dots,L.
Numerical measures of source reconstruction are shown next. The correlation
coefficients between the true and reconstructed sources are r_{11} =
0.964401, r_{22} = 0.965299, r_{33} = 0.962916 (cf. Fig. 4.8). The relative
modelling error, \varepsilon_\eta = \|s_i - \hat{s}_i\|/\|s_i\|, was: s_1:
0.267, s_2: 0.265, s_3: 0.273. The source reconstruction error,
\varepsilon_{rec}, was: s_1: 0.071084, s_2: 0.069898, s_3: 0.074145. Finally,
the crosstalk between the sources, \varepsilon_{xtalk}, was 0.028913. The
sparsity of the sources is s_1: 53.71%, s_2: 58.01%, s_3: 67.97% non-zero
coefficients, respectively.
The wavelet-based prior leads to a very sparse reconstruction of the
sources, as is evident from the histograms of the estimated coefficients,
\{c_{l,\lambda}\}_\lambda. Figure 4.9 shows a stem-plot of the coefficients,
and figure 4.10 their histograms. In particular, the hyperparameters of the
wavelet model were estimated as (\alpha_1, \beta_1) = (0.28745, 5.6829),
(\alpha_2, \beta_2) = (0.28457, 5.7235), and (\alpha_3, \beta_3) = (0.38163, 4.3601).
Figure 4.11 shows the evolution of the elements, \hat{I}_{ii'}, i, i' = 1,\dots,L, of
the matrix

\hat{I} \triangleq \left(\left(\hat{A}^\top\hat{A}\right)^{-1}\hat{A}^\top\right)A

during the run; perfect estimation gives a unit matrix. The matrix after

^{32} Using GNU Octave ver. 3.2.3 and the ATLAS libraries running under Linux.
Figure 4.7: 'ThreeSources' dataset, 10% noise. Source reconstruction using the BIT-SCA algorithm (panels: Sources 1–3, true vs estimated). In order to ease comparison, the reconstructed sources are overlaid with the original ones.
Figure 4.8: 'ThreeSources' dataset, 10% noise. Scatterplot of original vs reconstructed sources via BIT-SCA.
Figure 4.9: 'ThreeSources' dataset, 10% noise. Wavelet coefficients of the reconstructed sources (Sources 1–3).
Figure 4.10: 'ThreeSources' dataset, 10% noise. Empirical histograms of the reconstructed sources estimated by BIT-SCA (Sources 1–3).
Figure 4.11: 'ThreeSources' dataset, 10% noise. Evolution of the elements, \hat{I}_{ii'}, of the matrix \hat{I} = (\hat{A}^\top\hat{A})^{-1}\hat{A}^\top A_{true} during a typical BIT-SCA run.
convergence is

\hat{I} = \begin{pmatrix} 0.979 & 0.011 & -0.011 \\ 0.018 & 0.996 & 0.007 \\ -0.034 & -0.012 & 0.970 \end{pmatrix} .
The final angle difference between the original and estimated mixing vectors,
\theta_{ll} = \angle(a_l, \hat{a}_l), is \theta_{11} = 2.674, \theta_{22} =
0.894, \theta_{33} = 0.629. Figure 4.12 shows the evolution of the Amari
distance, D_A [81], during the run, measuring the distance between the true
and estimated matrix, up to permutations and scalings:

D_A(A,B) = \frac{1}{2N}\left(\sum_{i=1}^{N}\sum_{j=1}^{N}\left[\frac{|(AB^{-1})_{ij}|}{\max_{j'}|(AB^{-1})_{ij'}|} + \frac{|(AB^{-1})_{ij}|}{\max_{i'}|(AB^{-1})_{i'j}|}\right] - 2N\right) .
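The definition above can be coded directly (an illustrative helper, ours; it assumes B is invertible):

```python
import numpy as np

def amari_distance(A, B):
    """Amari distance between mixing matrices A and B: zero iff A B^{-1}
    is a scaled permutation, i.e. B equals A up to row permutation/scaling."""
    P = np.abs(A @ np.linalg.inv(B))
    N = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum()   # row-wise normalization
    cols = (P / P.max(axis=0, keepdims=True)).sum()   # column-wise normalization
    return (rows + cols - 2 * N) / (2 * N)
```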
Finally, figure 4.13 shows the evolution of the negative free energy, F,
Figure 4.12: Learning curve. 'ThreeSources' dataset, 10% noise. Evolution of the Amari distance, D_A, between the true and estimated mixing matrix during a typical BIT-SCA run; final value D_A = 0.01563.
during the run. Note that the jumps at iteration numbers 406 and 814 are due
to the re-estimation of the hyperparameters at the end of each epoch, and
that F is always monotonically increasing, with F_{final} = -0.19199 \times 10^4.
Figure 4.13: 'ThreeSources' dataset: BIT-SCA: evolution of the negative free energy, F, over 1277 iterations (3 epochs); F_{final} = -0.19199 \times 10^4.
4.6.2 Extraction of weak biosignals in noise
We consider again the application of BIT-SCA to the functional MRI dataset
of Friston, Jezzard, and Mathews [71], described in Sect. 3.4.3. We ran the
BIT-SCA algorithm on the dataset in order to extract the Consistently
Task-Related (CTR) components. The component whose corresponding timecourse
correlated most with the stimulus/expected timecourse was identified as the
CTR component, provided that its correlation coefficient, r, was higher than
0.3. The resulting spatial map and corresponding timecourse are shown in
Figs 4.14 and 4.15. The correlation coefficient was r = 0.941 for this
experiment.
To quantitatively assess the quality of the separation we plotted the ROC
curve using the result from the GLM as our ground truth. This is shown in
Fig. 4.16. The AUC was A = 0.893 for this experiment. We can see that the
result is better, to some degree, than the energy-based IT algorithm (where
Figure 4.14: Spatial map s_1 from BIT-SCA. The component whose corresponding timecourse correlated most with the stimulus/expected timecourse was identified as the CTR component.
Figure 4.15: Timecourse a_1 from BIT-SCA, corresponding to the spatial map s_1 (thick curve). The time-paradigm, \varphi(t), is the blue curve, and the expected timecourse, h(t)\otimes\varphi(t), is the green one.
Figure 4.16: Receiver Operating Characteristic (ROC) curve between the BIT-SCA-derived CTR map and the corresponding map from SPM.
the corresponding results were r = 0.940 and AUC A = 0.861). That is, the
correlation coefficient for the time-course was only slightly better, while the
AUC for the spatial map was improved. This can be attributed to the use of
the more adaptive model on the wavelet coefficients of the spatial maps and
the inclusion of the state noise model.
Chapter 5
Variational Bayesian Learning
for Sparse Matrix Factorization
In this Chapter1 we present a fully Bayesian bilinear model for factorizing
data matrices under sparsity constraints with special emphasis on spatio-
temporal data such as fMRI. We exploit the sparse representation of the
spatial maps in a wavelet basis and learn a set of temporal “regressors” that
best represent, i.e. are maximally relevant to, the source timecourses.
5.1 Sparse Representations and Sparse Ma-
trix Factorization
A fundamental operation in data analysis often involves the representation of
observations in a ‘latent’ signal space that best reveals the internal structure
1A shorter version of this Chapter was published as E. Roussos, S. Roberts, and I.Daubechies. Variational Bayesian Learning of Sparse Representations and Its Applicationin Functional Neuroimaging. In Springer Lecture Notes on Articial Intelligence (LNAI)Survey of the State of The Art Series, 2011.
180
of the data. Inversion of the standard linear model of component analysis,
viewed as a data representation model, forms a mapping from the space of
observations to a space of latent variables of interest. In particular, in its use
as a multi-dimensional linear model for source separation, the observation
space, spanned by the canonical basis2 of RD (for a D–dimensional obser-
vation space), is transformed into a space spanned by the columns of the
‘mixing’ matrix. This new basis should, hopefully, reveal an intrinsic quality
of the data. In the linear model for brain activation [172], for example,
the spatio-temporal data, X = X(t, v), where v is a voxel in a brain volume,
\Omega \subset \mathbb{R}^3, and t is a time-point (t = 1,\dots,T), is
modelled as a linear superposition of different 'activity patterns':

X(t, v) \approx \sum_{l=1}^{L} S_l(v)\,A_l(t) ,   (5.1)
where A_l(t) and S_l(v) respectively represent the dynamics and spatial
variation of the lth activity. Our goal is the decomposition of the data set
into spatio-temporal components, i.e. pairs \{(a_l, s_l)\}_{l=1}^{L}, such
that the "regressors" \{a_l\}_{l=1}^{L} capture the 'time courses' and the
coefficients \{s_l\}_{l=1}^{L} capture the 'spatial maps' of the patterns of
activation.
as the general linear model (GLM), in the ‘model-free’ case [15], addressed
here, both factors must be learned from data, without a-priori knowledge of
their exact spatial or temporal structure. The problem described above is
ill-posed, however, without additional constraints. These constraints should
capture some domain-specific knowledge (and be expressible in mathemat-
ical form). In the context of neuroimaging data analysis, Woolrich et
al. [171] developed a Bayesian spatio-temporal model for functional MRI
employing adaptive haemodynamic response function (HRF) inference and
^{2} The canonical basis, B_I, of \mathbb{R}^D is comprised of the vectors \{e_d\}_{d=1}^{D}, where e_d is the unit vector (0,\dots,0,1,0,\dots,0), with 1 in the dth position and 0 elsewhere.
noise modeling in the temporal and spatial domain, accommodating noise
non-stationarities in different brain areas.
The main tool for exploratory decompositions of neuroimaging data into
components currently in use is independent component analysis (ICA). As its
name suggests, ICA forces statistical independence in order to derive maxi-
mally independent components. The bilinear decomposition described above
is related to the blind source separation (BSS) problem [125], [15]. Indeed,
blind separation of signals into components is inherently a bilinear decom-
position. Nevertheless, many component analysis algorithms for solving the
BSS problem, such as classical “independent” (least-dependent) component
analysis, make various simplifying assumptions in order to solve the mul-
tiplicative ill-posedness of bilinear decomposition [87], [148]. Infomax ICA
and FastICA, the two most popular ICA algorithms for fMRI data analysis,
restrict the problem to the noiseless, square case and estimate an unmixing
matrix using an optimization criterion that maximizes statistical indepen-
dence among the components, or, in practice, appropriate proxy functionals
[16], [89]. This matrix is then fixed and used to estimate the ‘sources’ by
operating directly on the observation signals. Probabilistic ICA (PICA) [15],
a state-of-the-art model developed for the analysis of fMRI data, first reduces
the dimensionality of the problem by estimating a lower-dimensional signal
subspace using Bayesian model order selection for PCA [126], regarding the
additional dimensions as noise, and then performs ICA on it.
However, despite its success, there are both conceptual and empirical is-
sues with respect to using the independence assumption as a prior for brain
data analysis [123], [46]. In particular, as discussed previously in this Thesis,
there is no physical or physiological reason for the components to correspond
to different activity patterns with independent distributions. Recent exper-
imental and theoretical work in imaging neuroscience [46], [30] reveals that
activations inferred from functional MRI data are sparse, in some appropri-
ate sense. In fact, in Daubechies et al. [46] the key factor for the success of
ICA-type decompositions, as applied to fMRI, was identified as the sparsity
of the components, rather than their mutual independence. We shall exploit
this sparseness structure for bilinear decomposition.
In matrix form, the problem of sparsely representing the data set X becomes a problem of sparse matrix factorization of its corresponding data matrix, X (Srebro and Jaakkola [159], Dueck et al. [55])3. "Unfolding" the volume of each scan, {X(t,v)}_{v∈V}, at time-point t, as a long vector4 x_t ∈ R^N, where N = |V| is the total number of voxels, and storing it into the corresponding row of X, for all t = 1, . . . , T, the aim is to factorize the resulting T × N data matrix,

X_{T\times N} =
\begin{pmatrix}
x_{1,1} & \cdots & x_{1,N} \\
\vdots  & \ddots & \vdots  \\
x_{T,1} & \cdots & x_{T,N}
\end{pmatrix},
into a pair of matrices, (A, S), plus noise (see also Fig. 5.1):

X_{T\times N} \approx A_{T\times L} \cdot S_{L\times N} . \qquad (5.2)
The set {a_l}_{l=1}^{L} of prototypical timeseries contains the basis and spans the column space of the data matrix. The nth column of S, s_n, contains the coefficients {s_{l,n}}_{l=1}^{L}, which combine these vectors in order to 'explain' the nth column of X. Dually, the lth row of S, s_l^T, contains the lth spatial
3 Srebro and Jaakkola and Dueck et al. consider the sparse representation case in which we seek a minimal number of representation coefficients. This corresponds to an ℓ0 penalty. We do not restrict ourselves to this case in this Chapter, but rather learn the penalty adaptively from the data.
4 The unfolding operation, V ↦ I_x, where V is a volume and I_x denotes the set of linear indices of the vector x, is a bijection. We can always recover the volume-valued data using the inverse operation.
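To make the unfolding operation concrete, the following minimal numpy sketch flattens a hypothetical stack of T volumes into the T × N data matrix and recovers it with the inverse operation; the grid sizes are illustrative, not from the thesis.

```python
import numpy as np

# Sketch of the unfolding operation described in footnote 4: each scan,
# a 3-D volume on an (nx, ny, nz) grid, becomes one row of the T x N
# data matrix X. All sizes here are hypothetical.
T, nx, ny, nz = 5, 4, 4, 3
volumes = np.random.randn(T, nx, ny, nz)

N = nx * ny * nz                      # total number of voxels, N = |V|
X = volumes.reshape(T, N)             # row t is the unfolded scan at time t

# The unfolding is a bijection: the inverse map recovers the volumes.
recovered = X.reshape(T, nx, ny, nz)
assert np.array_equal(recovered, volumes)
```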
Figure 5.1: Matrix factorization view of component analysis, as a bilinear decomposition model for spatio-temporal data. The T × N observation matrix, X, is decomposed into a pair of matrices, (A, S), plus noise, such that the matrix A contains the dynamics of the components, {A_l(t)}_{l=1}^{L}, in its columns and the matrix S the spatial variation, {S_l(v)}_{l=1}^{L}, in its rows. Both A and S are to be estimated by the model. Our aim is to perform a sparse matrix factorization of X, such that both the basis set, {a_l}_{l=1}^{L}, and the resultant maps, s_l, for all l, are sparse in some appropriate sense.
map. Our objective is to invert the effect of the two factors, coupled by
Eq. (5.2). In addition, if the data lies in a lower-dimensional subspace of
the observation space, that is to say, the number of latent dimensions is less
than the number of observation dimensions, L < min{T, N}, as is very often
the case in real-world datasets, then the method should also seek a low-rank
matrix factorization.
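As a quick numerical illustration of the low-rank situation (a sketch with synthetic sizes, not thesis data): data generated by a bilinear model with L latent dimensions has only L non-negligible singular values, so L < min{T, N} is visible in the spectrum.

```python
import numpy as np

# Synthetic bilinear data X = A S with L latent dimensions, L < min{T, N}:
# only L singular values of X are non-negligible, so the factorization can
# (and should) be low-rank. Sizes are illustrative.
rng = np.random.default_rng(0)
T, N, L = 30, 200, 4
A = rng.normal(size=(T, L))          # time-courses
S = rng.normal(size=(L, N))          # spatial maps
X = A @ S                            # noiseless bilinear data

sv = np.linalg.svd(X, compute_uv=False)
rank = int(np.sum(sv > 1e-8 * sv[0]))
assert rank == L                     # only L directions carry energy
```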
The sparse decomposition problem, modelled as a blind source separa-
tion problem, can in some cases be solved by algorithms such as independent
component analysis, viewed as a technique for seeking sparse components,
assuming appropriate distributions for the sources. Indeed, as shown by
Daubechies et al. [46], ICA algorithms for fMRI work well if the compo-
nents have heavy-tailed distributions. However, ICA approaches are not
optimized for sparse decompositions. As also demonstrated in [46], analyses
based on the concept of mathematical independence are not well adapted
for fMRI data. Surprisingly, ICA failed to separate components in experi-
ments in which they were designed to be independent. Indeed, the key factor
for the success of ICA-type decompositions, as applied to fMRI, was identified as the sparsity of the components, rather than their mutual independence.
More importantly, as previously said, from a neuroscientific point of view
there is no physical or physiological reason for the spatial samples to cor-
respond to different activity patterns Sl(v) with independent distributions.
The complexity of the observed brain processes is highly unlikely to be due
to underlying causes that are independent, in the mathematical definition of
the term. There is no physical reason why these latent processes should not
interact in complex and unforeseen ways.
We propose to construct a matrix factorization model that explicitly op-
timizes for sparsity, rather than independence. Several neuroscientific as
well as mathematical results support the choice of the sparsity constraint.
As pointed out by Li, Cichocki and Amari [112], [111], sparse represen-
tation can be very effective when used in blind source separation. Sparse
representation is also an area of considerable recent research in the machine
learning and statistics communities, with algorithms such as basis pursuit
[32], LASSO [164], and elastic net [178].
In the field of vision, Field studied the relationship between the statistics
of natural scenes and coding of information in the visual system [63]. He
argues that the brain has evolved strategies for efficient coding by exploiting
the sparse structure available in the input. In particular, it optimizes the rep-
resentation of its visual environment by performing a sparse representation
of natural scenes in overcomplete, wavelet-like “bases” [137]. We continue on
this theme of sparse representation. Our problem is different, however: we
want to extract whole images (see also Donoho, [50]) from spatio-temporal
data, and not local filters, and the bases are time-courses (and spatial-maps)
in our case. Carroll et al. [30] use the elastic net algorithm for predic-
tive modelling of fMRI data with the aim to obtain interpretable activations.
The built-in sparsity regularization of the elastic net facilitates robust, i.e.
repeatable, results that are also localized in nature.
Sparsity can be modelled and imposed by choosing appropriate pri-
ors. Many authors (see for example [111], [154]) choose the Laplace or
bi-exponential prior, P(s) ∝ e^{−β|s|}, corresponding to an ℓ1 penalty on the
values of the latent variables (sources), based on convexity arguments. A
classical mathematical formulation of the sparse decomposition problem is
then to set up an optimization functional such as [112]

I = \|X - AS\|^2 + \lambda_S \sum_{l=1}^{L} \sum_{n=1}^{N} |s_{l,n}| + \lambda_A \|A\|^2 ,
containing an ℓ1 penalty/prior, enforcing sparse representations, where λS,
λA are regularization parameters. Several important results have been ob-
tained for sparse linear models. Donoho and Elad (2003) [51] give an elegant
proof of the equivalence of ℓ1- and ℓ0-norm minimization and the conditions under which it holds, using a deterministic approach. Under this framework, how-
ever, the signals have to be highly sparse (e.g. having less than five nonzero
entries) for arbitrary, non-orthogonal bases [111]. This can be too strict an
assumption for real-world signals. Li et al. (2004) extend their results from a
mathematical statistics perspective. However, an ℓ1–style regularization may
not always be flexible enough as a penalty in practice, as it is not adaptive
to the sparsity of the particular latent signals at hand. Adaptive estimation
may lead to very sparse representations. We discuss several alternatives in
section 5.2.2.
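For concreteness, one common way to descend a functional of this form is to alternate a soft-thresholding (ISTA-type) step in S, the proximal operator of the ℓ1 penalty, with a ridge-regularized least-squares update of A. The sketch below is illustrative only (synthetic data, hand-picked λ values and iteration count), not the algorithm developed in this chapter:

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding: the proximal operator of the l1 penalty."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# Illustrative alternating descent on
#   I = ||X - A S||^2 + lam_S * sum |s_ln| + lam_A * ||A||^2
# with synthetic data; lambda values and iteration count are hand-picked.
rng = np.random.default_rng(1)
T, N, L = 20, 100, 3
A_true = rng.normal(size=(T, L))
S_true = rng.normal(size=(L, N)) * (rng.random((L, N)) < 0.1)  # sparse maps
X = A_true @ S_true + 0.01 * rng.normal(size=(T, N))

lam_S, lam_A = 0.1, 0.1
A = rng.normal(size=(T, L))
S = np.zeros((L, N))
for _ in range(200):
    # ISTA step in S: gradient step on the quadratic term, then shrink.
    step = 1.0 / np.linalg.norm(A.T @ A, 2)
    S = soft(S - step * (A.T @ (A @ S - X)), step * lam_S)
    # Ridge-regularized least squares in A (from the lam_A ||A||^2 term).
    A = X @ S.T @ np.linalg.inv(S @ S.T + lam_A * np.eye(L))

residual = np.sum((X - A @ S) ** 2)
```

Note that λ_S and λ_A are fixed by hand here; the Bayesian treatment developed below instead learns the corresponding prior parameters from the data.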
When both the basis and the hidden signals are unknown, the problem becomes harder. Several dictionary-learning algorithms for sparse representation have been proposed around the 'FOCal Underdetermined System Solver' (FOCUSS) algorithm, leading to sparsity by optimizing the 'diversity' measure, which is the number of non-zero elements in a sequence. In particular,
Kreutz-Delgado et al. [100] improve on this idea and provide maximum likeli-
hood and maximum a-posteriori algorithms using the notion of Schur concav-
ity, i.e. concavity combined with permutation invariance. However, generally only a locally optimal solution can be obtained, as pointed out by Li et al. [111].
In Li et al. [112] learning of the basis, B = {a_l}, was performed as an ex-
ternal step, via the k–means algorithm. A solution was then obtained using
linear programming. Uncertainties in the model parameters and model selec-
tion issues were not given full consideration. More realistic models should
include a way for handling noise and uncertainty, however, and seamlessly
fuse information from other parts of the model. In this chapter we approach
the problem from a Bayesian perspective and propose a fully Bayesian hier-
archical model for sparse matrix factorization.
Bayesian inference provides a powerful methodology for machine learn-
ing by providing a principled method for using our domain knowledge and
allowing the incorporation of uncertainty in the data and the model into
the estimation process, thus preventing ‘overfitting’. The outputs of fully
Bayesian inference are posterior probability distributions over all variables
in the model.
A key property characterizing sparse mixtures is the concentration of
data points around the directions of the mixing vectors [176]. This phe-
nomenon is depicted in Fig. 5.2. It is this property that allows some algo-
rithms to exploit clustering in order to estimate those ‘unmixing’ directions
from the data (see e.g. Li et al. [111]). However, this property of the data
may not be immediately evident, as the sparsity of the observations often
only emerges in a different basis representation. In this text we consider
an extension to the view of analysis into components as data representa-
tion presented above by forming a representation to an intermediate, feature
space (Zibulevsky et al., [177]; Li et al., [111]; Turkheimer et al., [168]). This
is in essence a structural constraint, consistent with the Bayesian philoso-
Figure 5.2: Geometric interpretation of sparse representation. State-spaces and projections of two-dimensional datasets are shown. For visualization purposes, both observation and latent dimensionalities are equal to 2 in this figure. Left: non-sparse data. Right: sparse data. Sparse data mapped in the latent space produce a heavy-tailed distribution for both axes. (Figure from D. J. Field, Phil. Trans. R. Soc. Lond. A, 1999.)
phy of data analysis, in the sense that the features/building blocks (also
called ‘primitives’, in pattern recognition terminology) match our notion of
the desired properties of the sought-for signals. The model we consider here, therefore, is a sparse linear model in which the observations themselves are assumed to have a sparse representation in an appropriate signal dictionary. We derive
a “hybrid”, feature-space sparse component analysis model, transforming the
signals into a domain where the modeling assumption of sparsity of the ex-
pansion coefficients with respect to the dictionary is natural [176], [91]. The
above idea gives us the advantage to derive a component analysis algorithm
that enforces leptokurtic5 (also known as super-Gaussian) distributions on
5 The kurtosis of a probability density is a shape descriptor of the density and, in particular, a measure of its "peakedness". It can be defined in terms of its moments, as μ_4/(σ^2)^2, where μ_4 = E[(x − μ)^4], σ^2 = E[(x − μ)^2], and μ = E[x]. From this, the excess
the sources. Utilizing the central limit theorem, we can be sure that since
the observations (i.e. intermediate variables here) are described by highly-
peaked distributions centered at zero, the “true” sources will be described
by even more highly-peaked ones. In addition to the sparsity constraint
on the spatial maps, we will seek a parsimonious representation in the sense
that we will require the set of ‘active’ (relevant) regressors describing each
dataset to be minimal.
We derive a sparse decomposition algorithm viewing bilinear decomposi-
tion as a Bayesian generative model. In contrast to most component analysis
models for fMRI, where an unmixing matrix is learned and from there the
sources are deterministically estimated, our model solves a bilinear problem,
simultaneously performing inference and learning of the basis set and the
latent variables. By adopting a fully Bayesian approach, this component
analysis model is a probabilistic model proper: it describes the statistics of
the observation and source signals, instead of some instances of those sig-
nals. We employ hierarchical source and mixing models, which result in
automatic regularization. This allows us to perform model selection in order
to infer the complexity of the decomposition, as well as automatic denoising.
The model also contains an explicit noise model, making the relation of the
sources to the observations stochastic. The benefit of this is that obser-
vation noise is prevented from “leaking” into the estimated components, by
effectively utilizing an implicit filter automatically learned by the algorithm.
We follow a graphical modelling formalism viewing sparse matrix fac-
torization as a probabilistic generative model. Graphical models are repre-
sentations of probability distributions utilizing graph theory, with semantics
reflecting the conditional probabilities of sets of random variables.
In many modern real-world applications of machine learning, such as
kurtosis, μ_4/(σ^2)^2 − 3, can be computed. The excess kurtosis of the Gaussian distribution is zero. A distribution with positive excess kurtosis is called leptokurtic.
biosignal analysis or geoscientific data processing, the observation matrix often comprises very large, high-dimensional datasets, e.g. observation
dimensions on the order of a few hundreds and hundreds of thousands of
samples. This makes data decomposition/separation very computationally
demanding. A tradeoff should be made, therefore, between mathematical
elegance and computational efficiency. Here, we propose a practical model
under a fully Bayesian framework that captures the requirement of sparse
decomposition while providing analytic expressions for the estimation equa-
tions. Since exact inference and learning in such a model is intractable,
we follow a variational Bayesian approach in the conjugate-exponential fam-
ily of distributions, for efficient unsupervised learning in multi-dimensional
settings, such as fMRI6. The sparse bilinear decomposition algorithm also
provides an alternative to the LASSO [164] and to linear programming or gra-
dient optimisation techniques for blind linear inverse problems with sparsity
constraints.
This text is structured as follows. We first introduce SMF as a general
linear decompositional model for spatio-temporal data under view (1). Then
we present the hybrid wavelet component analysis model emphasizing view
(2) and various ways of modelling sparsity in a graphical modelling frame-
work, as well as the details of inference in this model. Finally, we present
some representative results followed by conclusions and discussion.
5.2 Bayesian Sparse Decomposition Model
We start by forming a representation to an “intermediate” space spanned
by a wavelet family of localized time-frequency atoms. The use of wavelets
in neuroimaging analysis has become quite widespread in recent years, due
6 See, however, the article of Woolrich et al. [171] for an alternative fully Bayesian algorithm, using MCMC sampling.
to the well-known property of wavelet transforms to form compressed, mul-
tiresolution representations of a very broad class of signals. Sparsity with
respect to a wavelet dictionary means that most coefficients will be “small”
and only a few of them will be significantly different from zero. Furthermore,
due to the excellent approximation properties of wavelets, in the standard
‘signal plus noise’ model, decomposing data in an appropriate dictionary will
typically result in large coefficients modelling the signal and small ones cor-
responding to noise. In addition, the wavelet transform largely decorrelates
the data. The above properties should be captured by the model.
Following Turkheimer et al. [168], we perform our wavelet analysis in the
spatial domain; for examples of the use of wavelets in the temporal dimension
see, for example, Bullmore et al. [21]. Roussos et al. [152], [153] and Flandin
and Penny [66] have also used spatial wavelets for decomposing fMRI data
using variational Bayesian wavelet-domain ICA and GLMs, respectively. The
representation of a signal, f , is formed by taking linear projections of the data
on the atoms, ψλ, of the dictionary, Φ, which, for the sake of simplicity, is
assumed for now to be orthonormal and non-redundant (consequently, Φ^{−1} = Φ^T). These projections are the inner products, that is, the correlations, of
the corresponding signals (functions) with each ψλ ∈ Φ, ∀λ. Then, the
wavelet expansion of the signal is f = Φw, where wλ = 〈f ,ψλ〉 is the λth
wavelet coefficient and 〈·, ·〉 here denotes the inner product. Wavelets form
unconditional bases7 for a class of smoothness function spaces called Besov
spaces. By transforming the signals in wavelet space, and regarding small
wavelet coefficients as “noise”, we essentially project signals in smoothness
space (Choi and Baraniuk, [34]). Using this representation for all signals
7 A basis {g_k} is unconditional if the series ∑_k w_k g_k converges unconditionally (for finite w_k). The unconditionality property allows us to ignore the order of summation.
in the model, we get the noisy observation equation in the wavelet domain
X = AC + E , (5.3)
where the matrix X = [x_t^T]_{t=1}^{T} denotes the transformed observations, C = [c_l^T]_{l=1}^{L} the (unknown) coefficients of the wavelet expansion of the latent signals [s_l^T]_{l=1}^{L}, and E ∼ N(0_{T×N}, (R^{−1}I_T, I_N)) is a Gaussian noise process. Now separation
will be performed completely in wavelet space. After inference, the compo-
nents will be transformed back in physical space, using the inverse transform.
Furthermore, we can use any of the standard wavelet denoising algorithms
in order to clean the signals. We note, however, that as we have an explicit
noise process in the component analysis model, small wavelet coefficients are
explained by the noise process rather than included in any reconstructed
source. We thus achieve a simultaneous source reconstruction and denoising.
5.2.1 Graphical modelling framework
The probabilistic dependence relationships in our model can be represented
in a directed acyclic graph8 (DAG) known as a Bayesian network. In this
graph, random variables are represented as nodes and structural relationships
between variables as directed edges connecting the corresponding nodes. In-
stantiated nodes appear shaded. Repetition over individual indices is denoted
by plates, shown as brown boxes surrounding the corresponding random vari-
ables. The graphical representation offers modularity in modelling and effi-
cient learning, as we can exploit the local, Markovian structure of the model,
captured in the network, as will be shown next.
DAGs are natural models for conditional independence. In particular,
8 A directed acyclic graph is a graph with no cycles. A cycle is a path, i.e. a sequence of nodes, in which the end points are allowed to be the same.
let the random variables in a probabilistic model belong to a set V. Our
goal is to compute the joint probability p(V). Each arc, which is an ordered
pair (v, u) ∈ E ⊆ V × V, represents a dependence relation between v and u,
denoted by an arrow: u −→ v. A graph, G, is then the pair (V, E). According
to our probabilistic specification, a variable v ∈ V may depend on a subset
of variables in V, called its ‘parents’ and denoted by pa(v) ⊆ V. The parent
set is the smallest set for which p(v|V \ v) = p(v|pa(v)). We endow the
DAG with probabilistic semantics by identifying the dependence relation
with the conditional probability p(v|pa(v)). Each individual parent-child
relation, defined on the ordered pairs (u, v) with u ∈ pa(v) is then an edge in
a Bayesian network. A probability density, p, on V is said to be compatible
with a Bayesian network if it satisfies all independence relationships among
nodes implied by the DAG. The important point here is that there is a natural
topological structure in V, induced by the dependence relation ‘v depends on
pa(v)’. This topological structure allows us to efficiently compute the joint
distribution, p(V), as a structured probability distribution on the DAG, G. In particular, the following directed Markov property holds:

p(V) = \prod_{v \in V} p(v \mid \mathrm{pa}(v)) , \qquad (5.4)
which allows us to factorize p(V) using only local relations: each random
variable “is influenced only by its parents”, or more formally, each node is
conditionally independent of its non-descendants given its parents. Note
that this factorization is exact and follows from the graphical model specifi-
cation; no simplifications were made here.
The generic graphical model for feature-space component analysis is shown in Fig. 5.3. The latent variables (the unknown coefficients of the "sources") are denoted by {c_{l,λ}}_{l=1}^{L} and the "observations" (i.e. the data transformed in
Figure 5.3: The probabilistic graphical model for feature-space component analysis. The bi-partite graph, {c_{l,λ}}_{l=1}^{L} → {x_{t,λ}}_{t=1}^{T}, where λ indexes the features, will play an essential role in this model. The brown rectangle represents a 'plate': it captures repetition over λ. The nodes {a_t}_{t=1}^{T}, where each a_t is a vector of the form a_t = (a_{1,t}, . . . , a_{l,t}, . . . , a_{L,t}), represent time-"slices" of the time-courses across all latent dimensions. Finally, R is the inverse noise level on the (transformed) observations.
feature space) by {x_{t,λ}}_{t=1}^{T}, where λ is a particular data point, λ = 1, . . . , Λ. The observations are modelled as a linear mixture of the sources, plus noise, via an unknown linear operator A (whose rows are a_1^T, . . . , a_t^T, . . . , a_T^T), such that x_{t,λ} ≈ ∑_{l=1}^{L} A_{t,l} c_{l,λ}.
A point worth emphasizing, which applies to all component analysis problems, is that the assumption of independence over the latent variables, l, common to many ICA algorithms, refers to the a-priori source distribution, i.e. the model density p(s), regardless of whether it is explicitly stated or not. The sources given the data, p(s|x), that is, after we have learned the model, will be coupled a-posteriori. This is the 'explaining-away' phenomenon, and it is probably best seen by observing the directed graph (Bayesian network) that corresponds to a component analysis model: see Fig. 5.3. The dependency properties (semantics) of this Bayesian network, i.e. the 'fan-in' structure, imply that, while one may choose a factorized, over l, prior model for the latent variables, p(s) ≜ ∏_l p_l(s_l), implying a-priori statistical independence, these become coupled, i.e. dependent, after we obtain the data: p(s|x) ≠ ∏_l p_l(s_l|x).
Since probabilities must sum to one, the sources {s_l}_{l=1}^{L} "compete" in or-
der to explain the data. While for some applications a factorized posterior
might be sufficient9, for other applications, such as extracting brain processes
from neuroimaging data, the independence assumption might not be valid.
Nevertheless, many ICA algorithms make this additional factorization (naive
mean-field) assumption anyway [102], [36], [128]. However, Ilin and Valpola
[93] discuss the consequences of this assumption. Therefore, in our model we
will employ a non-factorized posterior.
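The a-posteriori coupling is easy to exhibit in a toy Gaussian case (illustrative numbers only): two sources with independent N(0, 1) priors observed through x = s₁ + s₂ + noise acquire a negative posterior correlation, since either source can 'explain away' the observation.

```python
import numpy as np

# Two sources with independent standard-normal priors, s ~ N(0, I),
# observed through a single mixture x = s1 + s2 + noise (toy numbers).
noise_var = 0.1
H = np.array([[1.0, 1.0]])                 # mixing row: x = H s + e

# Gaussian posterior: precision = prior precision + H^T H / noise_var.
post_prec = np.eye(2) + (H.T @ H) / noise_var
post_cov = np.linalg.inv(post_prec)

# A-priori independent (diagonal covariance); a-posteriori anti-correlated:
# the sources "compete" to explain the single observation.
assert post_cov[0, 1] < 0
```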
The complete graphical model for VB-SMF is shown in Fig. 5.4. To fully specify the model, the probabilistic specification (priors) for all random variables in the model needs to be given. The learning algorithm then infers the wavelet coefficients, c_{l,λ}, ∀l, λ, and learns the time-courses, A_{t,l}, ∀t, l,
These four components of the model are shown as dotted boxes in Fig. 5.4.
5.2.2 Modelling sparsity: Bayesian priors and wavelets
Adaptive sparse prior model for the wavelet coefficients. The char-
acteristic shape of the typical empirical histogram of the wavelet coefficients
of the spatial maps, {c_{l,λ}}, is highly peaked at zero and heavy-tailed. We
want to capture this sparsity pattern of the coefficients in probabilistic terms.
A classical prior for wavelets is the Laplacian distribution, corresponding to
an ℓ1 norm, or the generalized exponential (also called generalized Gaussian)
distribution, corresponding to an ℓp norm [91]. Another sparse prior that
has been used is the Cauchy-Lorentz distribution10 [136]. In BIT-SCA, we
9 For example, one might claim that in the cocktail party problem the voices of the people are generated by passing a white noise process, generated from the lungs, through the vocal cords, independently modulating each individual's output signal.
10 The Cauchy distribution has the probability density function f(x; μ, γ) = 1/{πγ[1 + ((x − μ)/γ)²]}, where μ is the location parameter, specifying the location of the peak of the distribution, and γ is the scale parameter specifying the half-width at half-
Figure 5.4: Variational Bayesian Sparse Matrix Factorization graphical model. In this graph, nodes represent random variables and edges probabilistic dependence relationships between variables. Instantiated nodes appear shaded and nodes corresponding to hyperparameters are represented by smaller, filled circles. We can think of uninstantiated nodes as "uncertainty generators". Repetition over individual indices is denoted by plates, shown as brown boxes surrounding the corresponding variables. Each plate is labeled with its corresponding cardinality. If two or more plates intersect, the resulting tuple of indices is in the Cartesian product of their respective index sets. Finally, each module in the model is shown as a dotted box.
Figure 5.5: Sparse mixture of Gaussians (SMoG) prior for the coefficients c_{l,λ}. Blue/green curves: Gaussian components; thick black curve: mixture density, p(c_{l,λ}).
used the generalized exponential model for wavelet coefficients. While this is a flexible prior, especially under our model (where we estimate the hyperparameters from the data), it does not, unfortunately, lead to analytic expressions. Therefore, we cannot use it under the full VB framework.
Our aim is to model a wide variety of sparseness constraints in a tractable
(analytic) way and at the same time derive an efficient implementation of our
method. In order to achieve this, we use distributions from the conjugate-
exponential family of distributions. The conjugacy property means that the
chosen priors are conjugate to the likelihood terms, leading to posteriors that
have the same functional form as the corresponding priors. One option is to
use a ‘sparse Bayesian’ prior [166], which corresponds to a Student-t marginal
for each p(cl). However, this leads to an algorithm that requires O(LN) scale
parameters for the wavelet coefficients. In our implementation we enforce
sparsity by restricting the general mixture of Gaussians (MoG) model to be
a two-component, zero-mean mixture over each set {c_{l,λ}}_{λ=1}^{Λ}, for each l (Fig. 5.6). A two-state, zero-mean mixture model is an effective distribution for modelling sparse signals, both empirically and theoretically [34], [41].
maximum (HWHM).
Figure 5.6: A directed graphical model representation of a sparse mixture of Gaussian (SMoG) densities. In this model we put a separate MoG prior on the expansion coefficients of the lth spatial map, {c_{l,λ}}_λ. The plate over λ represents the collection {(c_{l,λ}, ξ_{l,λ})}_λ of pairs of state and wavelet-coefficient values for the coefficients of the lth source, for each "data point" λ. The plates over m represent the parameters of the set of M Gaussian components, θ_{c_l} = {μ_{l,m}, β_{l,m}}_m ∪ {π_{l,m}}_m, which collectively model the non-Gaussian density of the coefficients of the source l using Eq. (5.5). The means, μ_{l,m}, are non-zero for the scaling coefficients only; for the wavelet coefficients, we set μ_{l,m} ≐ 0.
The wavelet coefficients have respective state variables {ξ_{l,λ}}_{λ=1}^{Λ}. The parameters of the component Gaussians are the means μ_{l,m} and the precisions β_{l,m}, i.e. p(c_{l,λ}|ξ_{l,λ}, μ_{l,m}, β_{l,m}) = N(c_{l,λ}; μ_{l,m}, β_{l,m}^{−1}), and the associated mixing proportions of the MoG are π_{l,m}, where m denotes the hidden state of the mixture that models the lth source. These form a concatenated parameter vector θ_{c_l}, indexed by m = 1, . . . , M (with M ≐ 2 here). The mixture density of the expansion coefficients of the lth spatial map is then given by
p(c_{l,\lambda} \mid \theta_{c_l}) = \sum_{m=1}^{M} p(\xi_{l,\lambda} = m \mid \pi_l)\, p(c_{l,\lambda} \mid \xi_{l,\lambda}, \mu_l, \beta_l) . \qquad (5.5)
Since we are interested in sparsity, we set μ_{l,m} ≐ 0, ∀l, m, for the wavelet coefficients (but not for the scaling ones11), a-priori. The prior probability of the state variable ξ_{l,λ} being in the mth state is p(ξ_{l,λ} = m | π_l) = π_{l,m}. The
joint prior on the wavelet coefficients given ξ is therefore:
p(C \mid \xi, \mu, \beta) = \prod_{l=1}^{L} \left[ \prod_{\lambda=1}^{\Lambda} p(c_{l,\lambda} \mid \xi_{l,\lambda}, \mu_{l,m}, \beta_{l,m}) \right]
= \prod_{l=1}^{L} \left[ \prod_{\lambda=1}^{\Lambda} \sum_{m=1}^{M} \pi_{l,m}\, \mathcal{N}(c_{l,\lambda}; \mu_{l,m}, \beta_{l,m}) \right] \qquad (5.6)

and the prior on the hidden state variables, ξ = {ξ_l}_{l=1}^{L}, where ξ_l = (ξ_{l,1}, . . . , ξ_{l,Λ}), is

p(\xi = m \mid \pi) = \prod_{l,\lambda} p(\xi_{l,\lambda} = m \mid \pi_l) = \prod_{l,\lambda,m} \pi_{l,m} . \qquad (5.7)
The assignment of hyperpriors is very important for capturing the sparsity
constraint. The prior hyperparameters of the two components have zero mean, and hyperpriors over the precisions such that one component has a
low precision, the other a high precision. These correspond to the two states
of the wavelet coefficients, ‘large’ (carrying signal information) and ‘small’
(corresponding to “noise”). Figure 5.5 depicts this scheme. We assign a
Gaussian hyperprior on the position parameters μ_l,

p(\mu_l) = \prod_{m=1}^{M} \mathcal{N}\!\left(\mu_{l,m};\, m_{\mu_{l,0}},\, v_{\mu_{l,0}}^{-1}\right), \qquad (5.8)

a Gamma on the scale parameters β_l,

p(\beta_l) = \prod_{m=1}^{M} \mathrm{Ga}\!\left(\beta_{l,m};\, b_{\beta_{l,0}},\, c_{\beta_{l,0}}\right) , \qquad (5.9)
11 The scaling coefficients can be thought of as the result of a low-pass type filtering and represent a smoothed version of the original signal. These are not necessarily sparse. Therefore, we allow a general MoG prior for them. Another option is to put a flat prior on them.
where we use the parameterization x \sim \mathrm{Ga}(x; b, c) = \frac{1}{\Gamma(c)\, b^{c}}\, x^{c-1} e^{-x/b}, and a Dirichlet on the mixing proportions π_l,

p(\pi_l) = \prod_{m=1}^{M} \mathrm{Di}(\pi_{l,m}; \alpha_{\pi_{l,0}}) . \qquad (5.10)
Note that the sparse MoG (SMoG) model parameters are not fixed in
advance, but rather they are automatically learned from the data, adapting
to the statistics of the particular spatial maps. In particular, the parameters
β are not fixed but are allowed to range within the “fuzzy” ranges (‘small’,
‘large’) based on the definition of their hyperparameters12 (bβ, cβ). Using
this flexible model we can adapt to various degrees of sparsity and different
shapes of prior.
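The two-state prior of Eqs. (5.5)–(5.10) can be sketched numerically. The density routine below follows the mixture density of Eq. (5.5); the particular mixing proportions and precisions are illustrative values only, not quantities estimated in this work:

```python
import numpy as np

def smog_density(c, pi, beta, mu=None):
    """Mixture-of-Gaussians density p(c) = sum_m pi_m N(c; mu_m, beta_m),
    with beta_m a precision (inverse variance), as in Eq. (5.5)."""
    pi, beta = np.asarray(pi, float), np.asarray(beta, float)
    mu = np.zeros_like(pi) if mu is None else np.asarray(mu, float)
    c = np.atleast_1d(np.asarray(c, float))[:, None]           # (n, 1)
    comps = np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (c - mu) ** 2)
    return (pi * comps).sum(axis=1)

# Illustrative two-state sparse prior: a 'small' (high-precision) and a
# 'large' (low-precision) zero-mean component; the numbers are made up.
pi = [0.9, 0.1]
beta = [100.0, 1.0]
dens = smog_density([0.0, 3.0], pi, beta)
```

The high-precision component concentrates most of the mass near zero, producing the sharp peak and heavy tails characteristic of a sparse prior.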
5.2.3 Hierarchical mixing model and Automatic Relevance Determination

For the entries of the factor A we choose a hierarchical prior that will allow us to select the most relevant column vectors to explain a particular data set, X. In particular, the prior over the timecourses, {a_l}_{l=1}^{L}, is a zero-mean Gaussian with a different precision hyperparameter α_l over the lth column vector, a_l: p(a_l | α_l) = N(a_l; 0_{T×1}, α_l^{−1} I_T), for l = 1, …, L. Therefore, the joint prior over the bases is

    p({a_l}_{l=1}^{L} | α) = ∏_{l=1}^{L} p(a_l | α_l) = ∏_{l=1}^{L} N(A_{:,l}; 0_{T×1}, α_l^{−1} I_T) ,   (5.11)

where A_{:,l} denotes the lth column vector of entries for the time-points t = 1, …, T, and the components of the precision α = (α_1, …, α_l, …, α_L), corresponding to the column vectors [a_l]_{l=1}^{L} of the mixing matrix (and, consequently, to the spatial maps {s_l}_{l=1}^{L}), are the same for all entries: α_{t,l} = α_l, ∀t. The prior over each α_l is in turn a Gamma distribution, p(α_l) = Ga(α_l; b_{α_l}, c_{α_l}). This hierarchical prior leads to a sparse marginal distribution for {a_l}_{l=1}^{L} (a Student-t, which can be shown if one integrates out the precision hyperparameter, α_l). By monitoring the evolution of the α_l, the relevance of each time-course may be determined; this is referred to as Automatic Relevance Determination (ARD) [115]. This allows us to infer the complexity of the decomposition and obtain a sparse matrix factorization in terms of the time-courses as well, by suppressing irrelevant sources.

¹²Also note that for b_{β,0} → ∞ and c_{β,0} → 0, the Gamma prior becomes the uninformative prior p(β) ∝ 1/β.
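The heavy-tailed (Student-t) marginal implied by the Gamma-precision hierarchy can be illustrated by a quick Monte-Carlo check (this is an illustration, not part of the inference itself; the hyperparameter values are arbitrary):

```python
import numpy as np

# Sample a_l | alpha_l ~ N(0, 1/alpha_l) with alpha_l ~ Ga(b, c) integrated
# out by sampling; the marginal is then Student-t with 2c degrees of freedom.
rng = np.random.default_rng(0)
b0, c0 = 2.0, 5.0                                     # illustrative scale/shape
alpha = rng.gamma(shape=c0, scale=b0, size=200_000)   # precisions
a = rng.normal(0.0, 1.0 / np.sqrt(alpha))             # mixing coefficients

def excess_kurtosis(x):
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0

k = excess_kurtosis(a)   # > 0: heavier tails than a Gaussian
```

A positive excess kurtosis confirms that the marginal is super-Gaussian, which is what allows ARD to drive irrelevant columns towards zero while leaving a few large ones.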
5.2.4 Noise level model

Finally, the prior distribution over the precision, R = R I_T, of the noise process, E (i.e. the inverse of the noise level), is a gamma density with parameters (b_R, c_R):

    p(R | b_R, c_R) = Ga(R; b_R, c_R) = (1 / (Γ(c_R) (b_R)^{c_R})) R^{c_R−1} e^{−R/b_R} .   (5.12)
5.3 Approximate Inference by Free Energy Minimization

5.3.1 Variational Bayesian Inference

In the section above, we stated a generative model for sparse bilinear decomposition. Let us now collect all unknowns in the set U = {π, µ, β, ξ, C, α, A, R}. Exact Bayesian inference in such a model is extremely computationally intensive and often intractable, since we need, in principle, to perform integration in a very high dimensional space in order to obtain the joint posterior of U given the data, p(U | X). Instead, we will use the variational Bayesian (VB) framework [7] for efficient approximate inference in high-dimensional settings, such as fMRI. The idea in VB is to approximate the complicated exact posterior with a simpler approximate one, Q(U), that is closest to p(U | X) in an appropriate sense, in particular in terms of the Kullback-Leibler (KL) divergence. The optimization functional in this case is the (negative) variational free energy of the system:

    F(Q, X) = ⟨log p(X, U)⟩ + H[Q(U)] ,   (5.13)

where the average, ⟨·⟩, in the first term (the negative variational 'energy') is over the variational posterior, Q(U), and the second term is the entropy of Q(U). The negative free energy forms a lower bound to the Bayesian log evidence, i.e. the marginal likelihood of the observations. This is the probability of the observations being instantiated at a particular value, after all latent variables and parameters in the model have been integrated out. Maximizing the bound minimizes the "distance" between the variational and the true posterior. Using this bound we derive an efficient, variational Bayesian algorithm for learning and inference in the model that provides posterior probability distributions over all random variables of the system, i.e. the source densities, mixing matrix, source model parameters, and noise level.

We choose to restrict the variational posterior Q(U) to belong to a class of distributions that are factorized over subsets of the ensemble of variables:

    Q(U) = ( ∏_{λ=1}^{Λ} Q(c_λ) Q(ξ_λ) ) ( Q(π) Q(µ) Q(β) ) ( Q(A) Q(α) ) Q(R) .   (5.14)
The above factorization incorporates the standard variational approximation between the latent variables and parameters,

    p({c_λ, ξ_λ}_{λ=1}^{Λ}, θ | X) ≈ Q(θ) ∏_{λ=1}^{Λ} Q(c_λ, ξ_λ) .   (5.15)

That is, the distribution of latent variables factorizes across the λ–plate. However here, unlike e.g. [152] or [149], we will employ variational posteriors that are coupled across latent dimensions for the wavelet coefficients of the spatial maps and the time-courses¹³.

Performing functional optimization with respect to the distributions of the unknown variables, ∂F(Q, X)/∂Q(u) = 0, for all elements u of U, we obtain the optimal form for the posterior:

    u ∈ U :  Q(u) ∝ exp( ⟨log p(X, U)⟩_{Q(U∖u)} ) .   (5.16)

This results in a system of coupled equations, which are solved in an iterative manner. Theoretical results [7] show that the algorithm converges to a (local) optimum. Since we have chosen to work in the conjugate-exponential family, the posterior distributions have the same functional form as the priors, and the update equations are essentially "moves", in parameter space, of the parameters of the priors due to observing the data. It turns out that the structure of the equations is such that, due to the Markovian structure of the DAG, only information from the local "neighborhood" of each node is used. Exploiting the graphical structure of the model, the resulting update equations depend only on posterior averages over nodes belonging to the Markov blanket of each node. This is the set comprising the parents, the children, and the co-parents of the particular node at hand.

¹³ICA decompositions can also be sparse, if sparsity-enforcing source models are used. However, the SCA decompositions presented here are non-independent. (We have already shown the difference between a-priori and a-posteriori coupling in the bi-partite graphical model for CA.)
We next show the update equations for the wavelet coefficients of the spatial
maps and the time-courses.
5.3.2 Posteriors for the variational Bayesian sparse matrix factorization model

Inferring the wavelet coefficients of the spatial maps, C

Equation (5.3) combined with Eq. (5.15) leads to a system of equations, one for each column, λ, of length L, linking the data and source vectors:

    x_λ = A c_λ + ε_λ = Σ_l c_{l,λ} a_l + ε_λ ,  λ = 1, …, Λ = |Φ| ,   (5.17)

where c_λ is the vector of the expansion coefficients, c_λ = (c_{1,λ}, …, c_{L,λ}). This vector is represented as the second layer of nodes in the graphical model of Fig. 5.3. In the (deterministic) solution of the sparse model of Li et al. [111] via linear programming, this has the rather convenient consequence that we only need to solve N smaller linear programming problems in order to estimate the sparse coefficients. An analogous result is obtained in the variational Bayesian model as well. In particular, the variational posterior has a Gaussian functional form, N(C; µ_{L×Λ}, β_{Λ×L×L}), with mean and precision parameters for the λth wavelet coefficient vector, c_λ, given by:

    µ_λ = (β_λ)^{−1} [ µ̃_λ + ⟨A^T⟩ ⟨R⟩ x_λ ]   (5.18)

and

    β_λ = diag(β̃_λ) + ⟨A^T R A⟩ .   (5.19)
The quantities µ̃_λ and β̃_λ are 'messages' sent to the node c_λ by its parents and are computed by

    µ̃_{l,λ} = Σ_{m=1}^{M} γ_{lλm} ⟨β_{lm}⟩ ⟨µ_{lm}⟩ ,   β̃_{l,λ} = Σ_{m=1}^{M} γ_{lλm} ⟨β_{lm}⟩ .   (5.20)

The weighting coefficient γ_{lλm}, called 'responsibility', encodes the probability of the mth Gaussian kernel generating the λth wavelet coefficient of the lth spatial map. It is defined as the posterior probability of the state variable ξ_{l,λ} being in the mth state:

    γ_{lλm} ≡ Q(ξ_{lλ} = m) ,  m = 1, …, M .   (5.21)

This will be given in Eq. (5.25).
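A minimal sketch of the coefficient update (5.18)–(5.20), assuming the posterior expectations ⟨A⟩, ⟨AᵀRA⟩, ⟨R⟩, ⟨β⟩, ⟨µ⟩ and the responsibilities have already been computed by the other factors; the array shapes and names are ours, not the thesis implementation:

```python
import numpy as np

def update_coefficients(x_lam, EA, EAtRA, ER, gamma_lam, Ebeta, Emu):
    """Variational Gaussian posterior for one coefficient vector c_lambda,
    Eqs. (5.18)-(5.20).
      x_lam     : (T,)   data column in the wavelet domain
      EA        : (T, L) <A>
      EAtRA     : (L, L) <A^T R A>, treated as given here
      ER        : scalar <R> (isotropic noise precision)
      gamma_lam : (L, M) responsibilities for this lambda
      Ebeta, Emu: (L, M) <beta_lm>, <mu_lm>
    Returns the posterior mean (L,) and precision matrix (L, L)."""
    beta_msg = (gamma_lam * Ebeta).sum(axis=1)            # (L,)  Eq. (5.20)
    mu_msg = (gamma_lam * Ebeta * Emu).sum(axis=1)        # (L,)  precision-weighted mean
    prec = np.diag(beta_msg) + EAtRA                       # Eq. (5.19)
    mean = np.linalg.solve(prec, mu_msg + ER * (EA.T @ x_lam))  # Eq. (5.18)
    return mean, prec
```

Because the likelihood couples the L coefficients of each column through ⟨AᵀRA⟩, the posterior precision is a full L×L matrix rather than a diagonal one.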
The rest of the update equations for the posteriors of the sparse MoG model, {ξ, µ, β, π}, take a standard form and are shown next, in Eqns (5.25) to (5.42).
Learning the time-courses and their precisions, A, α

The variational posterior over the matrix of time-courses, A_{T×L}, is a product of Gaussians with mean and precision parameters for the tth row of A given by

    a_t = [ 0_{1×L} + ⟨R⟩ ( Σ_{λ=1}^{Λ} ⟨x_{tλ} c_λ^T⟩ ) ] (Γ_{a_t})^{−1} ,   (5.22)

    Γ_{a_t} = diag(⟨α⟩) + ⟨R⟩ ( Σ_{λ=1}^{Λ} ⟨c_λ c_λ^T⟩ ) .   (5.23)

The variational posterior of the precisions α = (α_l) is given by a product of Gamma distributions, α_l ∼ Ga(α_l; b_{α_l}, c_{α_l}), with variational parameters

    b_{α_l} = ( 1/b_{α_0} + (1/2) Σ_{t=1}^{T} ⟨A_{tl}²⟩ )^{−1} ,   c_{α_l} = c_{α_0} + (1/2) T ,   (5.24)

for the lth column of A. The posterior means of the precisions α_l, ⟨α_l⟩ = b_{α_l} c_{α_l}, are a measure of the relevance of each timecourse, a_l.
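The time-course and ARD updates (5.22)–(5.24) can be sketched as follows, again assuming the required posterior expectations are supplied from the other factors (shapes and names are ours):

```python
import numpy as np

def update_timecourses(X, ER, Ec, Ecc, Ealpha):
    """Row-wise Gaussian posterior for A, Eqs. (5.22)-(5.23).
      X      : (T, Lam) data in the wavelet domain
      ER     : scalar <R>
      Ec     : (L, Lam) posterior means <c_lambda>
      Ecc    : (L, L)   sum_lambda <c_lambda c_lambda^T>
      Ealpha : (L,)     <alpha_l>
    The precision Gamma_a is the same for every row t."""
    Gamma_a = np.diag(Ealpha) + ER * Ecc                 # Eq. (5.23)
    EA = np.linalg.solve(Gamma_a, ER * (Ec @ X.T)).T     # (T, L) means, Eq. (5.22)
    return EA, Gamma_a

def update_alpha(EA2_colsum, b0, c0, T):
    """ARD precision update, Eq. (5.24), from sum_t <A_tl^2>."""
    b = 1.0 / (1.0 / b0 + 0.5 * EA2_colsum)
    c = c0 + 0.5 * T
    return b, c, b * c                                   # <alpha_l> = b_l c_l
```

In the noiseless, well-conditioned limit (broad ARD prior) the row update reduces to the least-squares regression of the data onto the coefficient means.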
Inferring the states, ξ, of the wavelet coefficients and learning the parameters of the wavelet model, θ_C

The responsibilities, γ_{lλm}, i.e. the posterior probabilities that the wavelet coefficient c_{l,λ} was produced by the mth Gaussian component (captured by the random event ξ_{l,λ} = m, where ξ_{l,λ} is the state of the wavelet coefficient), γ_{lλm} = Q(ξ_{l,λ} = m), are given by

    γ_{lλm} = γ̃_{lλm} / Z_{θ_{c_l}} ,   (5.25)

where

    γ̃_{lλm} = π̃_{l,m} · [ β̃_{l,m}^{1/2} exp( −(⟨β_{l,m}⟩/2) ⟨(c_{lλ} − µ_{l,m})²⟩_{Q(c_l, µ_l)} ) ]   (5.26)

and

    Z_{θ_{c_l}} = Σ_{m′} γ̃_{lλm′} ,   (5.27)

ensuring that Q(ξ_{l,λ}) is a properly scaled probability density. The tilded quantities in the above equations are the exponential parameters

    π̃_{lm} = e^{⟨log π_{lm}⟩_Q}   (5.28)

and

    β̃_{lm} = e^{⟨log β_{lm}⟩_Q} .   (5.29)

These are computed by

    e^{⟨log π_{l,m}⟩_Q} = exp( Ψ(α_{π_{l,m}}) − Ψ( Σ_{m′} α_{π_{l,m′}} ) )   (5.30)

and

    e^{⟨log β_{l,m}⟩_Q} = b_{β_{l,m}} exp( Ψ(c_{β_{l,m}}) ) ,   (5.31)

where Ψ(·) is the digamma function, Ψ(x) ≜ (d/dx) log Γ(x) = Γ′(x)/Γ(x). Note that in equation (5.26), the first factor comes from the parents, {π_{l,m}}_{m=1}^{M}, of the node ξ_{l,λ}, and is related to its prior, while the second factor (in square brackets) comes from the children, and is related to the 'likelihood'. In particular, we can interpret the above expression as the node π_{l,m} "sending a message" π̃_{l,m} to the state variable, ξ_{l,λ}, while the children and co-parents send the expression in brackets.
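The responsibility computation (5.25)–(5.31) can be sketched as below; we work in the log domain and subtract the per-λ maximum before exponentiating, a standard numerical-stability device not spelled out in the equations (array shapes are our convention):

```python
import numpy as np
from scipy.special import digamma

def update_responsibilities(alpha_pi, b_beta, c_beta, Esq_dev):
    """Responsibilities gamma_{l,lambda,m}, Eqs. (5.25)-(5.31).
      alpha_pi       : (L, M) Dirichlet posterior parameters
      b_beta, c_beta : (L, M) Gamma posterior parameters of the precisions
      Esq_dev        : (L, Lam, M) <(c_{l,lambda} - mu_{l,m})^2> under Q
    Returns gamma with shape (L, Lam, M), normalized over m."""
    log_pi_t = digamma(alpha_pi) - digamma(alpha_pi.sum(axis=1, keepdims=True))  # Eq. (5.30)
    log_beta_t = np.log(b_beta) + digamma(c_beta)                                # Eq. (5.31)
    Ebeta = b_beta * c_beta                                                      # <beta_lm>
    log_g = (log_pi_t + 0.5 * log_beta_t)[:, None, :] \
            - 0.5 * Ebeta[:, None, :] * Esq_dev                                  # Eq. (5.26), log form
    log_g -= log_g.max(axis=2, keepdims=True)        # stabilize before normalizing
    g = np.exp(log_g)
    return g / g.sum(axis=2, keepdims=True)          # Eqs. (5.25), (5.27)
```

As expected of the two-state prior, coefficients close to zero are claimed by the high-precision 'small' state and large coefficients by the low-precision 'large' state.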
The wavelet model parameters, θ_C = {π, β, µ}, are updated according to the following update equations for their corresponding posteriors:

• Variational posterior of the mixing proportions, Q(π):

    α_{π_{l,m}} = α_{π_l,0} + Λ_{l,m} ,   (5.32)

where Λ_{l,m} is given below, in Eq. (5.39).

• Variational posterior of the means, Q(µ): For the scaling coefficients, we have

    m_{µ_{l,m}} = (v_{µ_{l,m}})^{−1} [ v_{µ_l,0} m_{µ_l,0} + v_{c_{l,m}} m_{c_{l,m}} ] ,   (5.33)

    v_{µ_{l,m}} = v_{µ_l,0} + v_{c_{l,m}} ,   (5.34)

where the "data-dependent" (children) part, that is, the information coming from the wavelet coefficients c_l, is computed as

    m_{c_{l,m}} = c̄_{l,m} / π̄_{l,m} ,   (5.35)

    v_{c_{l,m}} = Λ_{l,m} ⟨β_{l,m}⟩ .   (5.36)

For the wavelet coefficients, we set µ ≐ 0. This enforces the sparsity constraint as well.

• Variational posterior of the precisions, Q(β):

    b_{β_{l,m}} = ( 1/b_{β_l,0} + (1/2) Λ σ̄²_{c_{l,m}} )^{−1} ,   (5.37)

    c_{β_{l,m}} = c_{β_l,0} + (1/2) π̄_{l,m} Λ .   (5.38)
The above update equations make use of the following responsibility-weighted posterior statistics of the set of wavelet coefficients c_l = {c_{l,λ}}:

    π̄_{l,m} = Λ_{l,m} / Λ = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ,  or  Λ_{l,m} = π̄_{l,m} Λ ,   (5.39)

where Λ_{l,m} can be interpreted as a pseudo-count of the number of wavelet bases that can be attributed to the mth Gaussian kernel, and

    c̄_{l,m} = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ⟨c_{l,λ}⟩ ,   (5.40)

    c̄²_{l,m} = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ⟨c_{l,λ}²⟩ ,   (5.41)

    σ̄²_{c_{l,m}} = (1/Λ) Σ_{λ=1}^{Λ} γ_{lλm} ⟨(c_{l,λ} − µ_{l,m})²⟩ ,   (5.42)

where c̄_{l,m}, c̄²_{l,m}, σ̄²_{c_{l,m}} can be interpreted as the contribution of component m to the average, second moment, and centered second moment (variance) of the source l, and π̄_{l,m}, Λ_{l,m} are the proportion and number of samples that are attributed to component m of source l, respectively.

From the above equations we can see that there is a natural hierarchy in the computations of the update equations. Note first that the responsibility-weighted statistics, equations (5.39)–(5.42) above, are equivalent to the maximum-likelihood estimates of the mixture model via the EM algorithm, the difference being that here we use averages of c_{lλ} with respect to its variational posterior. Referring to the graphical model for the SMoG, Fig. 5.6, these are the "inputs" that are sent from the children nodes, {c_{lλ}}_{λ=1}^{Λ}, to the MoG model for the lth source, and they are the sufficient statistics, {(⟨c_{lλ}⟩, ⟨c_{lλ}²⟩)}_{λ=1}^{Λ}.
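The statistics (5.39)–(5.42) for one source can be sketched as follows; in Eq. (5.42) we expand ⟨(c − µ)²⟩ with µ plugged in at its posterior mean, i.e. the posterior variance of µ is ignored in this sketch:

```python
import numpy as np

def weighted_stats(gamma, Ec, Ec2, Emu):
    """Responsibility-weighted statistics of Eqs. (5.39)-(5.42) for source l.
      gamma   : (Lam, M) responsibilities
      Ec, Ec2 : (Lam,)   <c_lambda>, <c_lambda^2>
      Emu     : (M,)     <mu_lm> (posterior variance of mu neglected here)"""
    Lam = gamma.shape[0]
    pi_bar = gamma.mean(axis=0)                          # Eq. (5.39)
    N_m = pi_bar * Lam                                   # pseudo-counts Lambda_{l,m}
    c_bar = (gamma * Ec[:, None]).mean(axis=0)           # Eq. (5.40)
    c2_bar = (gamma * Ec2[:, None]).mean(axis=0)         # Eq. (5.41)
    # Eq. (5.42) via <(c - mu)^2> ~= <c^2> - 2 mu <c> + mu^2
    var_bar = c2_bar - 2 * Emu * c_bar + Emu ** 2 * pi_bar
    return pi_bar, N_m, c_bar, c2_bar, var_bar
```

With uniform responsibilities and zero means the weighted variance reduces to the weighted second moment, as it should.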
Learning the noise model parameter, R

Finally, the noise precision has a Gamma posterior distribution, R ∼ Ga(R; b_R, c_R), with hyperparameters

    b_R = [ 1/b_{R_0} + (1/2) ⟨ tr( (X − AC)(X − AC)^T ) ⟩ ]^{−1} ,   c_R = c_{R_0} + (1/2) TΛ .   (5.43)

The node R receives exactly TΛ messages from its children and co-parents, as expected.
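The noise update (5.43) can be sketched with a plug-in residual, ⟨tr((X−AC)(X−AC)ᵀ)⟩ ≈ ‖X − ⟨A⟩⟨C⟩‖²_F; the posterior covariance corrections of A and C are deliberately omitted from this sketch:

```python
import numpy as np

def update_noise(X, EA, EC, b0, c0):
    """Gamma posterior of the noise precision R, Eq. (5.43).
    Uses the plug-in residual ||X - <A><C>||_F^2; the covariance terms of
    A and C that appear in the full expectation are omitted here."""
    T, Lam = X.shape
    resid = np.sum((X - EA @ EC) ** 2)
    b = 1.0 / (1.0 / b0 + 0.5 * resid)
    c = c0 + 0.5 * T * Lam
    return b, c, b * c      # <R> = b c
```

With a perfect fit the scale parameter stays at its prior value while the shape grows by TΛ/2, one half-count per observed entry, matching the "TΛ messages" remark above.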
5.3.3 Free Energy

From the definition of the negative free energy, we have

    F = ⟨log p(U, X)⟩_{Q(U)} + H[Q(U)] ,

where the variables (U, X) comprise the node-set V(G) of the graph and the set of unknowns is comprised of the latent variables and parameters, U = Y ∪ θ. Using the standard variational factorization,

    Q(U) = Q(Y) Q(θ) ,

and the Markovian structure of the graph, G, with node-set V(G), where each node, v, has a probability distribution conditioned on its parents,

    G → (p, G)  ⇒  p(V(G)) = ∏_{v∈V} p(v | pa(v))   (Markov) ,

the free energy takes the following equivalent forms:

    F = ⟨log p(X, Y | θ)⟩_{Q(Y,θ)} + ⟨log p(θ)⟩_{Q(θ)} + H[Q(Y)] + H[Q(θ)]   (5.44)

and

    F = ⟨log p(X, Y | θ)⟩_{Q(Y,θ)} + H[Q(Y)] − KL(Q(θ) ∥ p(θ)) ,   (5.45)

formulated in terms of the entropies of the latent variables and the KL divergences of the parameters. In a Bayesian network formulation of a probabilistic model, each node contributes a term to F.
Applying the above to the Bayesian network of the VB-SMF model, shown in Fig. 5.4, we obtain the negative free energy of our model. Note that, after the algorithm has converged, several terms cancel out (see, for example, [37]). The NFE is then:

    F = F_{(X,R)} + F_{(C,ξ)} + F_{(A,α)} + F_{SMoG} ,   (5.46)

where

    F_{(X,R)} = −(1/2) TΛ log 2π + T [ log( Γ(c_R) / Γ(c_R^0) ) + c_R log b_R − c_R^0 log b_R^0 ]   (5.47)

    F_{(C,ξ)} = Σ_{λ=1}^{Λ} [ (1/2) L − (1/2) log|β_λ| ] − Σ_{l=1}^{L} Σ_{λ=1}^{Λ} Σ_{m=1}^{M} γ_{l,λ,m} log γ_{l,λ,m}   (5.48)

    F_{(A,α)} = Σ_{t=1}^{T} [ (1/2) L − (1/2) log|Γ_{a_t}| ] + Σ_{l=1}^{L} [ log( Γ(c_{α_l}) / Γ(c_α^0) ) + c_{α_l} log b_{α_l} − c_α^0 log b_α^0 ]   (5.49)

    F_{SMoG} = Σ_{l=1}^{L} Σ_{m=1}^{M} [ log( Γ(c_{β_{l,m}}) / Γ(c_{β_l}^0) ) + c_{β_{l,m}} log b_{β_{l,m}} − c_{β_l}^0 log b_{β_l}^0 ]   (5.50)
             + Σ_{l=1}^{L} [ Σ_{m=1}^{M} log( Γ(α_{π_{l,m}}) / Γ(α_{π_l}^0) ) − log( Γ(Σ_{m=1}^{M} α_{π_{l,m}}) / Γ(Σ_{m=1}^{M} α_{π_l}^0) ) ] .   (5.51)
The negative free energy is used, apart from deriving the update equations,
to monitor the convergence of the algorithm.
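Since the negative free energy increases monotonically under the variational updates, a simple stopping rule on its trace suffices; the relative tolerance below is an arbitrary choice:

```python
def converged(F_trace, tol=1e-4):
    """Convergence check on the negative-free-energy trace: stop when the
    relative change between successive iterations falls below tol."""
    if len(F_trace) < 2:
        return False
    prev, curr = F_trace[-2], F_trace[-1]
    return abs(curr - prev) <= tol * abs(prev)
```

In practice one also asserts that the trace never decreases: a decrease signals a bug in one of the update equations.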
5.4 Results

In this Section we present results from using three real fMRI data sets: (1) a single subject (epoch) block-design auditory dataset¹⁴, (2) a block-design auditory-visual data set¹⁵, and (3) the block-design visual dataset used by Friston, Jezzard, and Turner in [71] and described in Sect. 3.4.3. These data sets were used to compare the performance of our model and conventional methods. In particular, results from GLMs were used as the 'gold standard', providing the reference function for the time-courses, while the Probabilistic ICA model of Beckmann and Smith [15] was used as the state-of-the-art for "model-free" analysis.

¹⁴From G. Rees, K. Friston and the FIL methods group, Wellcome Trust Centre for Neuroimaging, University College London (www.fil.ion.ucl.ac.uk/spm/data/auditory/).
5.4.1 Auditory stimulus data set

The task and data acquisition are described next. Auditory stimulation was with bi-syllabic words presented binaurally at a rate of 60/min. The condition for successive blocks alternated between rest and auditory stimulation, starting with rest. Whole-brain BOLD/EPI images were acquired on a modified 2.0T Siemens MAGNETOM Vision system and each acquisition consisted of 64 contiguous slices of size 64 × 64 voxels with voxel size 3mm × 3mm × 3mm. A total of 96 acquisitions were made in blocks of 6, with the scan-to-scan repeat time set to 7s (TR = 7s), giving 16 blocks of 42s each. For our analysis, the first four scans were discarded as "dummy" scans, due to T1 effects, and from the rest, 84 scans were used.
We ran the variational Bayesian sparse matrix factorization (VB-SMF) model on the dataset in order to detect 'consistently task related' ("predictable") spatio-temporal components (CTRs). Fig. 5.7 shows the spatial map and corresponding time-course (red curve), which was selected as the column of the mixing matrix that best matched the expected time-course (black curve). The latter was constructed by convolving the experimental design with the canonical haemodynamic response function from SPM¹⁶. The figure shows that the model correctly identified the neural activation in the auditory area. The result from VB-SMF was compared with applying the Probabilistic ICA model on the same data set (green curve). It can be seen that the VB-SMF better matches the expected timecourse. In particular, the correlation coefficients were r_{VB−SMF} = 0.801 and r_{PICA} = 0.778, respectively.

[Figure 5.7: Time course (left) and corresponding spatial map (right) resulting from applying the variational Bayesian sparse matrix factorization model to the auditory fMRI data set of Rees and Friston (www.fil.ion.ucl.ac.uk/spm/data/auditory/). Timecourses: red curve: our model; green curve: PICA; black curve: reference curve from GLM. The heights of the timecourses on the y-axis were normalized to best match the GLM reference function and range from −1 to 1, approximately. The spatial map is the raw result from the model (no thresholding post-processing was performed). Image intensities range from 0 to about 20.]

¹⁵From the Oxford Centre for Functional MRI of the Brain (www.fmrib.ox.ac.uk/fsl/fsl/feeds.html).
¹⁶This has the general form (Friston, [70])

    h(t) = α (t/d₁)^{a₁} e^{−(t−d₁)/b₁} − c (t/d₂)^{a₂} e^{−(t−d₂)/b₂} ,

and is designed such that it takes various neuronal parameters into account (such as the onset, the length of the kernel, the delay of response relative to the onset, the dispersion of response, the delay of undershoot relative to the onset, the dispersion of undershoot, and the ratio of response to undershoot) in a mathematically convenient way. The default parameters (0, 32, 6, 1, 16, 1, 6) were used here.
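For illustration, the double-gamma form in the footnote can be implemented directly. The mapping of the default parameter vector onto (d₁, b₁, d₂, b₂, ratio) and the choice a_i = d_i/b_i are our assumptions, and the overall α scaling is dropped (it is removed by the normalization described in the figure caption anyway):

```python
import numpy as np

def canonical_hrf(t, d1=6.0, b1=1.0, d2=16.0, b2=1.0, ratio=6.0):
    """Double-gamma HRF of the footnote's form,
       h(t) = (t/d1)^{a1} e^{-(t-d1)/b1} - (1/ratio)(t/d2)^{a2} e^{-(t-d2)/b2},
    with a_i = d_i/b_i (assumed mapping to the default parameters)."""
    t = np.asarray(t, float)
    a1, a2 = d1 / b1, d2 / b2
    peak = (t / d1) ** a1 * np.exp(-(t - d1) / b1)
    under = (t / d2) ** a2 * np.exp(-(t - d2) / b2)
    return np.where(t > 0, peak - under / ratio, 0.0)

# Expected CTR timecourse: convolve a boxcar design with h (illustrative
# reconstruction of the rest/stimulation block structure, 6 scans per block).
TR, n_scans = 7.0, 84
design = np.repeat([0.0, 1.0] * 7, 6)[:n_scans]
hrf = canonical_hrf(np.arange(0.0, 32.0, TR))
expected = np.convolve(design, hrf)[:n_scans]
```

The response term peaks near t = d₁ = 6 s and the subtracted term produces the post-stimulus undershoot around t = d₂ = 16 s.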
5.4.2 Auditory-visual data set

We then tested the sparse decomposition model on the well-known auditory-visual fMRI data set provided with the FSL FEEDS package [158], used as a benchmark. The data set contains 45 time-points and 5 slices of size 64 × 64 voxels each. It was specifically designed as such in order to make the responses more difficult to detect. We ran our model on the dataset in order to detect 'consistently task related' (CTR) components. We applied the standard preprocessing steps (motion correction, registration, etc.), but no variance normalization or dimensionality reduction. For each of the separated components, we computed the correlation coefficient, r, between the associated timecourse, a_l, and the 'expected timecourses', which were the canonical explanatory variables (EVs) from FEAT. After convergence, the model inferred only L = 3 components with r > 0.3. The component with the highest value of r was identified as the CTR map. A strong visual and a strong auditory component were extracted by the model; these are shown in Fig. 5.8. The correlation coefficients were r_{vis} = 0.858 and r_{aud} = 0.764. The corresponding PICA coefficients from MELODIC were 0.838 and 0.756, respectively. The result of VB-ICA [152] on the same dataset was 0.780 and 0.676, respectively [78]. It is worth noting that the spatial maps extracted from our model, displayed in Fig. 5.8, were also much cleaner than those from both PICA (also shown in Fig. 5.8) and VB-ICA (not shown). This is due to applying the sparse prior on the maps.
5.4.3 Visual data set: An Experiment with Manifold
Learning
We now return to the small visual data set of Friston et al. [71], used in
sections 3.4.3 and 4.6.2. The purpose of this experiment is to combine the
[Figure 5.8: Time courses and corresponding spatial maps resulting from applying the variational Bayesian sparse decomposition model to a visual-auditory fMRI data set. Panels: auditory and visual stimulus timecourses (VB-SR, PICA, canonical EV from the GLM) and the auditory and visual spatial maps from PICA and VB-SR. Red curve: our model; green curve: PICA; blue curve: canonical EVs. Note that the maps are the raw results from the model; no thresholding post-processing was performed.]
VB-SMF model for sparse data decomposition with manifold¹⁷ learning. Up to now we have been using data matrices that contained all data-points, i.e. all scans at time-points t = 1, …, T, in their rows, without any dimensionality reduction. In neuroimaging analysis practice, however, using all observation dimensions is usually impractical and time-consuming. Moreover, as mentioned in Sect. 5.1 and in [15] by Beckmann and Smith, for most practical purposes the signal is contained in a lower-dimensional subspace of the original ambient observation space, the rest containing processes that can be characterized as "noise". However, as recent research in Machine Learning has shown [114], [83], [42], a better model of the geometry of high-dimensional datasets is a curved manifold. This approach was also taken by [173], although the authors did not explicitly state it as such. A manifold can potentially capture the intrinsic dimensionality better than a linear object. Note that we have defined a data point as a whole scan, x_t = {X(t,v) : v ∈ V}, here. We will continue to use all voxels and only

¹⁷A manifold is a smooth geometric object that can be locally described by a hyperplane, its tangent plane.
reduce the dimensionality in the temporal dimension, as is done when using PCA pre-processing, for example. This essentially means that we implicitly assume that the activation dynamics can be described by a smaller number of dynamical parameters than T. As also mentioned in the Introduction of the Thesis, (spatial) component analysis algorithms applied to spatio-temporal data can also be interpreted as soft multi-way clustering methods, in the sense that the observation at each voxel, v, x_v = (x_{1,v}, …, x_{T,v}), can be described by very few (ideally just one) learned "regressors", if these manage to sufficiently capture the inherent dynamics in the data. Conversely, the observations projected on to the extracted regressors, visualized in latent space, should ideally cluster around those vectors. The feature-space SMF model was defined in a way that did not pre-suppose any particular feature set. We are free to combine the temporal dimensionality reduction of manifold learning with the sparse wavelet decomposition in the spatial dimension. We also note here that dimensionality reduction schemes are fully compatible with automatic relevance determination (ARD), since ARD is used here to select the relevance of the L bases, {a_l}_{l=1}^{L}. That is, the dimensionality of the latent space is inferred to be less than or equal to L. In practice, L itself can result (for high-dimensional problems where we believe that L ≪ D) from any dimensionality reduction method. This somewhat pragmatic approach was advocated by Neal and Zhang in [143] for their winning entry in the NIPS 2003 challenge, where PCA was used as dimensionality reduction pre-processing.
In this experiment, the Locality Preserving Projection (LPP) technique of He and Niyogi [83] was used, due to its computational simplicity, although any manifold learning method could, in principle, be used. LPP starts by regarding each data point x ∈ X = {x₁, …, x_m} ⊂ R^n (using the notation of [83]) as a vertex of a graph, G, which is itself considered as a sample from a manifold, M, embedded in R^n. The goal is to find a linear map f : M → R^l (l ≪ n), represented by a matrix, A, such that locality is preserved, i.e. points close together on the manifold (graph) are mapped close together in the transformed space, R^l. We denote the projected points by y_i = A x_i. The graph is built by exploiting neighboring information in the data set. If x_i and x_{i′} are close, they are considered neighbors, and an edge is put between them. Connectivity can be defined by various criteria and can be "topological", such as taking the k nearest neighbors of each data point, or "geometrical", by regarding as neighbors all points lying at a distance smaller than ε. The relative importance of each point is defined by a weight matrix, W, which can again be defined in a variety of ways, from the simplest, setting W_{ii′} = 1 if i and i′ are neighbors, to using the heat kernel, W_{ii′} = e^{−‖x_i − x_{i′}‖²/t}. "Closeness" is defined by the Riemannian metric on M and by the weight W on the graph G. Note that we have implicitly used the definition of a manifold, as a smooth object locally approximated by a plane, and the fact that if x_{i′} is in the neighborhood of x_i we can use the Euclidean distance, ‖ · − · ‖, in the tangent "plane" (space) E^l. The projection operator is found by solving an optimization problem that minimizes the norm of y = (y₁, …, y_m) under the W metric, ‖y‖²_W = y^T W y,

    J = Σ_{ii′} (y_i − y_{i′})² W_{ii′} ,

so that connected points stay close together, and using the locality preservation requirement, ‖y‖²_D = y^T D y, which weighs the relative importance of each y_i using the diagonal matrix D with D_{ii} = Σ_{i′} W_{ii′}, such that the "roughness" of the map is minimized,

    y^T D y = 1  ⇒  (a^T X) D (X^T a) = 1 .

Note that this weighs the vector of images, y, as a whole. By its definition,
the weight matrix D provides a natural measure on the data points. The optimization problem then becomes

    min_a  a^T X L X^T a   subject to   a^T X D X^T a = 1 ,   (5.52)

where L is the Laplacian matrix, L = D − W. As with the Laplacian on a manifold, the discrete graph version presented here penalizes deviation from smoothness. The above optimization problem is equivalent to solving the generalized eigenvalue problem

    (λ, a) :  X L X^T a = λ X D X^T a .   (5.53)
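The steps above can be sketched as follows: a k-NN graph with heat-kernel weights, the graph Laplacian, and the generalized eigenproblem (5.53) solved with the smallest eigenvalues retained. The small ridge added to keep XᵀDX definite and the particular defaults are our choices for this sketch:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, k=5, t=1.0, l=2):
    """Locality Preserving Projections (He & Niyogi), minimal sketch.
      X : (m, n) data points in rows.
    Returns an (n, l) projection matrix whose columns solve Eq. (5.53)."""
    m = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]              # k nearest neighbours (no self)
    W = np.zeros((m, m))
    for i in range(m):
        W[i, idx[i]] = np.exp(-D2[i, idx[i]] / t)         # heat-kernel weights
    W = np.maximum(W, W.T)                                # symmetrize the graph
    Dm = np.diag(W.sum(axis=1))
    Lap = Dm - W                                          # graph Laplacian L = D - W
    A_mat = X.T @ Lap @ X
    B_mat = X.T @ Dm @ X + 1e-9 * np.eye(X.shape[1])      # ridge keeps B definite
    w, V = eigh(A_mat, B_mat)                             # ascending eigenvalues
    return V[:, :l]                                       # smallest-eigenvalue directions
```

Each point is then mapped by y_i = Pᵀ x_i; on data sampled along a noisy line, the leading direction recovers the line's parameterization.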
We would now like to sparsely represent the dataset by applying SMF in the reduced-dimensionality projection space E^l defined by the manifold learning. Note that while the geometric object is nonlinear, the projection operation of the LPP method is linear. As before, we want to extract the CTRs from the fMRI data. Using LPP we first reduce the dimensionality of the dataset in the temporal dimension. We found empirically that a latent dimension of 5 was sufficient to capture the essential information in the data for subsequent processing. We then ran VB-SMF on the reduced dataset. The algorithm converged in about 40 iterations. The evolution of the negative free energy is shown in Fig. 5.9. Examining the variances, α_l^{−1}, of the columns of the mixing matrix, a_l, it is clear that there is a dominant regressor, while the rest contribute very little in explaining the data. A histogram of the {α_l^{−1}}_{l=1}^{L} is shown in Fig. 5.10. Selecting the component that corresponds to the index with the highest variance as the CTR, we obtain the spatial map and time-course shown in Fig. 5.11. The correlation coefficient between the VB-SMF extracted timecourse and the reference function from
[Figure 5.9: Evolution of the negative free energy, F, of the VB-SMF model run on the reduced-dimensionality data set of Friston et al. [71].]
the GLM was r_{VB−SMF} = 0.922. Finally, as in sections 3.4.3 and 4.6.2, we plot the ROC curve with respect to the spatial map obtained from the GLM in Fig. 5.12. The ROC power, defined as the area under the curve (AUC), was AUC_{VB−SMF} = 0.867.
5.5 Discussion
We have presented a sparse representation model incorporating wavelets and
sparsity-inducing adaptive priors under a full Bayesian paradigm. This en-
ables the estimation of both latent variables and basis functions, in a prob-
abilistic graphical modelling formalism. We employed a variational frame-
work for efficient inference. The preliminary results presented here suggest
improved performance compared to other state-of-the-art model-free tools,
such as PICA, while potentially allowing for more interpretable activation
patterns due to the implicit denoising.
[Figure 5.10: Variances of the columns of the mixing matrix, {α_l^{−1}}_{l=1}^{L}, showing that there is a dominant regressor describing the data set. The x-axis is indexed by l and the y-axis ranges from 0 to 0.4.]
[Figure 5.11: Time course (thick black curve) and corresponding spatial map resulting from applying the VB-SMF model to the reduced-dimensionality fMRI data set of Friston et al. [71]. The time-paradigm is shown as the blue curve and the expected timecourse as the green one.]
[Figure 5.12: Receiver operating characteristic curve for the spatial map of the data set of Friston et al. [71].]
Chapter 6
Epilogue
In this thesis we have studied the problem of decomposing data into components when sparsity of the representation is a key prior assumption. Sparsity, as an algorithmic paradigm for empirical data modelling, has become increasingly popular in Machine Learning and Applied Statistics over the past few years, as witnessed by the flurry of research activity in this area.
Our work focused on modelling the blind inverse problem of source sepa-
ration under sparsity constraints, understanding the factors which influence
its efficiency, empirically evaluating its performance, and characterizing and
improving its behavior. We studied these issues for the problem of decom-
posing and representing datasets, especially high-dimensional data matrices,
with emphasis on Imaging Science as our primary application domain. We
focused our attention on extracting meaningful activations from weak spatio-
temporal signals such as functional MRI.
Neuroimaging techniques, such as functional magnetic resonance imaging (fMRI), have revolutionised the field of neuroscience. They provide non-invasive, or minimally invasive, regionally-specific in vivo measurements of brain activity. However, these measurements are only indirect observations of the underlying neural processes. Reconstructing brain activations
from observations is an inverse problem. Consequently, there is a need for
mathematical models, linking observations and domain-specific variables of
interest, and algorithms for making inferences given the data and the mod-
els. Currently, these models are sometimes based on the physical generative
process, but more often on statistical principles.
The main tool for “model-free” decompositions of neuroimaging data into
components currently in use is ICA. While ICA is a powerful decomposition
method, and oftentimes uses sparse source models, its fundamental under-
lying assumption is statistical independence. For many real-world physical
and biological datasets this assumption is not plausible.
In this Thesis we proposed three methods that are based explicitly on
sparsity, and implicitly on a generalized notion of smoothness. We directly
encoded properties of the components that are favored by neuroscientists in
our priors, in particular the properties of smoothness, localization, and spar-
sity. The localization property of activations in space is neuroscientifically
plausible, and leads to a notion of ‘sparsity’ via the use of basis functions
with local support. We emphasize here that locality is fully compatible with
the idea of possibly disconnected, distributed activations/representations.
The smoothness property seems biologically reasonable: fMRI measures
changes in blood oxygenation, which induce a “blurring” effect, and
smoothness accounts for both the spatially extended nature of the brain
haemodynamic response and the distributed nature of neuronal sources.
This is one of the main reasons (along with increasing the SNR) for the
spatial smoothing preprocessing
typically employed in fMRI image analysis. In contrast to the fixed-width
Gaussian kernel smoothing usually used in practice, however, multiresolution
families encompass many spatial scales, obviating the need to pre-select a
particular spatial scale and providing additional benefits, which will be
discussed below. Functional MRI data analysis is therefore a very fruitful
application of sparse models.
However, the aim of sparse representation seems intrinsically significant and
fundamental beyond neuroimaging. As discussed in the text, sparseness (with
respect to appropriate dictionaries) leads to energy-efficient, minimum-entropy
representations, a characteristic that seems to have a physical significance.
The material of this Thesis is split into six Chapters. The context and
background of the research presented in this Thesis were given in the first
chapter, where the problem and the motivating application of the developed
methods were also introduced.
The required background on methods for data decomposition was pro-
vided in the second chapter. The material was organized from the point
of view of second versus higher-order decompositions. The various methods
were motivated as techniques for seeking structure in data. In particular, an
intuitive geometric interpretation was emphasized, where component analy-
sis methods are seen as seeking an intrinsic coordinate system in data space
that best reveals the structure within the data distribution. (This will
later lead to an insight about a possible reason for the apparent success of
ICA-type separation methods for fMRI.) While second-order methods such
as PCA are still valuable as a pre-processing step, e.g. for dimensionality
reduction purposes, the orthogonality constraint implicitly imposed on the
principal components leads to unphysical results if these are interpreted as
the spatial maps and corresponding time-courses. It was emphasized there that
the inverse problem of source extraction requires statistics beyond
simple covariance and correlation. However, PCA can be recast as an
optimization problem that minimizes the least-squares reconstruction error.
(This observation forms the basis for the penalized decomposition model,
described in the third chapter, that enforces sparsity by adding a
sparsity-inducing penalty term.) Various independent component analysis algorithms
were then introduced and their probabilistic interpretation was given. ICA
is currently the standard workhorse for exploratory analysis in fMRI, and
forms the ‘gold standard’ for model-free methods. The importance of
appropriate source-model choice was then stressed and illustrated with a
simple image-separation example. The failure of ICA to separate those
sources, even though they were independently generated, was given as a first
indication of the need for a different paradigm for source recovery: sparsity.
The main topic of this Thesis, sparse representation and data decomposi-
tion, was then introduced. Sparse representations were motivated by results
from the neuroscience community, and in particular the ability of the sparse
model of Olshausen and Field [137] to reproduce the receptive fields in the
primary visual cortex by learning sparse representations of natural scenes.
This led to a line of research in vision, statistics, and machine learning into
efficient representations of data. For our purposes, sparse representation
forms the foundation for source separation and decomposition of data matri-
ces. The methods of Donoho [50], and Zibulevsky and Pearlmutter [176] are
particularly relevant here, the latter actually forming the base formulation
for the sparse component analysis model presented in Chapter 3. However,
the former method is not adapted to the geometry of the patterns of
activations in fMRI, while the latter suffers from the problems of gradient methods.
In Chapters 3 and 4 we propose methods that address these issues. Prob-
abilistic modelling plays again an important role here, by allowing natural
extensions to more general model representations. Chapter 2 concluded with
an introduction to the Bayesian approach to data analysis, providing the
statistical framework for this Thesis.
Chapters 3 to 5 described the new methods for sparse data decomposi-
tion and blind source separation developed for this Thesis. Motivated by the
results of Benharrosh et al. [17], Golden [77], and Daubechies et al. [46],
in Chapter 3 we proposed to view the problem of blind source separation as
an inverse problem with sparsity constraints. This allows us to use results
from that field, and in particular the recently introduced method of iterative
thresholding. With the judicious use of appropriate constraints, we regu-
larize the inversion toward the desired solution for our problem domain. In
particular, we impose constraints that capture the neuroscientifically plausi-
ble properties of smoothness, localization, and sparsity. This Thesis makes
the following contributions:
• Views BSS as an inverse problem and provides an interpretation of the
resulting objective function in an energy minimization framework.
• Applies a novel algorithm for BSS, Iterative Thresholding.
• Proposes a novel way of estimating the crucial threshold parameter.
• Empirically evaluates its performance on complex spatio-temporal datasets
such as fMRI.
In particular, inspired by the energy model of Olshausen and Field [137],
we first formulated the problem in an energy minimization framework using
non-quadratic constraints that promote sparse (minimum entropy) solutions.
This allows us to make the connection with results from the neural compu-
tation literature and to formulate the optimization functional in an intuitive
way. We then followed Donoho’s [50] and Zibulevsky and Pearlmutter’s [176]
idea of representing the unknown signals in a signal dictionary. The central
idea here is to directly exploit the sparse representability of the spatial part
of the solution in appropriate dictionaries. Using mathematical arguments,
we defined precisely the kind of dictionary that has the potential to produce
sparse representations for the class of signals we are interested in extracting.
We then adapted the recently developed method of iterative thresholding of
Daubechies et al. [45] to the case of sparse representations with an unknown
basis and proposed a practical, if heuristic, method for estimating the cru-
cial, threshold parameter for fMRI data. After a few simulations that aimed
226
at demonstrating the applicability of SCA-IT to acoustic and natural image
data we turned to our main application, fMRI data decomposition. We first
applied our method to the simulated experiment of [46] and showed that it
successfully separated mixtures in which standard ICA fails. Furthermore,
in contrast to standard ICA, when applied to real fMRI data our model was
capable of extracting activations that are represented in single spatial maps,
and which could be interpreted as corresponding to a single physiological
cause.
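The core iteration just described, a Landweber step followed by a shrinkage/thresholding step, can be sketched for the simpler case of a known mixing operator; the alternating estimation of the unknown mixing matrix in SCA-IT is omitted, and all sizes and parameter values below are illustrative:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ist(y, A, lam, n_iter=2000):
    """Iterative soft thresholding for min_s ||y - A s||^2 + 2 lam ||s||_1,
    assuming the operator is rescaled so that ||A|| <= 1 (a Landweber
    gradient step followed by shrinkage)."""
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        s = soft(s + A.T @ (y - A @ s), lam)
    return s

# Toy sparse-recovery problem: a 3-sparse coefficient vector observed
# through a random 30x60 operator with a little noise.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60))
A /= np.linalg.norm(A, 2)                 # keep the iteration non-expansive
s_true = np.zeros(60)
s_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = A @ s_true + 0.01 * rng.standard_normal(30)
s_hat = ist(y, A, lam=0.01)
```

The thresholding step is what produces exact zeros, i.e. sparsity, in the recovered coefficients.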
The threshold estimation procedure proposed in Chapter 3, although it works
well in practice for fMRI data, is rather ad hoc. It makes certain
assumptions about a number of settings that may not hold in more general
data processing problems. A more principled approach would be needed if
the model aims to be appealing in a variety of blind inverse problem sit-
uations. We addressed the above problems under a Bayesian paradigm in
Chapter 4. The interpretation of classical models from a probabilistic point
of view introduced in the second chapter highlights better the structure of
component analysis models in general: one can define a “family tree of com-
ponent analysis models” that are constructed using different probabilistic as-
sumptions about the latent signals, the observation operator and the noise,
and that are linked using increasingly sophisticated priors. This structure
can be visualized using directed acyclic graphs. In Chapter 4 we proposed
to reinterpret the variational optimization functional of Daubechies et al.
[45] from a Bayesian point of view. We derived an efficient variational EM-
type algorithm for blind linear inverse problems by lower-bounding the data
likelihood and we showed that it contains the original iterative thresholding
algorithm of Daubechies et al. as a special case. As in the original IT al-
gorithm of Daubechies et al., our algorithm is guaranteed to converge,
increasing the data likelihood at each iteration, owing to the convergence
properties of the EM algorithm. In contrast to the original variational
functional, however, where the regularization coefficients have to be set a priori
or estimated using a variety of ad-hoc procedures, in our algorithm these
have a precise probabilistic meaning, as hyperparameters of probability dis-
tributions. This allows us to estimate them from the data itself. We exploit
this fact and use a model for the expansion coefficients of the latent signals
that adapts to the sparsity characteristics of the individual signals at hand.
We prove that the optimal value of the threshold is precisely a function of
those hyperparameters and the system noise precision. As also noted by
MacKay [116], in contrast to the classical methods, the Bayesian approach
automatically uses the semantically correct objects during estimation: for
the threshold on the expansion coefficients of the latent signals it only makes
sense to use the state noise precision, not the sensor one.
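The relation between the threshold and the model hyperparameters can be illustrated on a toy scalar model (not the Chapter 4 algorithm itself): for an observation y = θ + ε with noise precision β and a Laplace prior of scale b on θ, the MAP estimate is soft-thresholding of y with threshold 1/(βb), a function of the prior hyperparameter and the noise precision. A numerical check (all values illustrative):

```python
import numpy as np

beta, b = 4.0, 0.5            # noise precision and Laplace scale (illustrative)
t = 1.0 / (beta * b)          # the implied soft threshold

def soft(y, t):
    return np.sign(y) * max(abs(y) - t, 0.0)

def map_grid(y):
    """Brute-force MAP of p(theta | y) for y = theta + noise,
    noise ~ N(0, 1/beta), theta ~ Laplace(0, b)."""
    th = np.linspace(-5, 5, 200001)
    logpost = -0.5 * beta * (y - th) ** 2 - np.abs(th) / b
    return th[np.argmax(logpost)]

for y in (-2.0, -0.3, 0.1, 1.7):
    assert abs(map_grid(y) - soft(y, t)) < 1e-3
```

Estimating β and b from the data thus fixes the threshold automatically, rather than requiring it to be tuned by hand.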
We applied Bayesian Iterative Thresholding to the problem of blind source
separation. We first experimented with an audio separation problem used in
[147], [?], and [149] as a benchmark. Our method achieved higher reconstruction
accuracy than these previous methods. On a more realistic task, that
of separating fMRI data, the Bayesian IT algorithm performed better than
the energy IT (which in turn outperformed ICA on the same task).
Finally, Chapter 5 dealt with the related problem of sparse matrix
factorization. We especially focused on decomposing data matrices that rep-
resent spatio-temporal data. Given a data matrix, the goal is to factorize
it into a product of a matrix containing the dynamics times a matrix con-
taining the spatial variation, or equivalently, as an inference problem, invert
the effect of these two factors. We enforced sparsity by transforming the
spatial dimension of signals in a wavelet basis and using a sparse prior on
the corresponding wavelet coefficients. This sparsity-enforcing combination
allowed us to avoid the problems mentioned in remark 6, that is that directly
modelling the gray-scale values of an image using a MoG may result in loss
of geometric information. Spatial wavelets have a larger support than voxels
and function as “super-voxels” capturing local geometric features. Then
the sparse prior models distributions in feature-space, not voxel intensities
themselves. To ensure that a parsimonious set of temporal bases is inferred
we used a hierarchical, ‘sparse Bayesian’-style prior and automatic relevance
determination. Signal reconstruction and parameter estimation were carried
out in a fully Bayesian paradigm. As in all modelling work, there is a trade-off
between mathematical elegance and efficiency. For large-scale datasets ob-
taining an analytic solution is desirable. We chose to work in the variational
Bayesian framework, using distributions from the conjugate-exponential fam-
ily. The model typically converged in far fewer iterations than the iterative
thresholding ones. Testing the model on benchmark fMRI data, it outper-
formed state-of-the-art ICA algorithms while at the same time extracting
much cleaner maps, which can potentially offer better interpretability. We
finally experimented with using a manifold learning technique, ‘Locality Pre-
serving Projections’ (LPP), for dimensionality reduction with very promising
results.
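The “super-voxel” intuition, that localized, piecewise-constant spatial patterns become sparse once moved into a wavelet basis, can be illustrated with a hand-rolled orthonormal Haar transform (a stand-in for the wavelet families used in the chapter; the signal is illustrative):

```python
import numpy as np

def haar(x):
    """Orthonormal 1-D Haar wavelet transform (input length a power of two).
    Piecewise-constant, localized patterns become sparse in this basis."""
    out, a = [], np.asarray(x, float)
    while a.size > 1:
        out.append((a[0::2] - a[1::2]) / np.sqrt(2))   # detail coefficients
        a = (a[0::2] + a[1::2]) / np.sqrt(2)           # approximation
    out.append(a)
    return np.concatenate(out[::-1])

# A piecewise-constant "spatial map": 96 of 128 voxels are non-zero,
# yet only a handful of Haar coefficients are.
x = np.repeat([0.0, 2.0, -1.0, 0.5], 32)
c = haar(x)
```

A sparse prior placed on `c` rather than on the voxel intensities therefore models local geometric features directly, which is the role the wavelet-domain prior plays in the VB-SMF model.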
Open Questions and Future Work
Although the performance and benefits of the sparse decomposition algo-
rithms presented in this Thesis have been comprehensively demonstrated,
there are still some practical issues to address. We discuss these, along with
some suggestions for possible future extensions, below.
• The overcomplete source separation experiment for the cocktail party
problem discussed in Sect. 3.4.2 (Lee et al., [107]) provided results that
are on par with those of [107] but not better than those of [76]. This
is not an inherent deficiency of the method, however. As noted by
Zibulevsky et al. in [177] and Li et al. in [111], wavelet bases are not
the best adapted to these types of signals, and the best results were
obtained using the spectrogram directly. More experiments, using a
better dictionary, could be performed using this approach.
• For the energy IT model, an interesting alternative to explore and com-
pare with for estimating the regularization coefficients is the homotopy
method. As noted earlier, the optimization cost function of Eq. 3.11
can be seen as searching for a minimizer that is a smooth interpolation
between a solution minimizing an ℓ2 measure (for the data-fit term) and
a sparseness penalty term, controlled by the Lagrange multiplier λC.
The homotopy method explores the whole regularization path between
these two extremes. This will provide a principled way of applying
the method to a wider variety of applications beyond fMRI. We also
note that the least angle regression (LARS) method [57] used the same
approach as an extension to LASSO sparse modeling [81].
• While the wavelet transform is known to largely decorrelate the obser-
vations, there are still some residual dependencies among wavelet co-
efficients. One could exploit these and design priors that model these
dependencies; see [41], for example. This could lead to even better
separation results. The Bayesian framework directly supports these hi-
erarchical extensions to the models discussed in this Thesis. This is an
ongoing area of research.
• The fully Bayesian variational approach allows for sparse priors and ma-
trix factorization models conditioned directly on wavelet transformed
data. We showed that this significantly improves separation results and
allows for simultaneous unmixing and wavelet-based denoising. Ide-
ally, however, the functional mapping between the observations and the
wavelet coefficients (which we may take to be non-linear as well) would
be inferred along with the sparse matrix factorization (SMF) model.
This leads naturally to non-linear extensions of SMF. A promising
avenue of research in this direction is to combine the VB-SMF model
with the data-dependent non-linear transformations coming from the
manifold learning field. Preliminary experiments with such a combination
were shown in Chapter 5.
• For the VB-SMF model, the prior for the timecourses, as it is now, does
not exploit the temporal structure in the signals. Timeseries models
could provide a better extraction of the dynamics of the activations.
Work in this area by Woolrich et al. [171] and others has provided
examples and methodologies of such an approach.
• The software implementation of the models discussed in this Thesis
is that of research-level prototypes. Therefore, in terms of speed and
robustness, there is still work to be done. From a software engineer-
ing viewpoint, specialized data structures for storing, processing, and
retrieving the huge datasets encountered in fMRI could be used. An-
other promising area is to exploit the inherent parallelism offered by
advances in hardware. Variational Bayesian learning can be imple-
mented as message-passing on graphical probabilistic models, leading
to the Variational Message Passing algorithm [170], which factorizes
over cliques. This hints at the possibility of implementing sets of esti-
mation equations for performing inference that are updated in parallel.
• Finally, as commented on earlier in the Thesis, full results of applying
sparse models on neuroscientific data, and their importance for brain
science, will have to be thoroughly investigated. This should be done
in collaboration with experts in the field of neuroscience.
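Returning to the homotopy suggestion above: for an orthonormal design the entire ℓ1 regularization path has a closed form, namely soft-thresholding of the least-squares coefficients, so one can watch the active set shrink as the multiplier grows. A minimal sketch (NumPy; sizes and values illustrative; general designs require LARS/homotopy solvers):

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((40, 40)))
A = Q[:, :10]                               # orthonormal design: A^T A = I
s_true = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = A @ s_true + 0.05 * rng.standard_normal(40)

z = A.T @ y                                 # least-squares coefficients

def lasso_solution(lam):
    # With A^T A = I, the minimizer of (1/2)||y - A s||^2 + lam ||s||_1
    # is soft-thresholding of z, so the path is piecewise linear in lam.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Sweeping lam traces the regularization path: the active set shrinks
# monotonically from the least-squares solution toward zero.
supports = [np.count_nonzero(lasso_solution(l)) for l in (0.5, 1.5, 2.5, 5.0)]
```

The homotopy/LARS methods compute this path exactly for general (non-orthonormal) designs, which is what makes them a candidate for setting the regularization coefficients of the energy IT model.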
Bibliography
[1] F. Abramovich and B. W. Silverman. Wavelet decomposition approaches to statistical inverse problems. Biometrika, 85(1):115–129, 1998.
[2] S.-I. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.
[3] H. Asari, B. Pearlmutter, and A. Zador. Sparse representations for the cocktail party problem. Journal of Neuroscience, 26(28):7477–7490, 2006.
[4] J. Atick and A. Redlich. Towards a theory of early visual processing. Neural Computation, 2(3):308–320, 1990.
[5] J. Atick and A. Redlich. What does the retina know about natural scenes? Neural Computation, 4(2):196–210, Mar. 1992.
[6] H. Attias. Independent Factor Analysis. Neural Computation, 11(4):803–851, May 1999.
[7] H. Attias. A Variational Bayesian Framework for Graphical Models. In Advances in Neural Information Processing Systems 12, pages 209–215. MIT Press, 2000.
[8] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. J. Mach. Learn. Res., 6:1345–1382, December 2005.
[9] R. Baraniuk. Compressive Signal Processing, 2010.
[10] H. B. Barlow. Unsupervised Learning. Neural Computation, 1(3):295–311, 1989.
[11] H. B. Barlow. Unsupervised Learning. Neural Computation, 1(3):295–311, 1989.
[12] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1:412–423, 1989.
[13] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding Minimum Entropy Codes. Neural Computation, 1(3):412–423, 1989.
[14] J. Basak, A. S., D. Trivedi, M. S. Santhanam, T.-W. Lee, and E. Oja. Weather data mining using independent component analysis. Journal of Machine Learning Research, 5:239–253, 2004.
[15] C. Beckmann and S. Smith. Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans. Med. Imag., 23:137–152, 2004.
[16] A. Bell and T. J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
[17] M. Benharrosh, E. Roussos, S. Takerkart, K. D’Ardenne, W. Richter, J. Cohen, and I. Daubechies. ICA components in fMRI analysis: Independent sources? In 10th Neuroinformatics Annual Meeting, A Decade of Neuroscience Informatics: Looking Ahead, April 2004.
[18] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987.
[19] V. I. Bogachev. Gaussian measures. Amer. Math. Soc., 1998.
[20] B. Borchers. A Bayesian Approach to the Discrete Linear Inverse Problem, 2003.
[21] E. Bullmore, J. Fadili, M. Breakspear, R. Salvador, J. Suckling, and M. Brammer. Wavelets and statistical analysis of functional magnetic resonance images of the human brain. Statistical Methods in Medical Research, 12:375–399, 2003.
[22] W. L. Buntine. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2:159–225, 1994.
[23] V. D. Calhoun and T. Adali. Unmixing fMRI with independent component analysis. IEEE Engineering in Medicine and Biology Magazine, 25(2):79–90, March-April 2006.
[24] V. D. Calhoun, T. Adali, G. D. Pearlson, and J. J. Pekar. Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Human Brain Mapping, 13(1):43–53, 2001.
[25] D. Calvetti and E. Somersalo. An Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, volume 2 of Surveys and Tutorials in the Applied Mathematical Sciences. Springer, 2007.
[26] J.-F. Cardoso. Infomax and maximum likelihood for blind separation. IEEE Signal Process. Lett., 4(4):112–114, 1997.
[27] J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 90(8):2009–2026, Oct. 1998.
[28] J.-F. Cardoso. The three easy routes to independent component analysis; contrasts and geometry. In Proc. of the ICA 2001 workshop, San Diego, Dec. 2001.
[29] M. A. Carreira-Perpinan. Continuous latent variable models for dimensionality reduction and sequential data reconstruction. PhD thesis, University of Sheffield, UK, 2001.
[30] M. Carroll, G. Cecchi, I. Rish, R. Garg, and A. Rao. Prediction and interpretation of distributed neural activity with sparse models. NeuroImage, 44(1):112–122, 2009.
[31] A. Chambolle, R. A. DeVore, N. Lee, and B. J. Lucier. Nonlinear wavelet image processing: Variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. on Image Proc., 7(3):319–355, July 1998.
[32] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic Decomposition by Basis Pursuit. SIAM J. Sci. Comput., 20:33–61, 1998.
[33] Z. Chen and S. Haykin. On different facets of regularization theory. Neural Computation, 14:2791–2846, December 2002.
[34] H. Choi and R. G. Baraniuk. Wavelet Statistical Models and Besov Spaces. In Proceedings of SPIE Technical Conference on Wavelet Applications in Signal Processing VII, Denver, USA, Jul. 1999.
[35] H. Choi and R. G. Baraniuk. Multiple Wavelet Basis Image Denoising Using Besov Ball Projections. IEEE Signal Processing Letters, 11(9), September 2004.
[36] R. Choudrey, W. D. Penny, and S. J. Roberts. An ensemble learning approach to independent component analysis. In Proceedings of the 2000 IEEE Signal Processing Society Workshop, pages 435–444. IEEE Neural Networks for Signal Processing X, 2000.
[37] R. A. Choudrey. Variational Methods for Bayesian Independent Component Analysis. PhD thesis, Department of Engineering Science, University of Oxford, 2003.
[38] R. A. Choudrey and S. J. Roberts. Flexible Bayesian Independent Component Analysis for Blind Source Separation. In Proceedings of ICA-2001, San Diego, USA, December 2001.
[39] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[40] R. T. Cox. Probability, Frequency, and Reasonable Expectation. Am. Jour. Phys., 14:1–13, 1946.
[41] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing, Apr 1998.
[42] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. PNAS, 100(10):5591–5596, May 2003.
[43] G. Darmois. Analyse Generale des Liaisons Stochastiques. Rev. Inst. Internat. Stat., 21:2–8, 1953.
[44] I. Daubechies. Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[45] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57, 2004.
[46] I. Daubechies, E. Roussos, S. Takerkart, M. Benharrosh, C. Golden, K. D’Ardenne, W. Richter, J. Cohen, and J. Haxby. Independent component analysis for brain fMRI does not select for independence. PNAS, 106(26):10415–10422, 2009.
[47] J. Dauwels, S. Korl, and H.-A. Loeliger. Expectation Maximization as Message Passing. In International Symposium on Information Theory (ISIT), Sept. 2005.
[48] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
[49] I. S. Dhillon and D. M. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001.
[50] D. L. Donoho. Sparse components of images and optimal atomic decompositions. Constructive Approximation, 17:353–382, 2001.
[51] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. PNAS, 100(5):2197–2202, March 2003.
[52] D. L. Donoho and A. G. Flesia. Can recent advances in harmonic analysis explain recent findings in natural scene statistics? Network: Computation in Neural Systems, 12:371–393, 2001.
[53] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[54] D. L. Donoho. De-noising by soft-thresholding. IEEE Trans. on Inf. Theory, 42(3):613–627, 1995.
[55] D. Dueck, Q. D. Morris, and B. J. Frey. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21(suppl 1), 2005.
[56] A. W. Eckford. Channel estimation in block fading channels using the factor graph EM algorithm. In Proc. 22nd Biennial Symposium on Communications, 2004.
[57] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004.
[58] R. Everson and S. Roberts. Independent Component Analysis: A Flexible Nonlinearity and Decorrelating Manifold Approach. Neural Computation, 11(8):1957–1983, November 1999.
[59] F. Girosi, M. Jones, and T. Poggio. Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2):219–269, Mar. 1995.
[60] F. O’Sullivan. A Statistical Perspective on Ill-Posed Inverse Problems. Statistical Science, 1(4):502–518, Nov. 1986.
[61] H. Farid and E. H. Adelson. Separating Reflections and Lighting Using Independent Components Analysis. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 262–267, Fort Collins, CO, 1999.
[62] D. J. Field. Wavelets, vision and the statistics of natural scenes. Phil. Trans. R. Soc. Lond. A, 357:2527–2542, 1999.
[63] D. J. Field. Wavelets, vision and the statistics of natural scenes. Phil. Trans. R. Soc. Lond. A, 357(1760):2527–2542, September 1999.
[64] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4:2379–2394, 1987.
[65] B. G. Fitzpatrick. Bayesian analysis in inverse problems. Inverse Problems, 7(5), 1991.
[66] G. Flandin and W. D. Penny. Bayesian fMRI data analysis with sparse spatial basis function priors. NeuroImage, 34(3):1108–1125, 2007.
[67] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biol. Cybernetics, 64(2):165–170, 1990.
[68] D. Freedman and P. Diaconis. On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4):453–476, Dec. 1981.
[69] J. H. Friedman and J. W. Tukey. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers, 23(9):881–890, Sep. 1974.
[70] K. Friston. Models of brain function in neuroimaging. Ann. Rev. Psychol., 56:57–87, 2005.
[71] K. J. Friston, P. Jezzard, and R. Turner. Analysis of functional MRI time-series. Human Brain Mapping, 1(2):153–171, 1994.
[72] Z. Ghahramani. Solving inverse problems using an EM approach to density estimation. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, page 8, 1994.
[73] Z. Ghahramani and M. J. Beal. Variational Inference for Bayesian Mixtures of Factor Analysers. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12 (NIPS 1999), pages 449–455. The MIT Press, 1999.
[74] Z. Ghahramani and S. Roweis. Probabilistic models for unsupervised learning. Tutorial presented at the 1999 NIPS Conference, 1999.
[75] M. Girolami, editor. Advances in Independent Component Analysis. Perspectives in Neural Computing. Springer-Verlag, New York, Aug. 2000.
[76] M. Girolami. A Variational Method for Learning Sparse and Overcomplete Representations. Neural Computation, 13(11):2517–2532, Nov. 2001.
[77] C. Golden. Spatio-Temporal Methods in the Analysis of fMRI Data in Neuroscience. PhD thesis, Princeton University, 2005.
[78] A. Groves. Bayesian Learning Methods for Modelling Functional MRI. PhD thesis, Department of Clinical Neurology, University of Oxford, 2010.
[79] J. Hadamard. Sur les problemes aux derivees partielles et leur signification physique. Princeton University Bulletin, 13:49–52, 1902.
[80] G. F. Harpur and R. W. Prager. Development of low entropy coding in a recurrent network. Network: Computation in Neural Systems, 7, 1996.
[81] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, February 2009.
[82] J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539):2425–2430, September 2001.
[83] X. He and P. Niyogi. Locality Preserving Projections. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.
[84] S. Hochreiter and J. Schmidhuber. Source separation as a by-product of regularization. In Advances in Neural Information Processing Systems, pages 459–465. MIT Press, 1999.
[85] D. R. Hunter and K. Lange. A Tutorial on MM Algorithms. The American Statistician, 58(1), February 2004.
[86] A. Hyvarinen. The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis. Neural Processing Letters, 10(1):1–5, 1999.
[87] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, June 2001.
[88] A. Hyvarinen and R. Karthikesh. Imposing sparsity on the mixing matrix in independent component analysis. Neurocomputing, 49:151–162, 2002. Special Issue on ICA and BSS.
[89] A. Hyvarinen and E. Oja. A Fast Fixed-Point Algorithm for Independent Component Analysis. Neural Computation, 9(7):1483–1492, October 1997.
[90] A. Hyvarinen and E. Oja. Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4–5):411–430, 2000.
[91] M. Ichir and A. Mohammad-Djafari. Bayesian wavelet based signal and image separation. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Jackson Hole, Wyoming (USA), August 2004. American Institute of Physics.
[92] J. Idier, editor. Bayesian Approach to Inverse Problems. Wiley-ISTE, June 2008.
[93] A. Ilin and H. Valpola. On the effect of the form of the posterior approximation in variational learning of ICA models. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, April 2003.
[94] D. J. C. MacKay and J. W. Miskin. Application of Ensemble Learning ICA to Infrared Imaging. In Conference: Independent Component Analysis - ICA, 2000.
[95] T. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Mass. Inst. of Techn., 1997.
[96] A. Jalobeanu, L. Blanc-Feraud, and J. Zerubia. Natural image mod-eling using complex wavelets. In Proc. SPIE Conference on Wavelets,volume 5207, San Diego, August 2003.
[97] E.T. Jaynes and G.L. Bretthorst (editor). Probability Theory: TheLogic of Science. Cambridge University Press, 2003.
[98] C. Jutten and J. Herault. Blind separation of sources, part I: An adap-tive algorithm based on neuromimetic architecture. Signal Processing,24(1):1–10, 1991.
[99] K.H. Knuth. A Bayesian approach to source separation. In C. JuttenJ.-F. Cardoso and P. Loubaton, editors, Proceedings of the First In-ternational Workshop on Independent Component Analysis and SignalSeparation: ICA’99, pages 283–288, Aussios, France, Jan 1999.
[100] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee,and T. J. Sejnowski. Dictionary Learning Algorithms for Sparse Rep-resentation. Neural Computation, 15(2):349–396, February 2003.
[101] L. Landweber. An Iteration Formula for Fredholm Integral Equationsof the First Kind. American Journal of Mathematics, 73(3):615–624,Jul. 1951.
[102] H. Lappalainen. Ensemble learning for independent component analy-sis. In Proceedings of the First International Workshop on IndependentComponent Analyis, pages 7–12, 1999.
[103] H. Lappalainen. Fast Fixed-Point Algorithms for Bayesian Blind Source Separation, 1999.
[104] P. Lauterbur and Sir P. Mansfield. The Nobel Prize in Physiology or Medicine 2003. Nobelprize.org.
[105] N. D. Lawrence and C. M. Bishop. Variational Bayesian Independent Component Analysis. Technical report, 1999.
[106] N. Lazar. The Statistical Analysis of Functional MRI Data. Statistics for Biology and Health. Springer, 2008.
[107] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski. Blind Source Separation of More Sources Than Mixtures Using Overcomplete Representations. IEEE Signal Processing Letters, 6(4), Apr. 1999.
[108] D. Leporini and J.-C. Pesquet. Bayesian wavelet denoising: Besov priors and non-Gaussian noises. Signal Processing, 81(1):55–67, January 2001. Special section on Markov Chain Monte Carlo (MCMC) Methods for Signal Processing.
[109] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, February 2000.
[110] L. Li and T. Speed. Deconvolution of sparse positive spikes: is it ill-posed? Technical Report 586, University of California, Berkeley, Oct 2000.
[111] Y. Li, A. Cichocki, and S.-I. Amari. Analysis of sparse representation and blind source separation. Neural Computation, 16(6):1193–1234, June 2004.
[112] Y. Li, A. Cichocki, S.-I. Amari, S. Shishkin, J. Cao, and F. Gu. Sparse representation and its applications in blind source separation. In Proceedings of the Annual Conference on Neural Information Processing Systems, 2003.
[113] R. Linsker. A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9:1661–1665, November 1997.
[114] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, Jun. 2003.
[115] D. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[116] D. J. C. MacKay. Bayesian Interpolation. Neural Computation, 4(3):415–447, May 1992.
[117] D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448–472, May 1992.
[118] D. J. C. MacKay. Maximum likelihood and covariant algorithms for independent component analysis. Technical report, University of Cambridge, 1996.
[119] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.
[120] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on PAMI, 11(7), 1989.
[121] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 3rd edition, 2008.
[122] S. G. Mallat and Z. Zhang. Matching Pursuits with Time-Frequency Dictionaries. IEEE Transactions on Signal Processing, pages 3397–3415, December 1993.
[123] M. McKeown and T. Sejnowski. Independent component analysis of fMRI data: examining the assumptions. Hum. Brain Mapp., 6(5-6):368–372, 1998.
[124] M. J. McKeown, L. K. Hansen, and T. J. Sejnowski. Independent component analysis of functional MRI: what is signal and what is noise? Current Opinion in Neurobiology, 13(5):620–629, 2003.
[125] M. J. McKeown, S. Makeig, G. G. Brown, T.-P. Jung, S. S. Kindermann, A. J. Bell, and T. J. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6(3):160–188, 1998.
[126] T. Minka. Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems (NIPS), 2000.
[127] T. P. Minka. Expectation-Maximization as lower bound maximization, 1998.
[128] J. W. Miskin and D. J. C. MacKay. Ensemble Learning for Blind Source Separation. In S. J. Roberts and R. M. Everson, editors, Independent Components Analysis: Principles and Practice. Cambridge University Press, 2001.
[129] P. Moulin and J. Liu. Analysis of Multiresolution Image Denoising Schemes Using Generalized Gaussian and Complexity Priors. IEEE Transactions on Information Theory, 45(3), April 1999.
[130] J.-P. Nadal and N. Parga. Nonlinear neurons in the low-noise limit: a factorial code maximizes information transfer. Network: Computation in Neural Systems, 5(4):565–581, 1994.
[131] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, Dept. of Computer Science, University of Toronto, 1994.
[132] R. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, Dordrecht, 1998.
[133] S. Ogawa, T. Lee, A. Kay, and D. Tank. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. PNAS, 87(24):9868–9872, 1990.
[134] S. Ogawa, T. M. Lee, A. R. Kay, and D. W. Tank. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. PNAS, 87(24):9868–9872, December 1990.
[135] D. P. O’Leary. Regularization of ill-posed problems in image restoration. In J. G. Lewis, editor, Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, pages 102–105, Philadelphia, 1994. SIAM Press.
[136] B. A. Olshausen. Learning linear, sparse, factorial codes. Technical Report AIM-1580, Massachusetts Institute of Technology, Cambridge, MA, USA, 1996.
[137] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, June 1996.
[138] B. A. Olshausen and D. J. Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37(23):3311–3325, 1997.
[139] J. A. Palmer, D. P. Wipf, K. Kreutz-Delgado, and B. D. Rao. Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 18, pages 1059–1066. MIT Press, 2006.
[140] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
[141] B. Pearlmutter and L. Parra. Maximum Likelihood Blind Source Separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing Systems 9, pages 613–619, 1997.
[142] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317:314–319, September 1985.
[143] R. M. Neal and J. Zhang. Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees. NIPS Feature Selection Workshop, 2003.
[144] J. O. Ramsay and B. W. Silverman. Functional data analysis. Springer, New York, 2nd edition, 2005.
[145] C. R. Rao. A decomposition theorem for vector variables with a linear structure. Ann. Math. Statist., 40(5):1845–1849, 1969.
[146] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[147] S. Roberts. Independent component analysis: source assessment and separation, a Bayesian approach. Vision, Image and Signal Processing, IEE Proceedings, 145(3):149–154, Jun. 1998.
[148] S. Roberts and R. Everson, editors. Independent Component Analysis: Principles and Practice. Cambridge University Press, 2001.
[149] S. Roberts, E. Roussos, and R. Choudrey. Hierarchy, priors and wavelets: structure and signal modelling using ICA. Signal Processing, 84(2):283–297, 2004. Special Section on Independent Component Analysis and Beyond.
[150] J. K. Romberg, H. Choi, and R. G. Baraniuk. Bayesian tree-structured image modeling using wavelet-domain hidden Markov models. IEEE Transactions on Image Processing, 10(7):1056–1068, Jul 2001.
[151] E. Roussos. A novel ICA algorithm using mixtures of Gaussians. Technical report, CSBMB, Princeton University, 2007.
[152] E. Roussos, S. Roberts, and I. Daubechies. Variational Bayesian Learning for Wavelet Independent Component Analysis. In Proceedings of the 25th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering. AIP, 2005.
[153] E. Roussos, S. Roberts, and I. Daubechies. Variational Bayesian Learning of Sparse Representations and Its Application in Functional Neuroimaging. In Springer Lecture Notes in Artificial Intelligence (LNAI), Survey of the State of the Art series, 2011.
[154] M. W. Seeger. Bayesian Inference and Optimal Design for the Sparse Linear Model. Journal of Machine Learning Research, 9:759–813, Apr 2008.
[155] E. P. Simoncelli. Vision and the statistics of the visual environment. Current Opinion in Neurobiology, 13(2):144–149, April 2003.
[156] D. S. Sivia and J. Skilling. Data Analysis: A Bayesian Tutorial. Oxford University Press, second edition, 2006.
[157] P. Skudlarski, R. T. Constable, and J. C. Gore. ROC Analysis of Statistical Methods Used in Functional MRI: Individual Subjects. NeuroImage, 9(3):311–329, March 1999.
[158] S. Smith, M. Jenkinson, M. Woolrich, C. Beckmann, T. Behrens, H. Johansen-Berg, P. Bannister, M. De Luca, I. Drobnjak, D. Flitney, R. Niazy, J. Saunders, J. Vickers, Y. Zhang, N. De Stefano, J. Brady, and P. Matthews. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage, 23(S1):208–219, 2004.
[159] N. Srebro and T. Jaakkola. Sparse Matrix Factorization of Gene Expression Data. Technical report, MIT Artif. Intel. Lab., 2001. Unpublished note.
[160] J. V. Stone. Independent Component Analysis: A Tutorial Introduction. MIT Press, September 2004.
[161] J. V. Stone, J. Porrill, C. Buchel, and K. Friston. Spatial, Temporal, and Spatiotemporal Independent Component Analysis of fMRI Data. 18th Leeds Statistical Research Workshop on spatial-temporal modelling and its applications, July 1999.
[162] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005.
[163] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.
[164] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[165] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. Winston & Sons, Washington, 1977.
[166] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, Sep 2001.
[167] A. Tonazzini, L. Bedini, and E. Salerno. A Markov Model for Blind Image Separation by a Mean-Field EM algorithm. IEEE Trans. Image Proc., 15(2), 2006.
[168] F. Turkheimer, M. Brett, J. Aston, A. Leff, P. Sargent, R. Wise, P. Grasby, and V. Cunningham. Statistical modelling of PET images in wavelet space. Journal of Cerebral Blood Flow and Metabolism, 20:1610–1618, 2001.
[169] M. Welling and M. Weber. A constrained EM algorithm for Independent Component Analysis. Neural Computation, 13:677–689, 2001.
[170] J. Winn and C. Bishop. Variational message passing. J. Mach. Learn. Res., 6:661–694, December 2005.
[171] M. W. Woolrich, M. Jenkinson, J. M. Brady, and S. M. Smith. Fully Bayesian spatio-temporal modeling of FMRI data. IEEE Trans. Med. Imaging, 23(2):213–231, Feb. 2004.
[172] K. Worsley and K. Friston. Analysis of fMRI time series revisited—again. NeuroImage, 2:173–181, 1995.
[173] X. Shen and F. G. Meyer. Low-dimensional embedding of fMRI datasets. NeuroImage, 41:886–902, 2008.
[174] A. L. Yuille and A. Rangarajan. The Concave-Convex Procedure (CCCP). Neural Computation, 15(4):915–936, Apr. 2003.
[175] R. S. Zemel. A minimum description length framework for unsupervised learning. PhD thesis, University of Toronto, Dept. of Computer Science, 1993.
[176] M. Zibulevsky and B. A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13:863–882, April 2001.
[177] M. Zibulevsky, B. A. Pearlmutter, P. Bofill, and P. Kisilev. Blind source separation by sparse decomposition in a signal dictionary. In S. J. Roberts and R. M. Everson, editors, Independent Components Analysis: Principles and Practice, pages 181–208. Cambridge University Press, 2001.
[178] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 67(2):301–320, 2005.