13 Independent Component Analysis
Seungjin Choi, Pohang University of Science and Technology, Pohang, South Korea
1 Introduction
2 Why Independent Components?
3 Principles
4 Natural Gradient Algorithm
5 Flexible ICA
6 Differential ICA
7 Nonstationary Source Separation
8 Spatial, Temporal, and Spatiotemporal ICA
9 Algebraic Methods for BSS
10 Software
11 Further Issues
12 Summary
Abstract
Independent component analysis (ICA) is a statistical method, the goal of which is to decompose multivariate data into a linear sum of non-orthogonal basis vectors with coefficients (encoding variables, latent variables, or hidden variables) being statistically independent. ICA generalizes widely used subspace analysis methods such as principal component analysis (PCA) and factor analysis, allowing latent variables to be non-Gaussian and basis vectors to be non-orthogonal in general. ICA is a density-estimation method in which a linear model is learned such that the probability distribution of the observed data is best captured, whereas factor analysis aims at best modeling the covariance structure of the observed data. We begin with a fundamental theory and present various principles and algorithms for ICA.
1 Introduction
Independent component analysis (ICA) is a widely used multivariate data analysis method
that plays an important role in various applications such as pattern recognition, medical image
analysis, bioinformatics, digital communications, computational neuroscience, and so on.
ICA seeks a decomposition of multivariate data into a linear sum of non-orthogonal basis
vectors with coefficients being statistically as independent as possible.
We consider a linear generative model where the $m$-dimensional observed data $\mathbf{x} \in \mathbb{R}^m$ is assumed to be generated by a linear combination of $n$ basis vectors $\{\mathbf{a}_i \in \mathbb{R}^m\}$,

$$\mathbf{x} = \mathbf{a}_1 s_1 + \mathbf{a}_2 s_2 + \cdots + \mathbf{a}_n s_n \qquad (1)$$

where $\{s_i \in \mathbb{R}\}$ are encoding variables, representing the extent to which each basis vector is used to reconstruct the data vector. Given $N$ samples, the model (> Eq. 1) can be written in a compact form

$$X = AS \qquad (2)$$

where $X = [\mathbf{x}(1), \ldots, \mathbf{x}(N)] \in \mathbb{R}^{m \times N}$ is a data matrix, $A = [\mathbf{a}_1, \ldots, \mathbf{a}_n] \in \mathbb{R}^{m \times n}$ is a basis matrix, and $S = [\mathbf{s}(1), \ldots, \mathbf{s}(N)] \in \mathbb{R}^{n \times N}$ is an encoding matrix with $\mathbf{s}(t) = [s_1(t), \ldots, s_n(t)]^{\top}$.
Dual interpretation of basis encoding in the model (> Eq. 2) is given as follows:
• When columns in X are treated as data points in m-dimensional space, columns in A are considered as basis vectors and each column in S is the encoding that represents the extent to which each basis vector is used to reconstruct the data vector.
• Alternatively, when rows in X are data points in N-dimensional space, rows in S correspond to basis vectors and each row in A represents an encoding.
A strong application of ICA is the problem of blind source separation (BSS), the goal of which is to restore the sources S (associated with encodings) without knowledge of A, given the data matrix X. ICA and BSS have often been treated as an identical problem since they are closely related to each other. In BSS, the matrix A is referred to as a mixing matrix. In practice, we find a linear transformation W, referred to as a demixing matrix, such that the rows of the output matrix

$$Y = WX \qquad (3)$$

are statistically independent. Assume that the sources (rows of S) are statistically independent. In such a case, it is well known that WA becomes a transparent transformation when the rows of Y are statistically independent. The transparent transformation is given by $WA = P\Lambda$, where $P$ is a permutation matrix and $\Lambda$ is a nonsingular diagonal matrix involving scaling. This transparent transformation reflects two indeterminacies in ICA (Comon 1994): (1) scaling ambiguity; (2) permutation ambiguity. In other words, entries of Y correspond to scaled and permuted entries of S.
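As a concrete illustration of these indeterminacies, the following NumPy sketch (a toy setup of our own, not taken from the chapter) builds mixtures X = AS from two independent non-Gaussian sources and checks that a demixing matrix satisfying WA = PΛ recovers the sources only up to permutation and scaling.

# A minimal synthetic BSS setup (illustrative only; names are our own choices).
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Two statistically independent, non-Gaussian sources: S is n x N as in Eq. 2.
S = np.vstack([rng.laplace(size=N),          # super-Gaussian source
               rng.uniform(-1, 1, size=N)])  # sub-Gaussian source

A = rng.normal(size=(2, 2))    # unknown mixing matrix
X = A @ S                      # observed mixtures, X = AS

# Any demixing matrix of the form W = P Lambda inv(A) is a valid BSS solution:
P = np.array([[0.0, 1.0], [1.0, 0.0]])       # permutation
Lam = np.diag([2.0, -0.5])                   # nonsingular scaling
W = P @ Lam @ np.linalg.inv(A)
Y = W @ X                                    # rows of Y are scaled, permuted sources

print(np.allclose(Y, P @ Lam @ S))           # True: Y = P Lambda S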
Since Jutten and Herault’s first solution (Jutten and Herault 1991) to ICA, various
methods have been developed so far, including a neural network approach (Cichocki and
Unbehauen 1996), information maximization (Bell and Sejnowski 1995), natural gradient (or
relative gradient) learning (Amari et al. 1996; Cardoso and Laheld 1996; Amari and Cichocki
1998), maximum likelihood estimation (Pham 1996; MacKay 1996; Pearlmutter and Parra
1997; Cardoso 1997), and nonlinear principal component analysis (PCA) (Karhunen 1996;
Oja 1995; Hyvarinen and Oja 1997). Several books on ICA (Lee 1998; Hyvarinen et al. 2001;
Haykin 2000; Cichocki and Amari 2002; Stone 1999) are available, serving as a good resource
for a thorough review and tutorial on ICA. In addition, tutorial papers on ICA (Hyvarinen
1999; Choi et al. 2005) are useful resources.
This chapter begins with a fundamental idea, emphasizing why independent components
are sought. Then, well-known principles to tackle ICA are introduced, leading to an objective
function to be optimized. The natural gradient algorithm for ICA is explained. We also
elucidate how we incorporate nonstationarity or temporal information into the standard
ICA framework.
2 Why Independent Components?
PCA is a popular subspace analysis method that has been used for dimensionality reduction and
feature extraction. Given a data matrix $X \in \mathbb{R}^{m \times N}$, the covariance matrix $R_{xx}$ is computed by

$$R_{xx} = \frac{1}{N} X H X^{\top}$$

where $H = I_{N \times N} - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^{\top}$ is the centering matrix, $I_{N \times N}$ is the $N \times N$ identity matrix, and $\mathbf{1}_N = [1, \ldots, 1]^{\top} \in \mathbb{R}^N$. The rank-$n$ approximation of the covariance matrix $R_{xx}$ is of the form

$$R_{xx} \approx U \Lambda U^{\top}$$

where $U \in \mathbb{R}^{m \times n}$ contains in its columns the $n$ eigenvectors associated with the $n$ largest eigenvalues of $R_{xx}$, and the corresponding eigenvalues are the diagonal entries of the diagonal matrix $\Lambda$. Then, principal components $\mathbf{z}(t)$ are determined by projecting data points $\mathbf{x}(t)$ onto these eigenvectors, leading to

$$\mathbf{z}(t) = U^{\top}\mathbf{x}(t)$$

or, in a compact form,

$$Z = U^{\top} X$$

It is well known that the rows of Z are uncorrelated with each other.
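A minimal NumPy sketch of this PCA computation (our own illustration; the function name principal_components is hypothetical) is:

# Principal components via eigendecomposition of the centered sample covariance.
import numpy as np

def principal_components(X, n):
    """X is m x N (columns are data points); returns Z = U^T X with n rows."""
    m, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    Rxx = (X @ H @ X.T) / N                    # sample covariance
    evals, evecs = np.linalg.eigh(Rxx)         # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :n]                  # n leading eigenvectors
    return U.T @ X, U

The sample covariance of the returned Z is the diagonal matrix of leading eigenvalues, which is the sense in which the rows of Z are uncorrelated.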
Fig. 1
Two-dimensional data with two main arms are fitted by two different basis vectors: (a) principal component analysis (PCA) makes the implicit assumption that the data have a Gaussian distribution and determines optimal basis vectors that are orthogonal, which are not efficient at representing non-orthogonal distributions; (b) independent component analysis (ICA) does not require that the basis vectors be orthogonal and considers non-Gaussian distributions, which is more suitable for fitting more general types of distributions.
ICA generalizes PCA in the sense that latent variables (components) are non-Gaussian and
A is allowed to be a non-orthogonal transformation, whereas PCA considers only orthogonal
transformation and implicitly assumes Gaussian components. > Figure 1 shows a simple
example, emphasizing the main difference between PCA and ICA.
A core theorem that plays an important role in ICA is presented. It provides a fundamental
principle for various unsupervised learning algorithms for ICA and BSS.
Theorem 1 (Skitovich–Darmois) Let $\{s_1, s_2, \ldots, s_n\}$ be a set of independent random variables. Consider two random variables $y_1$ and $y_2$ which are linear combinations of $\{s_i\}$,

$$y_1 = a_1 s_1 + \cdots + a_n s_n, \qquad y_2 = b_1 s_1 + \cdots + b_n s_n \qquad (4)$$

where $\{a_i\}$ and $\{b_i\}$ are real constants. If $y_1$ and $y_2$ are statistically independent, then each variable $s_i$ for which $a_i b_i \neq 0$ is Gaussian.
Consider the linear model (> Eq. 2) for $m = n$. Throughout this chapter, the simplest case, where $m = n$ (square mixing), is considered. The global transformation can be defined as $G = WA$, where $A$ is the mixing matrix and $W$ is the demixing matrix. With this definition, the output $\mathbf{y}(t)$ can be written as

$$\mathbf{y}(t) = W\mathbf{x}(t) = G\mathbf{s}(t) \qquad (5)$$

It is assumed that both $A$ and $W$ are nonsingular, hence $G$ is nonsingular. Under this assumption, one can easily see that if $\{y_i(t)\}$ are mutually independent non-Gaussian signals, then, invoking > Theorem 1, $G$ has the decomposition

$$G = P\Lambda \qquad (6)$$

This explains how ICA performs BSS.
3 Principles
The task of ICA is to estimate the mixing matrix $A$ or its inverse $W = A^{-1}$ (referred to as the demixing matrix) such that elements of the estimate $\mathbf{y} = A^{-1}\mathbf{x} = W\mathbf{x}$ are as independent as possible. For the sake of simplicity, the index $t$ is often left out if the time structure does not have to be considered. In this section, four different principles are reviewed: (1) maximum likelihood estimation; (2) mutual information minimization; (3) information maximization; and (4) negentropy maximization.
3.1 Maximum Likelihood Estimation
Suppose that sources $\mathbf{s}$ are independent with marginal distributions $q_i(s_i)$:

$$q(\mathbf{s}) = \prod_{i=1}^{n} q_i(s_i) \qquad (7)$$

In the linear model, $\mathbf{x} = A\mathbf{s}$, a single factor in the likelihood function is given by

$$p(\mathbf{x}\,|\,A, q) = \int p(\mathbf{x}\,|\,\mathbf{s}, A)\, q(\mathbf{s})\, d\mathbf{s} = \int \prod_{j=1}^{n} \delta\!\left(x_j - \sum_{i=1}^{n} A_{ji} s_i\right) \prod_{i=1}^{n} q_i(s_i)\, d\mathbf{s} \qquad (8)$$

$$= |\det A|^{-1} \prod_{i=1}^{n} q_i\!\left(\sum_{j=1}^{n} A^{-1}_{ij} x_j\right) \qquad (9)$$

Then, we have

$$p(\mathbf{x}\,|\,A, q) = |\det A|^{-1}\, q(A^{-1}\mathbf{x}) \qquad (10)$$
The log-likelihood is written as

$$\log p(\mathbf{x}\,|\,A, q) = -\log|\det A| + \log q(A^{-1}\mathbf{x}) \qquad (11)$$

which can also be written as

$$\log p(\mathbf{x}\,|\,W, q) = \log|\det W| + \log p(\mathbf{y}) \qquad (12)$$

where $W = A^{-1}$ and $\mathbf{y}$ is the estimate of $\mathbf{s}$ with the true distribution $q(\cdot)$ replaced by a hypothesized distribution $p(\cdot)$. Since the sources are assumed to be statistically independent, (> Eq. 12) is written as

$$\log p(\mathbf{x}\,|\,W, q) = \log|\det W| + \sum_{i=1}^{n} \log p_i(y_i) \qquad (13)$$

The demixing matrix $W$ is determined by

$$\widehat{W} = \arg\max_{W} \left\{ \log|\det W| + \sum_{i=1}^{n} \log p_i(y_i) \right\} \qquad (14)$$
It is well known that maximum likelihood estimation is equivalent to Kullback matching, where the optimal model is estimated by minimizing the Kullback–Leibler (KL) divergence between the empirical distribution and the model distribution. Consider the KL divergence from the empirical distribution $\tilde{p}(\mathbf{x})$ to the model distribution $p_{\theta}(\mathbf{x}) = p(\mathbf{x}\,|\,A, q)$:

$$\mathrm{KL}[\tilde{p}(\mathbf{x})\,\|\,p_{\theta}(\mathbf{x})] = \int \tilde{p}(\mathbf{x}) \log \frac{\tilde{p}(\mathbf{x})}{p_{\theta}(\mathbf{x})}\, d\mathbf{x} = -H(\tilde{p}) - \int \tilde{p}(\mathbf{x}) \log p_{\theta}(\mathbf{x})\, d\mathbf{x} \qquad (15)$$

where $H(\tilde{p}) = -\int \tilde{p}(\mathbf{x}) \log \tilde{p}(\mathbf{x})\, d\mathbf{x}$ is the entropy of $\tilde{p}$. Given a set of data points $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ drawn from the underlying distribution $p(\mathbf{x})$, the empirical distribution $\tilde{p}(\mathbf{x})$ puts probability $\frac{1}{N}$ on each data point, leading to

$$\tilde{p}(\mathbf{x}) = \frac{1}{N} \sum_{t=1}^{N} \delta(\mathbf{x} - \mathbf{x}_t) \qquad (16)$$
It follows from (> Eq. 15) that

$$\arg\min_{\theta}\, \mathrm{KL}[\tilde{p}(\mathbf{x})\,\|\,p_{\theta}(\mathbf{x})] = \arg\max_{\theta} \left\langle \log p_{\theta}(\mathbf{x}) \right\rangle_{\tilde{p}} \qquad (17)$$

where $\langle \cdot \rangle_{\tilde{p}}$ represents the expectation with respect to the distribution $\tilde{p}$. Plugging (> Eq. 16) into the right-hand side of (> Eq. 15) leads to

$$\left\langle \log p_{\theta}(\mathbf{x}) \right\rangle_{\tilde{p}} = \frac{1}{N} \int \sum_{t=1}^{N} \delta(\mathbf{x} - \mathbf{x}_t) \log p_{\theta}(\mathbf{x})\, d\mathbf{x} = \frac{1}{N} \sum_{t=1}^{N} \log p_{\theta}(\mathbf{x}_t) \qquad (18)$$

Apart from the scaling factor $\frac{1}{N}$, this is just the log-likelihood function. In other words, maximum likelihood estimation is obtained from the minimization of (> Eq. 15).
3.2 Mutual Information Minimization
Mutual information is a measure of statistical independence. The demixing matrix $W$ is learned such that the mutual information of $\mathbf{y} = W\mathbf{x}$ is minimized, leading to the following objective function:

$$\mathcal{J}_{\mathrm{mi}} = \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\prod_{i=1}^{n} p_i(y_i)}\, d\mathbf{y} = -H(\mathbf{y}) - \sum_{i=1}^{n} \left\langle \log p_i(y_i) \right\rangle_{\mathbf{y}} \qquad (19)$$

where $H(\cdot)$ represents the entropy, that is,

$$H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y})\, d\mathbf{y} \qquad (20)$$

and $\langle \cdot \rangle_{\mathbf{y}}$ denotes the statistical average with respect to the distribution $p(\mathbf{y})$. Note that $p(\mathbf{y}) = \frac{p(\mathbf{x})}{|\det W|}$. Thus, the objective function (> Eq. 19) is given by

$$\mathcal{J}_{\mathrm{mi}} = -\log|\det W| - \sum_{i=1}^{n} \left\langle \log p_i(y_i) \right\rangle \qquad (21)$$
where $\langle \log p(\mathbf{x}) \rangle$ is left out since it does not depend on the parameters $W$. For online learning, only the instantaneous value is taken into consideration, leading to

$$\mathcal{J}_{\mathrm{mi}} = -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \qquad (22)$$
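As a small illustration, the instantaneous objective (> Eq. 22) can be evaluated directly once a source model $p_i$ is hypothesized; the sketch below assumes a Laplacian model $p_i(y_i) \propto e^{-|y_i|}$, which is our own choice since the chapter leaves $p_i$ unspecified at this point.

# Instantaneous ICA objective of Eq. 22 under a hypothesized Laplacian source model.
import numpy as np

def ica_objective(W, x):
    """J = -log|det W| - sum_i log p_i(y_i), with p_i a unit Laplacian density."""
    y = W @ x
    log_p = -np.abs(y) - np.log(2.0)           # log of Laplacian density
    return -np.log(np.abs(np.linalg.det(W))) - np.sum(log_p)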
3.3 Information Maximization
Infomax (Bell and Sejnowski 1995) involves the maximization of the entropy of the output $\mathbf{z} = g(\mathbf{y})$, where $\mathbf{y} = W\mathbf{x}$ and $g(\cdot)$ is a squashing function (e.g., $g_i(y_i) = \frac{1}{1 + e^{-y_i}}$). It was shown that Infomax contrast maximization is equivalent to the minimization of the KL divergence between the distribution of $\mathbf{y} = W\mathbf{x}$ and the distribution $p(\mathbf{s}) = \prod_{i=1}^{n} p_i(s_i)$. In fact, Infomax is nothing but mutual information minimization in the ICA framework.
The Infomax contrast function is given by

$$\mathcal{J}_{I}(W) = H(g(W\mathbf{x})) \qquad (23)$$

where $g(\mathbf{y}) = [g_1(y_1), \ldots, g_n(y_n)]^{\top}$. If $g_i(\cdot)$ is differentiable, then it is the cumulative distribution function of some probability density function $q_i(\cdot)$,

$$g_i(y_i) = \int_{-\infty}^{y_i} q_i(s_i)\, ds_i$$

Let us choose the squashing function

$$g_i(y_i) = \frac{1}{1 + e^{-y_i}} \qquad (24)$$

where $g_i(\cdot): \mathbb{R} \to (0, 1)$ is a monotonically increasing function.

Let us consider an $n$-dimensional random vector $\widehat{\mathbf{s}}$, the joint distribution of which factors into the product of marginal distributions:

$$q(\widehat{\mathbf{s}}) = \prod_{i=1}^{n} q_i(\widehat{s}_i) \qquad (25)$$

Then $g_i(\widehat{s}_i)$ is distributed uniformly on $(0, 1)$, since $g_i(\cdot)$ is the cumulative distribution function of $\widehat{s}_i$. Define $\mathbf{u} = g(\widehat{\mathbf{s}}) = [g_1(\widehat{s}_1), \ldots, g_n(\widehat{s}_n)]^{\top}$, which is distributed uniformly on $(0, 1)^n$.

Define $\mathbf{v} = g(W\mathbf{x})$. Then, the Infomax contrast function is rewritten as

$$\mathcal{J}_{I}(W) = H(g(W\mathbf{x})) = H(\mathbf{v}) = -\int p(\mathbf{v}) \log p(\mathbf{v})\, d\mathbf{v} = -\int p(\mathbf{v}) \log \frac{p(\mathbf{v})}{\prod_{i=1}^{n} \mathbf{1}_{(0,1)}(v_i)}\, d\mathbf{v} = -\mathrm{KL}[\mathbf{v}\,\|\,\mathbf{u}] = -\mathrm{KL}[g(W\mathbf{x})\,\|\,\mathbf{u}] \qquad (26)$$
where $\mathbf{1}_{(0,1)}(\cdot)$ denotes the uniform density on $(0, 1)$. Note that the KL divergence is invariant under an invertible transformation $f$,

$$\mathrm{KL}[f(\mathbf{u})\,\|\,f(\mathbf{v})] = \mathrm{KL}[\mathbf{u}\,\|\,\mathbf{v}] = \mathrm{KL}[f^{-1}(\mathbf{u})\,\|\,f^{-1}(\mathbf{v})]$$

Therefore, we have

$$\mathcal{J}_{I}(W) = -\mathrm{KL}[g(W\mathbf{x})\,\|\,\mathbf{u}] = -\mathrm{KL}[W\mathbf{x}\,\|\,g^{-1}(\mathbf{u})] = -\mathrm{KL}[W\mathbf{x}\,\|\,\widehat{\mathbf{s}}\,] \qquad (27)$$

It follows from (> Eq. 27) that maximizing $\mathcal{J}_{I}(W)$ (the Infomax principle) is identical to minimizing the KL divergence between the distribution of the output vector $\mathbf{y} = W\mathbf{x}$ and the distribution of $\widehat{\mathbf{s}}$, whose entries are statistically independent. In other words, Infomax is equivalent to mutual information minimization in the framework of ICA.
3.4 Negentropy Maximization
Negative entropy or negentropy is a measure of distance to Gaussianity, yielding a larger
value for a random variable whose distribution is far from Gaussian. Negentropy is
always nonnegative and vanishes if and only if the random variable is Gaussian. Negentropy
is defined as

$$J(\mathbf{y}) = H(\mathbf{y}^{G}) - H(\mathbf{y}) \qquad (28)$$

where $H(\mathbf{y}) = E\{-\log p(\mathbf{y})\}$ represents the entropy and $\mathbf{y}^{G}$ is a Gaussian random vector whose mean vector and covariance matrix are the same as those of $\mathbf{y}$. In fact, negentropy is the KL divergence of $p(\mathbf{y}^{G})$ from $p(\mathbf{y})$, that is,

$$J(\mathbf{y}) = \mathrm{KL}\!\left[p(\mathbf{y})\,\|\,p(\mathbf{y}^{G})\right] = \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{p(\mathbf{y}^{G})}\, d\mathbf{y} \qquad (29)$$

leading to (> Eq. 28).
Let us discover a relation between negentropy and mutual information. To this end, we consider the mutual information $I(\mathbf{y})$:

$$I(\mathbf{y}) = I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}) = \sum_{i=1}^{n} H(y_i^{G}) - \sum_{i=1}^{n} J(y_i) + J(\mathbf{y}) - H(\mathbf{y}^{G}) = J(\mathbf{y}) - \sum_{i=1}^{n} J(y_i) + \frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}} \qquad (30)$$

where $R_{yy} = E\{\mathbf{y}\mathbf{y}^{\top}\}$ and $[R_{yy}]_{ii}$ denotes the $i$th diagonal entry of $R_{yy}$.
Assume that $\mathbf{y}$ is already whitened (decorrelated), that is, $R_{yy} = I$. Then the sum of marginal negentropies is given by

$$\sum_{i=1}^{n} J(y_i) = J(\mathbf{y}) - I(\mathbf{y}) + \underbrace{\frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}}}_{0} = -H(\mathbf{y}) - \int p(\mathbf{y}) \log p(\mathbf{y}^{G})\, d\mathbf{y} - I(\mathbf{y}) = -H(\mathbf{x}) - \log|\det W| - I(\mathbf{y}) - \int p(\mathbf{y}) \log p(\mathbf{y}^{G})\, d\mathbf{y} \qquad (31)$$

Invoking $R_{yy} = I$, (> Eq. 31) becomes

$$\sum_{i=1}^{n} J(y_i) = -I(\mathbf{y}) - H(\mathbf{x}) - \log|\det W| + \frac{1}{2} \log|\det R_{yy}| \qquad (32)$$

Note that

$$\frac{1}{2} \log|\det R_{yy}| = \frac{1}{2} \log\left|\det\!\left(W R_{xx} W^{\top}\right)\right| \qquad (33)$$

Therefore, we have

$$\sum_{i=1}^{n} J(y_i) = -I(\mathbf{y}) \qquad (34)$$
where irrelevant terms are omitted. It follows from (>Eq. 34) that maximizing the sum of
marginal negentropies is equivalent to minimizing the mutual information.
4 Natural Gradient Algorithm
In > Sect. 3, four different principles lead to the same objective function

$$\mathcal{J} = -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \qquad (35)$$

That is, ICA boils down to learning $W$, which minimizes (> Eq. 35),

$$\widehat{W} = \arg\min_{W} \left\{ -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \right\} \qquad (36)$$

An easy way to solve (> Eq. 36) is the gradient descent method, which gives a learning algorithm for $W$ of the form

$$\Delta W = -\eta \frac{\partial \mathcal{J}}{\partial W} = \eta \left( W^{-\top} - \varphi(\mathbf{y})\mathbf{x}^{\top} \right) \qquad (37)$$

where $\eta > 0$ is the learning rate and $\varphi(\mathbf{y}) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^{\top}$ is the negative score function whose $i$th element $\varphi_i(y_i)$ is given by

$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{dy_i} \qquad (38)$$
A popular ICA algorithm is based on the natural gradient (Amari 1998), which is known to be efficient since the steepest descent direction is used when the parameter space is a Riemannian manifold. The natural gradient ICA algorithm can be derived as follows (Amari et al. 1996).

Invoking (> Eq. 38), we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i(y_i) \right\} = \sum_{i=1}^{n} \varphi_i(y_i)\, dy_i \qquad (39)$$

$$= \varphi^{\top}(\mathbf{y})\, d\mathbf{y} \qquad (40)$$

where $\varphi(\mathbf{y}) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^{\top}$ and $d\mathbf{y}$ is given in terms of $dW$ as

$$d\mathbf{y} = dW\, W^{-1} \mathbf{y} \qquad (41)$$

Define a modified coefficient differential $dV$ as

$$dV = dW\, W^{-1} \qquad (42)$$

With this definition, we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i(y_i) \right\} = \varphi^{\top}(\mathbf{y})\, dV \mathbf{y} \qquad (43)$$

Calculating an infinitesimal increment of $\log|\det W|$, we have

$$d\{\log|\det W|\} = \mathrm{tr}\{dV\} \qquad (44)$$

where $\mathrm{tr}\{\cdot\}$ denotes the trace, which adds up all diagonal elements. Thus, combining > Eqs. 43 and 44 gives

$$d\mathcal{J} = \varphi^{\top}(\mathbf{y})\, dV \mathbf{y} - \mathrm{tr}\{dV\} \qquad (45)$$

The differential in (> Eq. 45) is in terms of the modified coefficient differential matrix $dV$. Note that $dV$ is a linear combination of the coefficient differentials $dW_{ij}$. Thus, as long as $W$ is nonsingular, $dV$ represents a valid search direction to minimize (> Eq. 35), because $dV$ spans the same tangent space of matrices as that spanned by $dW$. This leads to a stochastic gradient learning algorithm for $V$ given by

$$\Delta V = -\eta \frac{d\mathcal{J}}{dV} = \eta \left\{ I - \varphi(\mathbf{y})\mathbf{y}^{\top} \right\} \qquad (46)$$

Thus the learning algorithm for updating $W$ is described by

$$\Delta W = \Delta V\, W = \eta \left\{ I - \varphi(\mathbf{y})\mathbf{y}^{\top} \right\} W \qquad (47)$$
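A compact sketch of the resulting stochastic natural gradient update (> Eq. 47) is given below; the choice $\varphi(\mathbf{y}) = \tanh(\mathbf{y})$ is a common hypothesized score for super-Gaussian sources (see > Sect. 5), not a prescription from the derivation itself, and all names are our own.

# Natural gradient ICA (Eq. 47) with a tanh nonlinearity.
import numpy as np

def natural_gradient_ica(X, eta=0.01, n_epochs=50, seed=0):
    """X: n x N matrix of mixtures. Returns the estimated demixing matrix W."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        for t in rng.permutation(N):
            y = W @ X[:, t]
            phi = np.tanh(y)                                  # negative score function
            W += eta * (np.eye(n) - np.outer(phi, y)) @ W     # Eq. 47
    return W

# Usage: Y = natural_gradient_ica(X) @ X recovers sources up to scaling and permutation.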
5 Flexible ICA
The optimal nonlinear function $\varphi_i(y_i)$ is given by (> Eq. 38). However, it requires knowledge of the probability distributions of the sources, which are not available. A variety of hypothesized density models has been used. For example, for super-Gaussian source signals, the unimodal or hyperbolic-Cauchy distribution model (MacKay 1996) leads to the nonlinear function given by

$$\varphi_i(y_i) = \tanh(\beta y_i) \qquad (48)$$
Such a sigmoid function was also used by Bell and Sejnowski (1995). For sub-Gaussian source signals, the cubic nonlinear function $\varphi_i(y_i) = y_i^3$ has been a favorite choice. For mixtures of sub- and super-Gaussian source signals, the nonlinear function can be selected from two different choices according to the estimated kurtosis of the extracted signals (Lee et al. 1999).
The flexible ICA (Choi et al. 2000) incorporates the generalized Gaussian density model into the natural gradient ICA algorithm, so that the parameterized nonlinear function provides flexibility in learning. The generalized Gaussian probability distribution is a set of distributions parameterized by a positive real number $\alpha$, which is usually referred to as the Gaussian exponent of the distribution. The Gaussian exponent $\alpha$ controls the "peakiness" of the distribution. The probability density function (PDF) of a generalized Gaussian is described by

$$p(y; \alpha) = \frac{\alpha}{2\lambda\, \Gamma\!\left(\frac{1}{\alpha}\right)}\, e^{-\left|\frac{y}{\lambda}\right|^{\alpha}} \qquad (49)$$

where $\Gamma(x)$ is the Gamma function given by

$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt \qquad (50)$$

Note that if $\alpha = 1$, the distribution becomes the standard Laplacian distribution. If $\alpha = 2$, the distribution is a standard normal distribution (see > Fig. 2).
For a generalized Gaussian distribution, the kurtosis can be expressed in terms of the Gaussian exponent, given by

$$\kappa_{\alpha} = \frac{\Gamma\!\left(\frac{5}{\alpha}\right)\Gamma\!\left(\frac{1}{\alpha}\right)}{\Gamma^{2}\!\left(\frac{3}{\alpha}\right)} - 3 \qquad (51)$$
Fig. 2
The generalized Gaussian distribution is plotted for several different values of the Gaussian exponent, α = 0.8, 1, 2, 4.

Fig. 3
The plot of kurtosis κ_α versus Gaussian exponent α: (a) for a leptokurtic signal; (b) for a platykurtic signal.
The plots of kurtosis $\kappa_{\alpha}$ versus the Gaussian exponent $\alpha$ for leptokurtic and platykurtic signals are shown in > Fig. 3.
From the parameterized generalized Gaussian density model, the nonlinear function in the algorithm (> Eq. 47) is given by

$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{dy_i} = |y_i|^{\alpha_i - 1}\, \mathrm{sgn}(y_i) \qquad (52)$$

where $\mathrm{sgn}(y_i)$ is the signum function of $y_i$.

Note that for $\alpha_i = 1$, $\varphi_i(y_i)$ in (> Eq. 52) becomes a signum function (which can also be derived from the Laplacian density model for sources). The signum nonlinear function is favorable for the separation of speech signals, since natural speech is often modeled by a Laplacian distribution. Note also that for $\alpha_i = 4$, $\varphi_i(y_i)$ in (> Eq. 52) becomes a cubic function, which is known to be a good choice for sub-Gaussian sources.
In order to select a proper value of the Gaussian exponent $\alpha_i$, we estimate the kurtosis of the output signal $y_i$ and select the corresponding $\alpha_i$ from the relationship in > Fig. 3. The kurtosis of $y_i$, $\kappa_i$, can be estimated via the following iterative algorithm:

$$\kappa_i(t+1) = \frac{M_{4i}(t+1)}{M_{2i}^{2}(t+1)} - 3 \qquad (53)$$

where

$$M_{4i}(t+1) = (1 - \delta) M_{4i}(t) + \delta\, |y_i(t)|^{4} \qquad (54)$$

$$M_{2i}(t+1) = (1 - \delta) M_{2i}(t) + \delta\, |y_i(t)|^{2} \qquad (55)$$

where $\delta$ is a small constant, say 0.01.
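The following sketch combines the running moment recursion (> Eqs. 53–55) with the generalized Gaussian score (> Eq. 52); the rule mapping the estimated kurtosis to a Gaussian exponent is a deliberately crude stand-in for the lookup suggested by > Fig. 3, and the names are our own.

# Flexible-ICA ingredients: online kurtosis tracking and the GG nonlinearity.
import numpy as np

def update_kurtosis(M4, M2, y, delta=0.01):
    """One step of the running moment recursion (Eqs. 53-55) for one output y."""
    M4 = (1 - delta) * M4 + delta * abs(y) ** 4
    M2 = (1 - delta) * M2 + delta * abs(y) ** 2
    kappa = M4 / (M2 ** 2) - 3.0
    return kappa, M4, M2

def flexible_phi(y, kappa):
    """Generalized Gaussian score (Eq. 52) with alpha chosen from the sign of kappa."""
    alpha = 1.0 if kappa > 0 else 4.0      # super-Gaussian -> signum, sub-Gaussian -> cubic
    return np.abs(y) ** (alpha - 1.0) * np.sign(y)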
In general, the estimated kurtosis of the demixing filter output does not exactly match
the kurtosis of the original source. However, it provides an idea about whether the estimated
source is a sub-Gaussian signal or a super-Gaussian signal. Moreover, it was shown (Cardoso
1997; Amari and Cardoso 1997) that the performance of the source separation is not degraded
even if the hypothesized density does not match the true density. For these reasons, we suggest
a practical method where only several different forms of nonlinear functions are used.
6 Differential ICA
In a wide sense, most ICA algorithms based on unsupervised learning belong to the Hebb-type rule or its generalizations that adopt nonlinear functions. Motivated by the differential Hebb rule (Kosko 1986) and differential decorrelation (Choi 2002, 2003), we introduce an ICA algorithm employing differential learning and the natural gradient, which leads to a differential ICA algorithm. A random walk model is first introduced for latent variables, in order to show that differential learning can be interpreted as maximum likelihood estimation of a linear generative model. Then the detailed derivation of the differential ICA algorithm is presented.
6.1 Random Walk Model for Latent Variables
Given a set of observation data, {x(t)}, the task of learning the linear generative model
(>Eq. 1) under the constraint of the latent variables being statistically independent, is a
semiparametric estimation problem. The maximum likelihood estimation of basis vectors {ai}
involves a probabilistic model for latent variables that are treated as nuisance parameters.
In order to show a link between differential learning and maximum likelihood estimation, a random walk model for latent variables $s_i(t)$ is considered, which is a simple Markov chain, that is,

$$s_i(t) = s_i(t-1) + \epsilon_i(t) \qquad (56)$$

where the innovation $\epsilon_i(t)$ is assumed to have zero mean with a density function $q_i(\epsilon_i(t))$. In addition, the innovation sequences $\{\epsilon_i(t)\}$ are assumed to be mutually independent white sequences, that is, they are spatially independent and temporally white.
Let us consider latent variables $s_i(t)$ over an $N$-point time block. The vector $\mathbf{s}_i$ can be defined as

$$\mathbf{s}_i = [s_i(0), \ldots, s_i(N-1)]^{\top} \qquad (57)$$

Then the joint PDF of $\mathbf{s}_i$ can be written as

$$p_i(\mathbf{s}_i) = p_i(s_i(0), \ldots, s_i(N-1)) = \prod_{t=0}^{N-1} p_i(s_i(t)\,|\,s_i(t-1)) \qquad (58)$$

where $s_i(t) = 0$ for $t < 0$ and the statistical independence of the innovation sequences is taken into account.

It follows from the random walk model (> Eq. 56) that the conditional probability density of $s_i(t)$ given its past samples can be written as

$$p_i(s_i(t)\,|\,s_i(t-1)) = q_i(\epsilon_i(t)) \qquad (59)$$
Combining (> Eq. 58) and (> Eq. 59) leads to

$$p_i(\mathbf{s}_i) = \prod_{t=0}^{N-1} q_i(\epsilon_i(t)) = \prod_{t=0}^{N-1} q_i\!\left(s_i'(t)\right) \qquad (60)$$

where $s_i'(t) = s_i(t) - s_i(t-1)$, which is the first-order approximation of the differentiation.

Taking the statistical independence of the latent variables and (> Eq. 60) into account, the joint density $p(\mathbf{s}_1, \ldots, \mathbf{s}_n)$ can be written as

$$p(\mathbf{s}_1, \ldots, \mathbf{s}_n) = \prod_{i=1}^{n} p_i(\mathbf{s}_i) = \prod_{t=0}^{N-1} \prod_{i=1}^{n} q_i\!\left(s_i'(t)\right) \qquad (61)$$

The factorial model given in (> Eq. 61) will be used as an optimization criterion in deriving the differential ICA algorithm.
6.2 Algorithm
Denote a set of observation data by

$$\mathbf{x} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \qquad (62)$$

where

$$\mathbf{x}_i = [x_i(0), \ldots, x_i(N-1)]^{\top} \qquad (63)$$

Then the normalized log-likelihood is given by

$$\frac{1}{N} \log p(\mathbf{x}\,|\,A) = -\log|\det A| + \frac{1}{N} \log p(\mathbf{s}_1, \ldots, \mathbf{s}_n) = -\log|\det A| + \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i\!\left(s_i'(t)\right) \qquad (64)$$

The inverse of $A$ is denoted by $W = A^{-1}$. The estimate of the latent variables is denoted by $\mathbf{y}(t) = W\mathbf{x}(t)$. With these variables, the objective function (i.e., the negative normalized log-likelihood) is given by

$$\mathcal{J}_{\mathrm{di}} = -\frac{1}{N} \log p(\mathbf{x}\,|\,A) = -\log|\det W| - \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \qquad (65)$$

where $\mathbf{s}_i$ is replaced by its estimate $\mathbf{y}_i$ and $y_i'(t) = y_i(t) - y_i(t-1)$ (the first-order approximation of the differentiation).

For online learning, the sample average is replaced by the instantaneous value. Hence the online version of the objective function (> Eq. 65) is given by

$$\mathcal{J}_{\mathrm{di}} = -\log|\det W| - \sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \qquad (66)$$
Note that objective function (>Eq. 66) is slightly different from (> Eq. 35) used in the
conventional ICA based on the minimization of mutual information or the maximum
likelihood estimation.
We derive a natural gradient learning algorithm that finds a minimum of (> Eq. 66). To this end, we follow the approach discussed in Amari et al. (1997), Amari (1998), and Choi et al. (2000). The total differential $d\mathcal{J}_{\mathrm{di}}(W)$ due to the change $dW$ is calculated as

$$d\mathcal{J}_{\mathrm{di}} = \mathcal{J}_{\mathrm{di}}(W + dW) - \mathcal{J}_{\mathrm{di}}(W) = d\{-\log|\det W|\} + d\left\{ -\sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \right\} \qquad (67)$$

Define

$$\varphi_i(y_i') = -\frac{d \log q_i(y_i')}{dy_i'} \qquad (68)$$

and construct a vector $\varphi(\mathbf{y}') = [\varphi_1(y_1'), \ldots, \varphi_n(y_n')]^{\top}$. With this definition, we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \right\} = \sum_{i=1}^{n} \varphi_i\!\left(y_i'(t)\right) dy_i'(t) = \varphi^{\top}(\mathbf{y}'(t))\, d\mathbf{y}'(t) \qquad (69)$$

One can easily see that

$$d\{-\log|\det W|\} = -\mathrm{tr}\{dW\, W^{-1}\} \qquad (70)$$

Define a modified differential matrix $dV$ by

$$dV = dW\, W^{-1} \qquad (71)$$

Then, with this modified differential matrix, the total differential $d\mathcal{J}_{\mathrm{di}}(W)$ is computed as

$$d\mathcal{J}_{\mathrm{di}} = -\mathrm{tr}\{dV\} + \varphi^{\top}(\mathbf{y}'(t))\, dV\, \mathbf{y}'(t) \qquad (72)$$

A gradient descent learning algorithm for updating $V$ is given by

$$V(t+1) = V(t) - \eta_t \frac{d\mathcal{J}_{\mathrm{di}}}{dV} = \eta_t \left\{ I - \varphi(\mathbf{y}'(t))\, \mathbf{y}'^{\top}(t) \right\} \qquad (73)$$

Hence, it follows from the relation (> Eq. 71) that the updating rule for $W$ has the form

$$W(t+1) = W(t) + \eta_t \left\{ I - \varphi(\mathbf{y}'(t))\, \mathbf{y}'^{\top}(t) \right\} W(t) \qquad (74)$$
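A sketch of the resulting online update (> Eq. 74) is given below, assuming $\varphi(u) = \tanh(u)$ as the score of the (unspecified) innovation density $q_i$; the variable names are our own.

# Differential ICA (Eq. 74) operating on first-order differences of the outputs.
import numpy as np

def differential_ica(X, eta=0.005, n_epochs=50):
    """X: n x N matrix of mixtures with temporal structure. Returns W."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        for t in range(1, N):
            dy = W @ (X[:, t] - X[:, t - 1])   # y'(t) = y(t) - y(t-1)
            phi = np.tanh(dy)
            W += eta * (np.eye(n) - np.outer(phi, dy)) @ W   # Eq. 74
    return W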
7 Nonstationary Source Separation
So far, we assumed that sources are stationary random processes where the statistics do not
vary over time. In this section, we show how the natural gradient ICA algorithm is modified to
handle nonstationary sources. As in Matsuoka et al. (1995), the following assumptions are
made in this section.
AS1 The mixing matrix $A$ has full column rank.
AS2 Source signals $\{s_i(t)\}$ are statistically independent with zero mean. This implies that the covariance matrix of the source signal vector, $R_s(t) = E\{\mathbf{s}(t)\mathbf{s}^{\top}(t)\}$, is a diagonal matrix, that is,

$$R_s(t) = \mathrm{diag}\{r_1(t), \ldots, r_n(t)\} \qquad (75)$$

where $r_i(t) = E\{s_i^2(t)\}$ and $E$ denotes the statistical expectation operator.
AS3 The ratios $r_i(t)/r_j(t)$ ($i, j = 1, \ldots, n$ and $i \neq j$) are not constant with time.
It should be pointed out that the first two assumptions (AS1, AS2) are common in most existing approaches to source separation; however, the third assumption (AS3) is critical in this chapter. For nonstationary sources, the third assumption is satisfied, and it allows one to separate linear mixtures of sources via second-order statistics (SOS).
For stationary source separation, the typical cost function is based on the mutual infor-
mation, which requires knowledge of the underlying distributions of the sources. Since the
probability distributions of the sources are not known in advance, most ICA algorithms rely
on hypothesized distributions (e.g., see Choi et al. 2000 and references therein). Higher-order
statistics (HOS) should be incorporated either explicitly or implicitly.
For nonstationary sources, Matsuoka et al. have shown that the decomposition (> Eq. 6) is achieved if the cross-correlations $E\{y_i(t) y_j(t)\}$ ($i, j = 1, \ldots, n$, $i \neq j$) are zero at any time instant $t$, provided that the assumptions (AS1)–(AS3) are satisfied. To eliminate the cross-correlations, the following cost function was proposed in Matsuoka et al. (1995),

$$\mathcal{J}(W) = \frac{1}{2} \left\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} - \log \det\!\left( E\{\mathbf{y}(t)\mathbf{y}^{\top}(t)\} \right) \right\} \qquad (76)$$
where $\det(\cdot)$ denotes the determinant of a matrix. The cost function given in (> Eq. 76) is a nonnegative function, which takes its minima if and only if $E\{y_i(t) y_j(t)\} = 0$ for $i, j = 1, \ldots, n$, $i \neq j$. This is a direct consequence of Hadamard's inequality, which is summarized below.

Theorem 2 (Hadamard's inequality) Suppose $K = [k_{ij}]$ is a nonnegative definite symmetric $n \times n$ matrix. Then,

$$\det(K) \leq \prod_{i=1}^{n} k_{ii} \qquad (77)$$

with equality iff $k_{ij} = 0$ for $i \neq j$.

Taking the logarithm of both sides of (> Eq. 77) gives

$$\sum_{i=1}^{n} \log k_{ii} - \log \det(K) \geq 0 \qquad (78)$$

Replacing the matrix $K$ by $E\{\mathbf{y}(t)\mathbf{y}^{\top}(t)\}$, one can easily see that the cost function (> Eq. 76) attains its minima iff $E\{y_i(t) y_j(t)\} = 0$ for $i, j = 1, \ldots, n$ and $i \neq j$.
We compute

$$d\left\{ \log \det\!\left( E\{\mathbf{y}(t)\mathbf{y}^{\top}(t)\} \right) \right\} = 2\, d\{\log \det W\} + d\{\log \det C(t)\} = 2\, \mathrm{tr}\{W^{-1} dW\} + d\{\log \det C(t)\} \qquad (79)$$

where $C(t) = E\{\mathbf{x}(t)\mathbf{x}^{\top}(t)\}$. Define a modified differential matrix $dV$ as

$$dV = W^{-1}\, dW \qquad (80)$$

Then, we have

$$d\left\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} \right\} = 2\, E\{\mathbf{y}^{\top}(t)\, \Lambda^{-1}(t)\, dV\, \mathbf{y}(t)\} \qquad (81)$$

where $\Lambda(t) = \mathrm{diag}\{E\{y_1^2(t)\}, \ldots, E\{y_n^2(t)\}\}$. Proceeding as in > Sect. 4, we can derive the learning algorithm for $W$ of the form

$$\Delta W(t) = \eta_t \left\{ I - \Lambda^{-1}(t)\, \mathbf{y}(t)\mathbf{y}^{\top}(t) \right\} W(t) = \eta_t\, \Lambda^{-1}(t) \left\{ \Lambda(t) - \mathbf{y}(t)\mathbf{y}^{\top}(t) \right\} W(t) \qquad (82)$$
8 Spatial, Temporal, and Spatiotemporal ICA
ICA decomposition, $X = AS$, inherently has a duality. Considering the data matrix $X \in \mathbb{R}^{m \times N}$, where each of its rows is assumed to be a time course of an attribute, ICA decomposition produces $n$ independent time courses. On the other hand, regarding the data matrix in the form of $X^{\top}$, ICA decomposition leads to $n$ independent patterns (for instance, images in fMRI or arrays in DNA microarray data).

The standard ICA (where $X$ is considered) is treated as temporal ICA (tICA). Its dual decomposition (regarding $X^{\top}$) is known as spatial ICA (sICA). Combining these two ideas leads to spatiotemporal ICA (stICA). These variations of ICA were first investigated in Stone et al. (2000). Spatial ICA and spatiotemporal ICA were shown to be useful in fMRI image analysis (Stone et al. 2000) and gene expression data analysis (Liebermeister 2002; Kim and Choi 2005).
Suppose that the singular value decomposition (SVD) of $X$ is given by

$$X = U D V^{\top} = \left(U D^{1/2}\right)\left(V D^{1/2}\right)^{\top} = \widetilde{U}\widetilde{V}^{\top} \qquad (83)$$

where $U \in \mathbb{R}^{m \times n}$, $D \in \mathbb{R}^{n \times n}$, and $V \in \mathbb{R}^{N \times n}$ for $n \leq \min(m, N)$.
8.1 Temporal ICA
Temporal ICA finds a set of independent time courses and a corresponding set of dual unconstrained spatial patterns. It embodies the assumption that each row vector in $\widetilde{V}^{\top}$ consists of a linear combination of $n$ independent sequences, that is, $\widetilde{V}^{\top} = \widetilde{A}_T S_T$, where $S_T \in \mathbb{R}^{n \times N}$ has a set of $n$ independent temporal sequences of length $N$, and $\widetilde{A}_T \in \mathbb{R}^{n \times n}$ is an associated mixing matrix.

Unmixing by $Y_T = W_T \widetilde{V}^{\top}$, where $W_T = P\widetilde{A}_T^{-1}$, leads one to recover the $n$ dual patterns $A_T$ associated with the $n$ independent time courses by calculating $A_T = \widetilde{U} W_T^{-1}$, which is a consequence of $\widetilde{X} = A_T Y_T = \widetilde{U}\widetilde{V}^{\top} = \widetilde{U} W_T^{-1} Y_T$.
8.2 Spatial ICA
Spatial ICA seeks a set of independent spatial patterns $S_S$ and a corresponding set of dual unconstrained time courses $A_S$. It embodies the assumption that each row vector in $\widetilde{U}^{\top}$ is composed of a linear combination of $n$ independent spatial patterns, that is, $\widetilde{U}^{\top} = \widetilde{A}_S S_S$, where $S_S \in \mathbb{R}^{n \times m}$ contains a set of $n$ independent $m$-dimensional patterns and $\widetilde{A}_S \in \mathbb{R}^{n \times n}$ is an encoding variable matrix (mixing matrix).

Define $Y_S = W_S \widetilde{U}^{\top}$, where $W_S$ is a permuted version of $\widetilde{A}_S^{-1}$. With this definition, the $n$ dual time courses $A_S \in \mathbb{R}^{N \times n}$ associated with the $n$ independent patterns are computed as $A_S = \widetilde{V} W_S^{-1}$, since $\widetilde{X}^{\top} = A_S Y_S = \widetilde{V}\widetilde{U}^{\top} = \widetilde{V} W_S^{-1} Y_S$. Each column vector of $A_S$ corresponds to a temporal mode.
8.3 Spatiotemporal ICA
In a linear decomposition, sICA enforces independence constraints over space to find a set of independent spatial patterns, whereas tICA embodies independence constraints over time to seek a set of independent time courses. Spatiotemporal ICA finds a linear decomposition by maximizing the degree of independence over space as well as over time, without necessarily producing independence in either space or time. In fact, it allows a trade-off between the independence of arrays and the independence of time courses.
Given $\widetilde{X} = \widetilde{U}\widetilde{V}^{\top}$, stICA finds the following decomposition:

$$\widetilde{X} = S_S^{\top} \Lambda S_T \qquad (84)$$

where $S_S \in \mathbb{R}^{n \times m}$ contains a set of $n$ independent $m$-dimensional patterns, $S_T \in \mathbb{R}^{n \times N}$ has a set of $n$ independent temporal sequences of length $N$, and $\Lambda$ is a diagonal scaling matrix. There exist two $n \times n$ mixing matrices, $W_S$ and $W_T$, such that $S_S = W_S \widetilde{U}^{\top}$ and $S_T = W_T \widetilde{V}^{\top}$. The following relation

$$\widetilde{X} = S_S^{\top} \Lambda S_T = \widetilde{U} W_S^{\top} \Lambda W_T \widetilde{V}^{\top} = \widetilde{U}\widetilde{V}^{\top} \qquad (85)$$

implies that $W_S^{\top} \Lambda W_T = I$, which leads to

$$W_T = W_S^{-\top} \Lambda^{-1} \qquad (86)$$

The linear transforms $W_S$ and $W_T$ are found by jointly optimizing objective functions associated with sICA and tICA. That is, the objective function for stICA has the form

$$\mathcal{J}_{\mathrm{stICA}} = \alpha\, \mathcal{J}_{\mathrm{sICA}} + (1 - \alpha)\, \mathcal{J}_{\mathrm{tICA}} \qquad (87)$$

where $\mathcal{J}_{\mathrm{sICA}}$ and $\mathcal{J}_{\mathrm{tICA}}$ could be Infomax criteria or log-likelihood functions and $\alpha$ defines the relative weighting for spatial independence and temporal independence. More details on stICA can be found in Stone et al. (2002).
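The shared SVD preprocessing (> Eq. 83) behind tICA, sICA, and stICA can be sketched as follows; reusing the natural gradient routine sketched in > Sect. 4 as the ICA subroutine is our own choice for illustration.

# Reduced SVD factors U_tilde, V_tilde such that X ≈ U_tilde @ V_tilde.T (Eq. 83).
import numpy as np

def svd_factors(X, n):
    """Return U_tilde (m x n) and V_tilde (N x n)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    D_half = np.sqrt(d[:n])
    return U[:, :n] * D_half, Vt[:n, :].T * D_half

# Temporal ICA: apply ICA to the rows of V_tilde.T (time courses),
#   e.g. W_T = natural_gradient_ica(V_tilde.T); S_T = W_T @ V_tilde.T
# Spatial ICA:  apply ICA to the rows of U_tilde.T (spatial patterns).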
9 Algebraic Methods for BSS
Up to now, online ICA algorithms in a framework of unsupervised learning have been
introduced. In this section, several algebraic methods are explained for BSS where matrix
decomposition plays a critical role.
9.1 Fundamental Principle of Algebraic BSS
Algebraic methods for BSS often make use of the eigen-decomposition of correlation matrices
or cumulant matrices. Exemplary algebraic methods for BSS include FOBI (Cardoso 1989),
AMUSE (Tong et al. 1990), JADE (Cardoso and Souloumiac 1993), SOBI (Belouchrani et al.
1997), and SEONS (Choi et al. 2002). Some of these methods (FOBI and AMUSE) are based
on simultaneous diagonalization of two symmetric matrices. Methods such as JADE, SOBI,
and SEONS make use of joint approximate diagonalization of multiple matrices (more than
two). The following theorem provides a fundamental principle for algebraic BSS, justifying
why simultaneous diagonalization of two symmetric data matrices (one of them is assumed to
be positive definite) provides a solution to BSS.
Theorem 3 Let $\Lambda_1, D_1 \in \mathbb{R}^{n \times n}$ be diagonal matrices with positive diagonal entries and $\Lambda_2, D_2 \in \mathbb{R}^{n \times n}$ be diagonal matrices with nonzero diagonal entries. Suppose that $G \in \mathbb{R}^{n \times n}$ satisfies the following decompositions:

$$D_1 = G \Lambda_1 G^{\top} \qquad (88)$$

$$D_2 = G \Lambda_2 G^{\top} \qquad (89)$$

Then the matrix $G$ is a generalized permutation matrix, i.e., $G = P\Lambda$, if $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ have distinct diagonal entries.
Proof It follows from (> Eq. 88) that there exists an orthogonal matrix $Q$ such that

$$G \Lambda_1^{\frac{1}{2}} = D_1^{\frac{1}{2}} Q \qquad (90)$$

Hence,

$$G = D_1^{\frac{1}{2}}\, Q\, \Lambda_1^{-\frac{1}{2}} \qquad (91)$$

Substituting (> Eq. 91) into (> Eq. 89) gives

$$D_1^{-1} D_2 = Q\, \Lambda_1^{-1} \Lambda_2\, Q^{\top} \qquad (92)$$

Since the right-hand side of (> Eq. 92) is the eigen-decomposition of the left-hand side of (> Eq. 92), the diagonal elements of $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ are the same. From the assumption that these diagonal elements are distinct, the orthogonal matrix $Q$ must have the form $Q = PC$, where $C$ is a diagonal matrix whose diagonal elements are either $+1$ or $-1$. Hence, we have

$$G = D_1^{\frac{1}{2}} P C \Lambda_1^{-\frac{1}{2}} = P P^{\top} D_1^{\frac{1}{2}} P C \Lambda_1^{-\frac{1}{2}} = P\Lambda \qquad (93)$$

where

$$\Lambda = P^{\top} D_1^{\frac{1}{2}} P C \Lambda_1^{-\frac{1}{2}}$$

which completes the proof.
9.2 AMUSE
As an example of > Theorem 3, we briefly explain AMUSE (Tong et al. 1990), where a BSS solution is determined by simultaneously diagonalizing the equal-time correlation matrix of $\mathbf{x}(t)$ and a time-delayed correlation matrix of $\mathbf{x}(t)$.

It is assumed that sources $\{s_i(t)\}$ (entries of $\mathbf{s}(t)$) are uncorrelated stochastic processes with zero mean, that is,

$$E\{s_i(t)\, s_j(t - \tau)\} = \delta_{ij}\, \gamma_i(\tau) \qquad (94)$$

where $\delta_{ij}$ is the Kronecker delta and the $\gamma_i(\tau)$ are distinct for $i = 1, \ldots, n$, given $\tau$. In other words, the equal-time correlation matrix of the source, $R_{ss}(0) = E\{\mathbf{s}(t)\mathbf{s}^{\top}(t)\}$, is a diagonal matrix with distinct diagonal entries. Moreover, a time-delayed correlation matrix of the source, $R_{ss}(\tau) = E\{\mathbf{s}(t)\mathbf{s}^{\top}(t - \tau)\}$, is diagonal as well, with distinct nonzero diagonal entries.

It follows from (> Eq. 2) that the correlation matrices of the observation vector $\mathbf{x}(t)$ satisfy

$$R_{xx}(0) = A R_{ss}(0) A^{\top} \qquad (95)$$

$$R_{xx}(\tau) = A R_{ss}(\tau) A^{\top} \qquad (96)$$

for some nonzero time-lag $\tau$, and both $R_{ss}(0)$ and $R_{ss}(\tau)$ are diagonal matrices since the sources are assumed to be spatially uncorrelated.

Invoking > Theorem 3, one can easily see that the inverse of the mixing matrix, $A^{-1}$, can be identified up to a rescaled and permuted version by the simultaneous diagonalization of $R_{xx}(0)$ and $R_{xx}(\tau)$, provided that $R_{ss}^{-1}(0) R_{ss}(\tau)$ has distinct diagonal elements. In other words, a linear transformation $W$ is determined such that $R_{yy}(0)$ and $R_{yy}(\tau)$ of the output $\mathbf{y}(t) = W\mathbf{x}(t)$ are simultaneously diagonalized:

$$R_{yy}(0) = (WA) R_{ss}(0) (WA)^{\top}$$

$$R_{yy}(\tau) = (WA) R_{ss}(\tau) (WA)^{\top}$$

It follows from > Theorem 3 that $WA$ becomes the transparent transformation.
9.3 Simultaneous Diagonalization
We explain how two symmetric matrices are simultaneously diagonalized by a linear transformation. More details on simultaneous diagonalization can be found in Fukunaga (1990). Simultaneous diagonalization consists of two steps (whitening followed by a unitary transformation):

1. First, the matrix $R_{xx}(0)$ is whitened by

$$\mathbf{z}(t) = D_1^{-\frac{1}{2}} U_1^{\top} \mathbf{x}(t) \qquad (97)$$

where $D_1$ and $U_1$ are the eigenvalue and eigenvector matrices of $R_{xx}(0)$,

$$R_{xx}(0) = U_1 D_1 U_1^{\top} \qquad (98)$$

Then, we have

$$R_{zz}(0) = D_1^{-\frac{1}{2}} U_1^{\top} R_{xx}(0) U_1 D_1^{-\frac{1}{2}} = I_m$$

$$R_{zz}(\tau) = D_1^{-\frac{1}{2}} U_1^{\top} R_{xx}(\tau) U_1 D_1^{-\frac{1}{2}}$$

2. Second, a unitary transformation is applied to diagonalize the matrix $R_{zz}(\tau)$. The eigen-decomposition of $R_{zz}(\tau)$ has the form

$$R_{zz}(\tau) = U_2 D_2 U_2^{\top} \qquad (99)$$

Then, $\mathbf{y}(t) = U_2^{\top} \mathbf{z}(t)$ satisfies

$$R_{yy}(0) = U_2^{\top} R_{zz}(0) U_2 = I_m$$

$$R_{yy}(\tau) = U_2^{\top} R_{zz}(\tau) U_2 = D_2$$

Thus, both matrices $R_{xx}(0)$ and $R_{xx}(\tau)$ are simultaneously diagonalized by the linear transform $W = U_2^{\top} D_1^{-\frac{1}{2}} U_1^{\top}$. It follows from > Theorem 3 that $W = U_2^{\top} D_1^{-\frac{1}{2}} U_1^{\top}$ is a valid demixing matrix if all the diagonal elements of $D_2$ are distinct.
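An AMUSE-style sketch of this two-step procedure is given below, assuming zero-mean observations and a user-chosen lag τ; symmetrizing the lagged correlation matrix is a standard practical safeguard rather than part of the derivation, and the names are our own.

# Two-step simultaneous diagonalization of Rxx(0) and Rxx(tau).
import numpy as np

def amuse(X, tau=1):
    """X: n x N zero-mean mixtures. Returns W = U2^T D1^{-1/2} U1^T."""
    n, N = X.shape
    Rxx0 = X @ X.T / N
    d1, U1 = np.linalg.eigh(Rxx0)
    white = np.diag(d1 ** -0.5) @ U1.T           # D1^{-1/2} U1^T (whitening)
    Z = white @ X
    Rzz_tau = Z[:, tau:] @ Z[:, :-tau].T / (N - tau)
    Rzz_tau = 0.5 * (Rzz_tau + Rzz_tau.T)        # symmetrize the lagged correlation
    _, U2 = np.linalg.eigh(Rzz_tau)
    return U2.T @ white                          # valid if the eigenvalues are distinct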
9.4 Generalized Eigenvalue Problem
The simultaneous diagonalization of two symmetric matrices can be carried out without going through the two-step procedure. From the discussion in > Sect. 9.3, we have

$$W R_{xx}(0) W^{\top} = I_n \qquad (100)$$

$$W R_{xx}(\tau) W^{\top} = D_2 \qquad (101)$$

The linear transformation $W$ which satisfies (> Eq. 100) and (> Eq. 101) is the eigenvector matrix of $R_{xx}^{-1}(0) R_{xx}(\tau)$ (Fukunaga 1990). In other words, the matrix $W$ is the generalized eigenvector matrix of the pencil $R_{xx}(\tau) - \lambda R_{xx}(0)$ (Molgedey and Schuster 1994).
Recently, Chang et al. (2000) proposed the matrix pencil method for BSS, where they exploited $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ for $\tau_1 \neq \tau_2 \neq 0$. Since the noise vector was assumed to be temporally white, the two matrices $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ are not theoretically affected by the noise vector, that is,

$$R_{xx}(\tau_1) = A R_{ss}(\tau_1) A^{\top} \qquad (102)$$

$$R_{xx}(\tau_2) = A R_{ss}(\tau_2) A^{\top} \qquad (103)$$

Thus, it is obvious that we can find an estimate of the demixing matrix that is not sensitive to white noise. A similar idea was also exploited in Choi and Cichocki (2000a, b).

In general, the generalized eigenvalue decomposition requires a symmetric-definite pencil (one matrix is symmetric and the other is symmetric and positive definite). However, $R_{xx}(\tau_2) - \lambda R_{xx}(\tau_1)$ is not symmetric-definite, which might cause a numerical instability problem that results in complex-valued eigenvectors.
The set of all matrices of the form $R_1 - \lambda R_2$ with $\lambda \in \mathbb{R}$ is said to be a pencil. Frequently we encounter the case where $R_1$ is symmetric and $R_2$ is symmetric and positive definite. Pencils of this variety are referred to as symmetric-definite pencils (Golub and Loan 1993).

Theorem 4 (Golub and Loan 1993, p. 468) If $R_1 - \lambda R_2$ is symmetric-definite, then there exists a nonsingular matrix $U = [\mathbf{u}_1, \ldots, \mathbf{u}_n]$ such that

$$U^{\top} R_1 U = \mathrm{diag}\{\gamma_1(\tau_1), \ldots, \gamma_n(\tau_1)\} \qquad (104)$$

$$U^{\top} R_2 U = \mathrm{diag}\{\gamma_1(\tau_2), \ldots, \gamma_n(\tau_2)\} \qquad (105)$$

Moreover, $R_1 \mathbf{u}_i = \lambda_i R_2 \mathbf{u}_i$ for $i = 1, \ldots, n$, and $\lambda_i = \frac{\gamma_i(\tau_1)}{\gamma_i(\tau_2)}$.

It is apparent from > Theorem 4 that $R_1$ should be symmetric and $R_2$ should be symmetric and positive definite so that the generalized eigenvector matrix $U$ is a valid solution, provided that the $\{\lambda_i\}$ are distinct.
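Equivalently, the one-shot solution via the generalized eigenvalue problem $R_{xx}(\tau)\mathbf{u} = \lambda R_{xx}(0)\mathbf{u}$ can be sketched with SciPy's symmetric-definite solver; the lag choice and the symmetrization below are our own practical choices.

# Blind separation via a generalized eigenvalue (matrix pencil) problem.
import numpy as np
from scipy.linalg import eigh

def geval_bss(X, tau=1):
    """X: n x N zero-mean mixtures. Rows of the returned W are generalized eigenvectors."""
    n, N = X.shape
    R0 = X @ X.T / N
    Rtau = X[:, tau:] @ X[:, :-tau].T / (N - tau)
    Rtau = 0.5 * (Rtau + Rtau.T)
    _, U = eigh(Rtau, R0)         # solves Rtau u = lambda R0 u (R0 positive definite)
    return U.T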
10 Software
A variety of ICA software is available. ICA Central (URL: http://www.tsi.enst.fr/icacentral/) was created in 1999 to promote research on ICA and BSS by means of public mailing lists, a repository of data sets, a repository of ICA/BSS algorithms, and so on. ICA Central might be the first place where you can find data sets and ICA algorithms. In addition, here are some widely used software packages.
• ICALAB Toolboxes (http://www.bsp.brain.riken.go.jp/ICALAB/): ICALAB is an ICA Matlab software toolbox developed in the Laboratory for Advanced Brain Signal Processing in the RIKEN Brain Science Institute, Japan. It consists of two independent packages, ICALAB for signal processing and ICALAB for image processing, and each package contains a variety of algorithms.
• FastICA (http://www.cis.hut.fi/projects/ica/fastica/): This is the FastICA Matlab package that implements fast fixed-point algorithms for non-Gaussianity maximization (Hyvarinen et al. 2001). It was developed at the Helsinki University of Technology, Finland, and implementations for other environments (R, C++, Python) are also available.
• Infomax ICA (http://www.cnl.salk.edu/~tewon/ica_cnl.html): Matlab and C codes for Bell and Sejnowski's Infomax algorithm (Bell and Sejnowski 1995) and extended Infomax (Lee 1998), where a parametric density model is incorporated into Infomax to handle both super-Gaussian and sub-Gaussian sources.
• EEGLAB (http://sccn.ucsd.edu/eeglab/): EEGLAB is an interactive Matlab toolbox for processing continuous and event-related EEG, MEG, and other electrophysiological data using ICA, time/frequency analysis, artifact rejection, and several modes of data visualization.
• ICA: DTU Toolbox (http://isp.imm.dtu.dk/toolbox/ica/): ICA: DTU Toolbox is a collection of ICA algorithms that includes: (1) icaML, an efficient implementation of Infomax; (2) icaMF, an iterative algorithm that offers a variety of possible source priors and mixing matrix constraints (e.g., positivity) and can also handle over- and under-complete mixing; and (3) icaMS, a "one shot" fast algorithm that requires time correlation between samples.
11 Further Issues
• Overcomplete representation: Overcomplete representation allows the latent space dimension n to be greater than the data dimension m in the linear model (> Eq. 1). Sparseness constraints on latent variables are necessary to learn a fruitful representation (Lewicki and Sejnowski 2000).
• Bayesian ICA: Bayesian ICA incorporates uncertainty and prior distributions of latent variables in the model (> Eq. 1). Independent factor analysis (Attias 1999) is pioneering work along this direction. The EM algorithm for ICA was developed in Welling and Weber (2001), and a full Bayesian ICA (also known as ensemble learning) was developed in Miskin and MacKay (2001).
• Kernel ICA: Kernel methods were introduced to consider statistical independence in a reproducing kernel Hilbert space (Bach and Jordan 2002), yielding kernel ICA.
• Nonnegative ICA: Nonnegativity constraints were imposed on latent variables, yielding nonnegative ICA (Plumbley 2003). Rectified Gaussian priors can also be used in Bayesian ICA to handle nonnegative latent variables.
• Sparseness: Sparseness is another important characteristic of sources, besides independence. Sparse component analysis is studied in Li et al. (2006).
• Beyond ICA: Independent subspace analysis (Hyvarinen and Hoyer 2000) and tree-dependent component analysis (Bach and Jordan 2003) generalize ICA, allowing intra-dependence structures in feature subspaces or clusters.
12 Summary
ICA has been successfully applied in various applications of machine learning, pattern recognition, and signal processing. A brief overview of ICA has been presented, starting from fundamental principles for learning a linear latent variable model for parsimonious representation. Natural gradient ICA algorithms were derived in the framework of maximum likelihood estimation, mutual information minimization, Infomax, and negentropy maximization. We have explained flexible ICA, where a generalized Gaussian density is adopted such that a flexible nonlinear function is incorporated into the natural gradient ICA algorithm. Equivariant nonstationary source separation was presented in the framework of the natural gradient as well. Differential learning was also adopted to incorporate the temporal structure of sources. We also presented a core idea and various methods for algebraic source separation. Various software packages for ICA were introduced, for easy application of ICA. Further issues were also briefly mentioned so that readers can follow the status of ICA.
References
Amari S (1998) Natural gradient works efficiently in
learning. Neural Comput 10(2):251–276
Amari S, Cardoso JF (1997) Blind source separation:
semiparametric statistical approach. IEEE Trans Sig-
nal Process 45:2692–2700
Amari S, Chen TP, Cichocki A (1997) Stability analysis of
learning algorithms for blind source separation.
Neural Networks 10(8):1345–1351
Amari S, Cichocki A (1998) Adaptive blind signal pro-
cessing – neural network approaches. Proc IEEE
(Special Issue on Blind Identification and Estima-
tion) 86(10):2026–2048
Amari S, Cichocki A, Yang HH (1996) A new learning
algorithm for blind signal separation. In: Touretzky
DS, Mozer MC, Hasselmo ME (eds) Advances in
neural information processing systems (NIPS),
vol 8. MIT Press, Cambridge, pp 757–763
Attias H (1999) Independent factor analysis. Neural
Comput 11:803–851
Bach F, Jordan MI (2002) Kernel independent compo-
nent analysis. JMLR 3:1–48
Bach FR, Jordan MI (2003) Beyond independent com-
ponents: trees and clusters. JMLR 4:1205–1233
Bell A, Sejnowski T (1995) An information maximisation
approach to blind separation and blind deconvolu-
tion. Neural Comput 7:1129–1159
Belouchrani A, Abed-Meraim K, Cardoso JF, Moulines E
(1997) A blind source separation technique using
second order statistics. IEEE Trans Signal Process
45:434–444
Cardoso JF (1989) Source separation using higher-order moments. In: Proceedings of the IEEE international
conference on acoustics, speech, and signal proces-
sing (ICASSP), Paris, France, 23–26 May 1989
Cardoso JF (1997) Infomax and maximum likelihood for
source separation. IEEE Signal Process Lett 4(4):
112–114
Cardoso JF, Laheld BH (1996) Equivariant adaptive
source separation. IEEE Trans Signal Process 44
(12):3017–3030
Cardoso JF, Souloumiac A (1993) Blind beamforming for
non Gaussian signals. IEE Proc‐F 140(6):362–370
Chang C, Ding Z, Yau SF, Chan FHY (2000) A matrix‐pencil approach to blind separation of colored non-
stationary signals. IEEE Trans Signal Process 48(3):
900–907
Choi S (2002) Adaptive differential decorrelation: a nat-
ural gradient algorithm. In: Proceedings of the in-
ternational conference on artificial neural networks
(ICANN), Madrid, Spain. Lecture notes in comput-
er science, vol 2415. Springer, Berlin, pp 1168–1173
Choi S (2003) Differential learning and random walk
model. In: Proceedings of the IEEE international
conference on acoustics, speech, and signal proces-
sing (ICASSP), IEEE, Hong Kong, pp 724–727
Choi S, Cichocki A (2000a) Blind separation of nonsta-
tionary and temporally correlated sources from
noisy mixtures. In: Proceedings of IEEE workshop
on neural networks for signal processing, IEEE,
Sydney, Australia. pp 405–414
Choi S, Cichocki A (2000b) Blind separation of nonsta-
tionary sources in noisy mixtures. Electron Lett
36(9):848–849
Choi S, Cichocki A, Amari S (2000) Flexible independent
component analysis. J VLSI Signal Process 26(1/2):
25–38
Choi S, Cichocki A, Belouchrani A (2002) Second order
nonstationary source separation. J VLSI Signal Pro-
cess 32:93–104
Choi S, Cichocki A, Park HM, Lee SY (2005) Blind
source separation and independent component
analysis: a review. Neural Inf Process Lett Rev 6(1):
1–57
Cichocki A, Amari S (2002) Adaptive blind signal and
image processing: learning algorithms and applica-
tions. Wiley, Chichester
Cichocki A, Unbehauen R (1996) Robust neural net-
works with on‐line learning for blind identification
and blind separation of sources. IEEE Trans Circ
Syst Fund Theor Appl 43:894–906
Comon P (1994) Independent component analysis, a
new concept? Signal Process 36(3):287–314
Fukunaga K (1990) An introduction to statistical pattern
recognition. Academic, New York
Golub GH, Loan CFV (1993) Matrix computations,
2nd edn. Johns Hopkins, Baltimore
Haykin S (2000) Unsupervised adaptive filtering: blind
source separation. Prentice‐Hall
Hyvarinen A (1999) Survey on independent component
analysis. Neural Comput Surv 2:94–128
Hyvarinen A, Hoyer P (2000) Emergence of phase- and shift-invariant features by decomposition of natural
images into independent feature subspaces. Neural
Comput 12(7):1705–1720
Hyvarinen A, Karhunen J, Oja E (2001) Independent
component analysis. Wiley, New York
Hyvarinen A, Oja E (1997) A fast fixed-point algorithm for independent component analysis. Neural Com-
put 9:1483–1492
Jutten C, Herault J (1991) Blind separation of sources,
part I: an adaptive algorithm based on neuromi-
metic architecture. Signal Process 24:1–10
Karhunen J (1996) Neural approaches to independent
component analysis and source separation. In: Pro-
ceedings of the European symposium on artificial
neural networks (ESANN), Bruges, Belgium,
pp 249–266
Kim S, Choi S (2005) Independent arrays or independent
time courses for gene expression data. In: Proceed-
ings of the IEEE international symposium on circuits
and systems (ISCAS), Kobe, Japan, 23–26 May 2005
Kosko B (1986) Differential Hebbian learning. In: Pro-
ceedings of American Institute of Physics: neural
networks for computing, Snowbird. American Insti-
tute of Physics, Woodbury, pp 277–282
Lee TW (1998) Independent component analysis: theory
and applications. Kluwer
Lee TW, Girolami M, Sejnowski T (1999) Independent
component analysis using an extended infomax
algorithm for mixed sub‐Gaussian and super‐Gaussian sources. Neural Comput 11(2):609–633
Lewicki MS, Sejnowski T (2000) Learning overcomplete
representation. Neural Comput 12(2):337–365
Li Y, Cichocki A, Amari S (2006) Blind estimation of
channel parameters and source components for
EEG signals: a sparse factorization approach. IEEE
Trans Neural Networ 17(2):419–431
Liebermeister W (2002) Linear modes of gene expression
determined by independent component analysis.
Bioinformatics 18(1):51–60
MacKay DJC (1996) Maximum likelihood and covariant
algorithms for independent component analysis.
Technical Report Draft 3.7, University of
Cambridge, Cavendish Laboratory
Matsuoka K, Ohya M, Kawamoto M (1995) A neural net
for blind separation of nonstationary signals. Neural
Networks 8(3):411–419
Miskin JW, MacKay DJC (2001) Ensemble learning for
blind source separation. In: Roberts S, Everson R
(eds) Independent component analysis: principles
and practice. Cambridge University Press,
Cambridge, UK, pp 209–233
Molgedey L, Schuster HG (1994) Separation of a mixture
of independent signals using time delayed correla-
tions. Phys Rev Lett 72:3634–3637
Oja E (1995) The nonlinear PCA learning rule and signal
separation – mathematical analysis. Technical Re-
port A26, Helsinki University of Technology, Labo-
ratory of Computer and Information Science
Pearlmutter B, Parra L (1997) Maximum likelihood
blind source separation: a context-sensitive generalization of ICA. In: Mozer MC, Jordan MI, Petsche
T (eds) Advances in neural information processing
systems (NIPS), vol 9. MIT Press, Cambridge,
pp 613–619
Pham DT (1996) Blind separation of instantaneous
mixtures of sources via an independent component
analysis. IEEE Trans Signal Process 44(11):
2768–2779
Plumbley MD (2003) Algorithms for nonnegative inde-
pendent component analysis. IEEE Trans Neural
Network 14(3):534–543
Stone JV (2004) Independent component analysis: a
tutorial introduction. MIT Press, Cambridge
Stone JV, Porrill J, Porter NR, Wilkinson IW (2002)
Spatiotemporal independent component analysis
of event‐related fMRI data using skewed probability
density functions. NeuroImage 15(2):407–421
Tong L, Soon VC, Huang YF, Liu R (1990) AMUSE: a
new blind identification algorithm. In: Proceed-
ings of the IEEE international symposium on cir-
cuits and systems (ISCAS), IEEE, New Orleans,
pp 1784–1787
Welling M, Weber M (2001) A constrained EM algorithm
for independent component analysis. Neural Com-
put 13:677–689