13 Independent Component Analysis
Seungjin Choi, Pohang University of Science and Technology, Pohang, South Korea
1 Introduction
2 Why Independent Components?
3 Principles
4 Natural Gradient Algorithm
5 Flexible ICA
6 Differential ICA
7 Nonstationary Source Separation
8 Spatial, Temporal, and Spatiotemporal ICA
9 Algebraic Methods for BSS
10 Software
11 Further Issues
12 Summary
Abstract
Independent component analysis (ICA) is a statistical method, the goal of which is to decompose multivariate data into a linear sum of non-orthogonal basis vectors with coefficients (encoding variables, latent variables, or hidden variables) being statistically independent. ICA generalizes widely used subspace analysis methods such as principal component analysis (PCA) and factor analysis, allowing latent variables to be non-Gaussian and basis vectors to be non-orthogonal in general. ICA is a density-estimation method in which a linear model is learned such that the probability distribution of the observed data is best captured, whereas factor analysis aims at best modeling the covariance structure of the observed data. We begin with a fundamental theory and present various principles and algorithms for ICA.
1 Introduction
Independent component analysis (ICA) is a widely used multivariate data analysis method
that plays an important role in various applications such as pattern recognition, medical image
analysis, bioinformatics, digital communications, computational neuroscience, and so on.
ICA seeks a decomposition of multivariate data into a linear sum of non-orthogonal basis
vectors with coefficients being statistically as independent as possible.
We consider a linear generative model where the $m$-dimensional observed data $\mathbf{x} \in \mathbb{R}^m$ is assumed to be generated by a linear combination of $n$ basis vectors $\{\mathbf{a}_i \in \mathbb{R}^m\}$,

$$\mathbf{x} = \mathbf{a}_1 s_1 + \mathbf{a}_2 s_2 + \cdots + \mathbf{a}_n s_n \qquad (1)$$

where $\{s_i \in \mathbb{R}\}$ are encoding variables, representing the extent to which each basis vector is used to reconstruct the data vector. Given $N$ samples, the model (> Eq. 1) can be written in a compact form

$$X = AS \qquad (2)$$

where $X = [\mathbf{x}(1), \ldots, \mathbf{x}(N)] \in \mathbb{R}^{m \times N}$ is a data matrix, $A = [\mathbf{a}_1, \ldots, \mathbf{a}_n] \in \mathbb{R}^{m \times n}$ is a basis matrix, and $S = [\mathbf{s}(1), \ldots, \mathbf{s}(N)] \in \mathbb{R}^{n \times N}$ is an encoding matrix with $\mathbf{s}(t) = [s_1(t), \ldots, s_n(t)]^{\top}$.
Dual interpretation of basis encoding in the model (> Eq. 2) is given as follows:
• When columns in X are treated as data points in m-dimensional space, columns in A are considered as basis vectors and each column in S is the encoding that represents the extent to which each basis vector is used to reconstruct the data vector.
• Alternatively, when rows in X are data points in N-dimensional space, rows in S correspond to basis vectors and each row in A represents an encoding.
A strong application of ICA is the problem of blind source separation (BSS), the goal of which is to restore the sources S (associated with encodings) without knowledge of A, given the data matrix X. ICA and BSS have often been treated as an identical problem since they are closely related to each other. In BSS, the matrix A is referred to as a mixing matrix. In practice, we find a linear transformation W, referred to as a demixing matrix, such that the rows of the output matrix

$$Y = WX \qquad (3)$$

are statistically independent. Assume that the sources (rows of S) are statistically independent. In such a case, it is well known that WA becomes a transparent transformation when the rows of Y are statistically independent. The transparent transformation is given by $WA = P\Lambda$, where $P$ is a permutation matrix and $\Lambda$ is a nonsingular diagonal matrix involving scaling. This transparent transformation reflects two indeterminacies in ICA (Comon 1994): (1) scaling ambiguity; (2) permutation ambiguity. In other words, entries of Y correspond to scaled and permuted entries of S.
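As a concrete illustration of these indeterminacies, the following NumPy sketch (a toy setup of our own, not taken from the chapter) builds mixtures X = AS from two independent non-Gaussian sources and checks that a demixing matrix satisfying WA = PΛ recovers the sources only up to permutation and scaling.

# A minimal synthetic BSS setup (illustrative only; names are our own choices).
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Two statistically independent, non-Gaussian sources: S is n x N as in Eq. 2.
S = np.vstack([rng.laplace(size=N),          # super-Gaussian source
               rng.uniform(-1, 1, size=N)])  # sub-Gaussian source

A = rng.normal(size=(2, 2))    # unknown mixing matrix
X = A @ S                      # observed mixtures, X = AS

# Any demixing matrix of the form W = P Lambda inv(A) is a valid BSS solution:
P = np.array([[0.0, 1.0], [1.0, 0.0]])       # permutation
Lam = np.diag([2.0, -0.5])                   # nonsingular scaling
W = P @ Lam @ np.linalg.inv(A)
Y = W @ X                                    # rows of Y are scaled, permuted sources

print(np.allclose(Y, P @ Lam @ S))           # True: Y = P Lambda S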
Since Jutten and Herault’s first solution (Jutten and Herault 1991) to ICA, various
methods have been developed so far, including a neural network approach (Cichocki and
Unbehauen 1996), information maximization (Bell and Sejnowski 1995), natural gradient (or
relative gradient) learning (Amari et al. 1996; Cardoso and Laheld 1996; Amari and Cichocki
1998), maximum likelihood estimation (Pham 1996; MacKay 1996; Pearlmutter and Parra
1997; Cardoso 1997), and nonlinear principal component analysis (PCA) (Karhunen 1996;
Oja 1995; Hyvarinen and Oja 1997). Several books on ICA (Lee 1998; Hyvarinen et al. 2001;
Haykin 2000; Cichocki and Amari 2002; Stone 1999) are available, serving as a good resource
for a thorough review and tutorial on ICA. In addition, tutorial papers on ICA (Hyvarinen
1999; Choi et al. 2005) are useful resources.
This chapter begins with a fundamental idea, emphasizing why independent components
are sought. Then, well-known principles to tackle ICA are introduced, leading to an objective
function to be optimized. The natural gradient algorithm for ICA is explained. We also
elucidate how we incorporate nonstationarity or temporal information into the standard
ICA framework.
2 Why Independent Components?
PCA is a popular subspace analysis method that has been used for dimensionality reduction and
feature extraction. Given a data matrix $X \in \mathbb{R}^{m \times N}$, the covariance matrix $R_{xx}$ is computed by

$$R_{xx} = \frac{1}{N} X H X^{\top}$$

where $H = I_{N \times N} - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^{\top}$ is the centering matrix, $I_{N \times N}$ is the $N \times N$ identity matrix, and $\mathbf{1}_N = [1, \ldots, 1]^{\top} \in \mathbb{R}^N$. The rank-$n$ approximation of the covariance matrix $R_{xx}$ is of the form

$$R_{xx} \approx U \Lambda U^{\top}$$

where $U \in \mathbb{R}^{m \times n}$ contains in its columns the $n$ eigenvectors associated with the $n$ largest eigenvalues of $R_{xx}$, and the corresponding eigenvalues are the diagonal entries of the diagonal matrix $\Lambda$. Then, principal components $\mathbf{z}(t)$ are determined by projecting data points $\mathbf{x}(t)$ onto these eigenvectors, leading to

$$\mathbf{z}(t) = U^{\top}\mathbf{x}(t)$$

or, in a compact form,

$$Z = U^{\top} X$$

It is well known that the rows of Z are uncorrelated with each other.
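A minimal NumPy sketch of this PCA computation (our own illustration; the function name principal_components is hypothetical) is:

# Principal components via eigendecomposition of the centered sample covariance.
import numpy as np

def principal_components(X, n):
    """X is m x N (columns are data points); returns Z = U^T X with n rows."""
    m, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    Rxx = (X @ H @ X.T) / N                    # sample covariance
    evals, evecs = np.linalg.eigh(Rxx)         # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :n]                  # n leading eigenvectors
    return U.T @ X, U

The sample covariance of the returned Z is the diagonal matrix of leading eigenvalues, which is the sense in which the rows of Z are uncorrelated.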
Fig. 1
Two-dimensional data with two main arms are fitted by two different basis vectors: (a) principal component analysis (PCA) makes the implicit assumption that the data have a Gaussian distribution and determines optimal basis vectors that are orthogonal, which are not efficient at representing non-orthogonal distributions; (b) independent component analysis (ICA) does not require that the basis vectors be orthogonal and considers non-Gaussian distributions, which is more suitable for fitting more general types of distributions.
ICA generalizes PCA in the sense that latent variables (components) are non-Gaussian and
A is allowed to be a non-orthogonal transformation, whereas PCA considers only orthogonal
transformation and implicitly assumes Gaussian components. > Figure 1 shows a simple
example, emphasizing the main difference between PCA and ICA.
A core theorem that plays an important role in ICA is presented. It provides a fundamental
principle for various unsupervised learning algorithms for ICA and BSS.
Theorem 1 (Skitovich–Darmois) Let $\{s_1, s_2, \ldots, s_n\}$ be a set of independent random variables. Consider two random variables $y_1$ and $y_2$ which are linear combinations of $\{s_i\}$,

$$y_1 = a_1 s_1 + \cdots + a_n s_n, \qquad y_2 = b_1 s_1 + \cdots + b_n s_n \qquad (4)$$

where $\{a_i\}$ and $\{b_i\}$ are real constants. If $y_1$ and $y_2$ are statistically independent, then each variable $s_i$ for which $a_i b_i \neq 0$ is Gaussian.
Consider the linear model (> Eq. 2) for $m = n$. Throughout this chapter, the simplest case, where $m = n$ (square mixing), is considered. The global transformation can be defined as $G = WA$, where $A$ is the mixing matrix and $W$ is the demixing matrix. With this definition, the output $\mathbf{y}(t)$ can be written as

$$\mathbf{y}(t) = W\mathbf{x}(t) = G\mathbf{s}(t) \qquad (5)$$

It is assumed that both $A$ and $W$ are nonsingular, hence $G$ is nonsingular. Under this assumption, one can easily see that if $\{y_i(t)\}$ are mutually independent non-Gaussian signals, then, invoking > Theorem 1, $G$ has the decomposition

$$G = P\Lambda \qquad (6)$$

This explains how ICA performs BSS.
3 Principles
The task of ICA is to estimate the mixing matrix $A$ or its inverse $W = A^{-1}$ (referred to as the demixing matrix) such that elements of the estimate $\mathbf{y} = A^{-1}\mathbf{x} = W\mathbf{x}$ are as independent as possible. For the sake of simplicity, the index $t$ is often left out if the time structure does not have to be considered. In this section, four different principles are reviewed: (1) maximum likelihood estimation; (2) mutual information minimization; (3) information maximization; and (4) negentropy maximization.
3.1 Maximum Likelihood Estimation
Suppose that sources $\mathbf{s}$ are independent with marginal distributions $q_i(s_i)$:

$$q(\mathbf{s}) = \prod_{i=1}^{n} q_i(s_i) \qquad (7)$$

In the linear model, $\mathbf{x} = A\mathbf{s}$, a single factor in the likelihood function is given by

$$p(\mathbf{x}\,|\,A, q) = \int p(\mathbf{x}\,|\,\mathbf{s}, A)\, q(\mathbf{s})\, d\mathbf{s} = \int \prod_{j=1}^{n} \delta\!\left(x_j - \sum_{i=1}^{n} A_{ji} s_i\right) \prod_{i=1}^{n} q_i(s_i)\, d\mathbf{s} \qquad (8)$$

$$= |\det A|^{-1} \prod_{i=1}^{n} q_i\!\left(\sum_{j=1}^{n} A^{-1}_{ij} x_j\right) \qquad (9)$$

Then, we have

$$p(\mathbf{x}\,|\,A, q) = |\det A|^{-1}\, q(A^{-1}\mathbf{x}) \qquad (10)$$
The log-likelihood is written as

$$\log p(\mathbf{x}\,|\,A, q) = -\log|\det A| + \log q(A^{-1}\mathbf{x}) \qquad (11)$$

which can also be written as

$$\log p(\mathbf{x}\,|\,W, q) = \log|\det W| + \log p(\mathbf{y}) \qquad (12)$$

where $W = A^{-1}$ and $\mathbf{y}$ is the estimate of $\mathbf{s}$ with the true distribution $q(\cdot)$ replaced by a hypothesized distribution $p(\cdot)$. Since the sources are assumed to be statistically independent, (> Eq. 12) is written as

$$\log p(\mathbf{x}\,|\,W, q) = \log|\det W| + \sum_{i=1}^{n} \log p_i(y_i) \qquad (13)$$

The demixing matrix $W$ is determined by

$$\widehat{W} = \arg\max_{W} \left\{ \log|\det W| + \sum_{i=1}^{n} \log p_i(y_i) \right\} \qquad (14)$$
It is well known that maximum likelihood estimation is equivalent to Kullback matching, where the optimal model is estimated by minimizing the Kullback–Leibler (KL) divergence between the empirical distribution and the model distribution. Consider the KL divergence from the empirical distribution $\tilde{p}(\mathbf{x})$ to the model distribution $p_{\theta}(\mathbf{x}) = p(\mathbf{x}\,|\,A, q)$:

$$\mathrm{KL}[\tilde{p}(\mathbf{x})\,\|\,p_{\theta}(\mathbf{x})] = \int \tilde{p}(\mathbf{x}) \log \frac{\tilde{p}(\mathbf{x})}{p_{\theta}(\mathbf{x})}\, d\mathbf{x} = -H(\tilde{p}) - \int \tilde{p}(\mathbf{x}) \log p_{\theta}(\mathbf{x})\, d\mathbf{x} \qquad (15)$$

where $H(\tilde{p}) = -\int \tilde{p}(\mathbf{x}) \log \tilde{p}(\mathbf{x})\, d\mathbf{x}$ is the entropy of $\tilde{p}$. Given a set of data points $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ drawn from the underlying distribution $p(\mathbf{x})$, the empirical distribution $\tilde{p}(\mathbf{x})$ puts probability $\frac{1}{N}$ on each data point, leading to

$$\tilde{p}(\mathbf{x}) = \frac{1}{N} \sum_{t=1}^{N} \delta(\mathbf{x} - \mathbf{x}_t) \qquad (16)$$
It follows from (> Eq. 15) that

$$\arg\min_{\theta}\, \mathrm{KL}[\tilde{p}(\mathbf{x})\,\|\,p_{\theta}(\mathbf{x})] = \arg\max_{\theta} \left\langle \log p_{\theta}(\mathbf{x}) \right\rangle_{\tilde{p}} \qquad (17)$$

where $\langle \cdot \rangle_{\tilde{p}}$ represents the expectation with respect to the distribution $\tilde{p}$. Plugging (> Eq. 16) into the right-hand side of (> Eq. 15) leads to

$$\left\langle \log p_{\theta}(\mathbf{x}) \right\rangle_{\tilde{p}} = \frac{1}{N} \int \sum_{t=1}^{N} \delta(\mathbf{x} - \mathbf{x}_t) \log p_{\theta}(\mathbf{x})\, d\mathbf{x} = \frac{1}{N} \sum_{t=1}^{N} \log p_{\theta}(\mathbf{x}_t) \qquad (18)$$

Apart from the scaling factor $\frac{1}{N}$, this is just the log-likelihood function. In other words, maximum likelihood estimation is obtained from the minimization of (> Eq. 15).
3.2 Mutual Information Minimization
Mutual information is a measure of statistical independence. The demixing matrix $W$ is learned such that the mutual information of $\mathbf{y} = W\mathbf{x}$ is minimized, leading to the following objective function:

$$\mathcal{J}_{\mathrm{mi}} = \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\prod_{i=1}^{n} p_i(y_i)}\, d\mathbf{y} = -H(\mathbf{y}) - \sum_{i=1}^{n} \left\langle \log p_i(y_i) \right\rangle_{\mathbf{y}} \qquad (19)$$

where $H(\cdot)$ represents the entropy, that is,

$$H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y})\, d\mathbf{y} \qquad (20)$$

and $\langle \cdot \rangle_{\mathbf{y}}$ denotes the statistical average with respect to the distribution $p(\mathbf{y})$. Note that $p(\mathbf{y}) = \frac{p(\mathbf{x})}{|\det W|}$. Thus, the objective function (> Eq. 19) is given by

$$\mathcal{J}_{\mathrm{mi}} = -\log|\det W| - \sum_{i=1}^{n} \left\langle \log p_i(y_i) \right\rangle \qquad (21)$$
where $\langle \log p(\mathbf{x}) \rangle$ is left out since it does not depend on the parameters $W$. For online learning, only the instantaneous value is taken into consideration, leading to

$$\mathcal{J}_{\mathrm{mi}} = -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \qquad (22)$$
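As a small illustration, the instantaneous objective (> Eq. 22) can be evaluated directly once a source model $p_i$ is hypothesized; the sketch below assumes a Laplacian model $p_i(y_i) \propto e^{-|y_i|}$, which is our own choice since the chapter leaves $p_i$ unspecified at this point.

# Instantaneous ICA objective of Eq. 22 under a hypothesized Laplacian source model.
import numpy as np

def ica_objective(W, x):
    """J = -log|det W| - sum_i log p_i(y_i), with p_i a unit Laplacian density."""
    y = W @ x
    log_p = -np.abs(y) - np.log(2.0)           # log of Laplacian density
    return -np.log(np.abs(np.linalg.det(W))) - np.sum(log_p)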
3.3 Information Maximization
Infomax (Bell and Sejnowski 1995) involves the maximization of the entropy of the output $\mathbf{z} = g(\mathbf{y})$, where $\mathbf{y} = W\mathbf{x}$ and $g(\cdot)$ is a squashing function (e.g., $g_i(y_i) = \frac{1}{1 + e^{-y_i}}$). It was shown that Infomax contrast maximization is equivalent to the minimization of the KL divergence between the distribution of $\mathbf{y} = W\mathbf{x}$ and the distribution $p(\mathbf{s}) = \prod_{i=1}^{n} p_i(s_i)$. In fact, Infomax is nothing but mutual information minimization in the ICA framework.
The Infomax contrast function is given by

$$\mathcal{J}_{I}(W) = H(g(W\mathbf{x})) \qquad (23)$$

where $g(\mathbf{y}) = [g_1(y_1), \ldots, g_n(y_n)]^{\top}$. If $g_i(\cdot)$ is differentiable, then it is the cumulative distribution function of some probability density function $q_i(\cdot)$,

$$g_i(y_i) = \int_{-\infty}^{y_i} q_i(s_i)\, ds_i$$

Let us choose the squashing function

$$g_i(y_i) = \frac{1}{1 + e^{-y_i}} \qquad (24)$$

where $g_i(\cdot): \mathbb{R} \to (0, 1)$ is a monotonically increasing function.

Let us consider an $n$-dimensional random vector $\widehat{\mathbf{s}}$, the joint distribution of which factors into the product of marginal distributions:

$$q(\widehat{\mathbf{s}}) = \prod_{i=1}^{n} q_i(\widehat{s}_i) \qquad (25)$$

Then $g_i(\widehat{s}_i)$ is distributed uniformly on $(0, 1)$, since $g_i(\cdot)$ is the cumulative distribution function of $\widehat{s}_i$. Define $\mathbf{u} = g(\widehat{\mathbf{s}}) = [g_1(\widehat{s}_1), \ldots, g_n(\widehat{s}_n)]^{\top}$, which is distributed uniformly on $(0, 1)^n$.

Define $\mathbf{v} = g(W\mathbf{x})$. Then, the Infomax contrast function is rewritten as

$$\mathcal{J}_{I}(W) = H(g(W\mathbf{x})) = H(\mathbf{v}) = -\int p(\mathbf{v}) \log p(\mathbf{v})\, d\mathbf{v} = -\int p(\mathbf{v}) \log \frac{p(\mathbf{v})}{\prod_{i=1}^{n} \mathbf{1}_{(0,1)}(v_i)}\, d\mathbf{v} = -\mathrm{KL}[\mathbf{v}\,\|\,\mathbf{u}] = -\mathrm{KL}[g(W\mathbf{x})\,\|\,\mathbf{u}] \qquad (26)$$
where $\mathbf{1}_{(0,1)}(\cdot)$ denotes the uniform density on $(0, 1)$. Note that the KL divergence is invariant under an invertible transformation $f$,

$$\mathrm{KL}[f(\mathbf{u})\,\|\,f(\mathbf{v})] = \mathrm{KL}[\mathbf{u}\,\|\,\mathbf{v}] = \mathrm{KL}[f^{-1}(\mathbf{u})\,\|\,f^{-1}(\mathbf{v})]$$

Therefore, we have

$$\mathcal{J}_{I}(W) = -\mathrm{KL}[g(W\mathbf{x})\,\|\,\mathbf{u}] = -\mathrm{KL}[W\mathbf{x}\,\|\,g^{-1}(\mathbf{u})] = -\mathrm{KL}[W\mathbf{x}\,\|\,\widehat{\mathbf{s}}\,] \qquad (27)$$

It follows from (> Eq. 27) that maximizing $\mathcal{J}_{I}(W)$ (the Infomax principle) is identical to minimizing the KL divergence between the distribution of the output vector $\mathbf{y} = W\mathbf{x}$ and the distribution of $\widehat{\mathbf{s}}$, whose entries are statistically independent. In other words, Infomax is equivalent to mutual information minimization in the framework of ICA.
3.4 Negentropy Maximization
Negative entropy or negentropy is a measure of distance to Gaussianity, yielding a larger
value for a random variable whose distribution is far from Gaussian. Negentropy is
always nonnegative and vanishes if and only if the random variable is Gaussian. Negentropy
is defined as

$$J(\mathbf{y}) = H(\mathbf{y}^{G}) - H(\mathbf{y}) \qquad (28)$$

where $H(\mathbf{y}) = E\{-\log p(\mathbf{y})\}$ represents the entropy and $\mathbf{y}^{G}$ is a Gaussian random vector whose mean vector and covariance matrix are the same as those of $\mathbf{y}$. In fact, negentropy is the KL divergence of $p(\mathbf{y}^{G})$ from $p(\mathbf{y})$, that is,

$$J(\mathbf{y}) = \mathrm{KL}\!\left[p(\mathbf{y})\,\|\,p(\mathbf{y}^{G})\right] = \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{p(\mathbf{y}^{G})}\, d\mathbf{y} \qquad (29)$$

leading to (> Eq. 28).
Let us discover a relation between negentropy and mutual information. To this end, we consider the mutual information $I(\mathbf{y})$:

$$I(\mathbf{y}) = I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}) = \sum_{i=1}^{n} H(y_i^{G}) - \sum_{i=1}^{n} J(y_i) + J(\mathbf{y}) - H(\mathbf{y}^{G}) = J(\mathbf{y}) - \sum_{i=1}^{n} J(y_i) + \frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}} \qquad (30)$$

where $R_{yy} = E\{\mathbf{y}\mathbf{y}^{\top}\}$ and $[R_{yy}]_{ii}$ denotes the $i$th diagonal entry of $R_{yy}$.
Assume that $\mathbf{y}$ is already whitened (decorrelated), that is, $R_{yy} = I$. Then the sum of marginal negentropies is given by

$$\sum_{i=1}^{n} J(y_i) = J(\mathbf{y}) - I(\mathbf{y}) + \underbrace{\frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}}}_{0} = -H(\mathbf{y}) - \int p(\mathbf{y}) \log p(\mathbf{y}^{G})\, d\mathbf{y} - I(\mathbf{y}) = -H(\mathbf{x}) - \log|\det W| - I(\mathbf{y}) - \int p(\mathbf{y}) \log p(\mathbf{y}^{G})\, d\mathbf{y} \qquad (31)$$

Invoking $R_{yy} = I$, (> Eq. 31) becomes

$$\sum_{i=1}^{n} J(y_i) = -I(\mathbf{y}) - H(\mathbf{x}) - \log|\det W| + \frac{1}{2} \log|\det R_{yy}| \qquad (32)$$

Note that

$$\frac{1}{2} \log|\det R_{yy}| = \frac{1}{2} \log\left|\det\!\left(W R_{xx} W^{\top}\right)\right| \qquad (33)$$

Therefore, we have

$$\sum_{i=1}^{n} J(y_i) = -I(\mathbf{y}) \qquad (34)$$
where irrelevant terms are omitted. It follows from (>Eq. 34) that maximizing the sum of
marginal negentropies is equivalent to minimizing the mutual information.
4 Natural Gradient Algorithm
In > Sect. 3, four different principles lead to the same objective function

$$\mathcal{J} = -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \qquad (35)$$

That is, ICA boils down to learning $W$, which minimizes (> Eq. 35),

$$\widehat{W} = \arg\min_{W} \left\{ -\log|\det W| - \sum_{i=1}^{n} \log p_i(y_i) \right\} \qquad (36)$$

An easy way to solve (> Eq. 36) is the gradient descent method, which gives a learning algorithm for $W$ of the form

$$\Delta W = -\eta \frac{\partial \mathcal{J}}{\partial W} = \eta \left( W^{-\top} - \varphi(\mathbf{y})\mathbf{x}^{\top} \right) \qquad (37)$$

where $\eta > 0$ is the learning rate and $\varphi(\mathbf{y}) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^{\top}$ is the negative score function whose $i$th element $\varphi_i(y_i)$ is given by

$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{dy_i} \qquad (38)$$
A popular ICA algorithm is based on the natural gradient (Amari 1998), which is known to be efficient since the steepest descent direction is used when the parameter space is a Riemannian manifold. The natural gradient ICA algorithm can be derived as follows (Amari et al. 1996).

Invoking (> Eq. 38), we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i(y_i) \right\} = \sum_{i=1}^{n} \varphi_i(y_i)\, dy_i \qquad (39)$$

$$= \varphi^{\top}(\mathbf{y})\, d\mathbf{y} \qquad (40)$$

where $\varphi(\mathbf{y}) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^{\top}$ and $d\mathbf{y}$ is given in terms of $dW$ as

$$d\mathbf{y} = dW\, W^{-1} \mathbf{y} \qquad (41)$$

Define a modified coefficient differential $dV$ as

$$dV = dW\, W^{-1} \qquad (42)$$

With this definition, we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i(y_i) \right\} = \varphi^{\top}(\mathbf{y})\, dV \mathbf{y} \qquad (43)$$

Calculating an infinitesimal increment of $\log|\det W|$, we have

$$d\{\log|\det W|\} = \mathrm{tr}\{dV\} \qquad (44)$$

where $\mathrm{tr}\{\cdot\}$ denotes the trace, which adds up all diagonal elements. Thus, combining > Eqs. 43 and 44 gives

$$d\mathcal{J} = \varphi^{\top}(\mathbf{y})\, dV \mathbf{y} - \mathrm{tr}\{dV\} \qquad (45)$$

The differential in (> Eq. 45) is in terms of the modified coefficient differential matrix $dV$. Note that $dV$ is a linear combination of the coefficient differentials $dW_{ij}$. Thus, as long as $W$ is nonsingular, $dV$ represents a valid search direction to minimize (> Eq. 35), because $dV$ spans the same tangent space of matrices as that spanned by $dW$. This leads to a stochastic gradient learning algorithm for $V$ given by

$$\Delta V = -\eta \frac{d\mathcal{J}}{dV} = \eta \left\{ I - \varphi(\mathbf{y})\mathbf{y}^{\top} \right\} \qquad (46)$$

Thus the learning algorithm for updating $W$ is described by

$$\Delta W = \Delta V\, W = \eta \left\{ I - \varphi(\mathbf{y})\mathbf{y}^{\top} \right\} W \qquad (47)$$
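A compact sketch of the resulting stochastic natural gradient update (> Eq. 47) is given below; the choice $\varphi(\mathbf{y}) = \tanh(\mathbf{y})$ is a common hypothesized score for super-Gaussian sources (see > Sect. 5), not a prescription from the derivation itself, and all names are our own.

# Natural gradient ICA (Eq. 47) with a tanh nonlinearity.
import numpy as np

def natural_gradient_ica(X, eta=0.01, n_epochs=50, seed=0):
    """X: n x N matrix of mixtures. Returns the estimated demixing matrix W."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        for t in rng.permutation(N):
            y = W @ X[:, t]
            phi = np.tanh(y)                                  # negative score function
            W += eta * (np.eye(n) - np.outer(phi, y)) @ W     # Eq. 47
    return W

# Usage: Y = natural_gradient_ica(X) @ X recovers sources up to scaling and permutation.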
5 Flexible ICA
The optimal nonlinear function $\varphi_i(y_i)$ is given by (> Eq. 38). However, it requires knowledge of the probability distributions of the sources, which are not available. A variety of hypothesized density models has been used. For example, for super-Gaussian source signals, the unimodal or hyperbolic-Cauchy distribution model (MacKay 1996) leads to the nonlinear function given by

$$\varphi_i(y_i) = \tanh(\beta y_i) \qquad (48)$$
Such a sigmoid function was also used by Bell and Sejnowski (1995). For sub-Gaussian source signals, the cubic nonlinear function $\varphi_i(y_i) = y_i^3$ has been a favorite choice. For mixtures of sub- and super-Gaussian source signals, the nonlinear function can be selected from two different choices according to the estimated kurtosis of the extracted signals (Lee et al. 1999).
The flexible ICA (Choi et al. 2000) incorporates the generalized Gaussian density model into the natural gradient ICA algorithm, so that the parameterized nonlinear function provides flexibility in learning. The generalized Gaussian probability distribution is a set of distributions parameterized by a positive real number $\alpha$, which is usually referred to as the Gaussian exponent of the distribution. The Gaussian exponent $\alpha$ controls the "peakiness" of the distribution. The probability density function (PDF) of a generalized Gaussian is described by

$$p(y; \alpha) = \frac{\alpha}{2\lambda\, \Gamma\!\left(\frac{1}{\alpha}\right)}\, e^{-\left|\frac{y}{\lambda}\right|^{\alpha}} \qquad (49)$$

where $\Gamma(x)$ is the Gamma function given by

$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt \qquad (50)$$

Note that if $\alpha = 1$, the distribution becomes the standard Laplacian distribution. If $\alpha = 2$, the distribution is a standard normal distribution (see > Fig. 2).
For a generalized Gaussian distribution, the kurtosis can be expressed in terms of the Gaussian exponent, given by

$$\kappa_{\alpha} = \frac{\Gamma\!\left(\frac{5}{\alpha}\right)\Gamma\!\left(\frac{1}{\alpha}\right)}{\Gamma^{2}\!\left(\frac{3}{\alpha}\right)} - 3 \qquad (51)$$
Fig. 2
The generalized Gaussian distribution is plotted for several different values of the Gaussian exponent, α = 0.8, 1, 2, 4.

Fig. 3
The plot of kurtosis κ_α versus Gaussian exponent α: (a) for a leptokurtic signal; (b) for a platykurtic signal.
The plots of kurtosis $\kappa_{\alpha}$ versus the Gaussian exponent $\alpha$ for leptokurtic and platykurtic signals are shown in > Fig. 3.
From the parameterized generalized Gaussian density model, the nonlinear function in the algorithm (> Eq. 47) is given by

$$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{dy_i} = |y_i|^{\alpha_i - 1}\, \mathrm{sgn}(y_i) \qquad (52)$$

where $\mathrm{sgn}(y_i)$ is the signum function of $y_i$.

Note that for $\alpha_i = 1$, $\varphi_i(y_i)$ in (> Eq. 52) becomes a signum function (which can also be derived from the Laplacian density model for sources). The signum nonlinear function is favorable for the separation of speech signals, since natural speech is often modeled by a Laplacian distribution. Note also that for $\alpha_i = 4$, $\varphi_i(y_i)$ in (> Eq. 52) becomes a cubic function, which is known to be a good choice for sub-Gaussian sources.
In order to select a proper value of the Gaussian exponent $\alpha_i$, we estimate the kurtosis of the output signal $y_i$ and select the corresponding $\alpha_i$ from the relationship in > Fig. 3. The kurtosis of $y_i$, $\kappa_i$, can be estimated via the following iterative algorithm:

$$\kappa_i(t+1) = \frac{M_{4i}(t+1)}{M_{2i}^{2}(t+1)} - 3 \qquad (53)$$

where

$$M_{4i}(t+1) = (1 - \delta) M_{4i}(t) + \delta\, |y_i(t)|^{4} \qquad (54)$$

$$M_{2i}(t+1) = (1 - \delta) M_{2i}(t) + \delta\, |y_i(t)|^{2} \qquad (55)$$

where $\delta$ is a small constant, say 0.01.
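The following sketch combines the running moment recursion (> Eqs. 53–55) with the generalized Gaussian score (> Eq. 52); the rule mapping the estimated kurtosis to a Gaussian exponent is a deliberately crude stand-in for the lookup suggested by > Fig. 3, and the names are our own.

# Flexible-ICA ingredients: online kurtosis tracking and the GG nonlinearity.
import numpy as np

def update_kurtosis(M4, M2, y, delta=0.01):
    """One step of the running moment recursion (Eqs. 53-55) for one output y."""
    M4 = (1 - delta) * M4 + delta * abs(y) ** 4
    M2 = (1 - delta) * M2 + delta * abs(y) ** 2
    kappa = M4 / (M2 ** 2) - 3.0
    return kappa, M4, M2

def flexible_phi(y, kappa):
    """Generalized Gaussian score (Eq. 52) with alpha chosen from the sign of kappa."""
    alpha = 1.0 if kappa > 0 else 4.0      # super-Gaussian -> signum, sub-Gaussian -> cubic
    return np.abs(y) ** (alpha - 1.0) * np.sign(y)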
In general, the estimated kurtosis of the demixing filter output does not exactly match
the kurtosis of the original source. However, it provides an idea about whether the estimated
source is a sub-Gaussian signal or a super-Gaussian signal. Moreover, it was shown (Cardoso
1997; Amari and Cardoso 1997) that the performance of the source separation is not degraded
even if the hypothesized density does not match the true density. For these reasons, we suggest
a practical method where only several different forms of nonlinear functions are used.
6 Differential ICA
In a wide sense, most ICA algorithms based on unsupervised learning belong to the Hebb-type rule or its generalizations that adopt nonlinear functions. Motivated by the differential Hebb rule (Kosko 1986) and differential decorrelation (Choi 2002, 2003), we introduce an ICA algorithm employing differential learning and the natural gradient, which leads to a differential ICA algorithm. A random walk model is first introduced for latent variables, in order to show that differential learning can be interpreted as maximum likelihood estimation of a linear generative model. Then the detailed derivation of the differential ICA algorithm is presented.
6.1 Random Walk Model for Latent Variables
Given a set of observation data, {x(t)}, the task of learning the linear generative model
(>Eq. 1) under the constraint of the latent variables being statistically independent, is a
semiparametric estimation problem. The maximum likelihood estimation of basis vectors {ai}
involves a probabilistic model for latent variables that are treated as nuisance parameters.
In order to show a link between differential learning and maximum likelihood estimation, a random walk model for latent variables $s_i(t)$ is considered, which is a simple Markov chain, that is,

$$s_i(t) = s_i(t-1) + \epsilon_i(t) \qquad (56)$$

where the innovation $\epsilon_i(t)$ is assumed to have zero mean with a density function $q_i(\epsilon_i(t))$. In addition, the innovation sequences $\{\epsilon_i(t)\}$ are assumed to be mutually independent white sequences, that is, they are spatially independent and temporally white.
Let us consider latent variables $s_i(t)$ over an $N$-point time block. The vector $\mathbf{s}_i$ can be defined as

$$\mathbf{s}_i = [s_i(0), \ldots, s_i(N-1)]^{\top} \qquad (57)$$

Then the joint PDF of $\mathbf{s}_i$ can be written as

$$p_i(\mathbf{s}_i) = p_i(s_i(0), \ldots, s_i(N-1)) = \prod_{t=0}^{N-1} p_i(s_i(t)\,|\,s_i(t-1)) \qquad (58)$$

where $s_i(t) = 0$ for $t < 0$ and the statistical independence of the innovation sequences is taken into account.

It follows from the random walk model (> Eq. 56) that the conditional probability density of $s_i(t)$ given its past samples can be written as

$$p_i(s_i(t)\,|\,s_i(t-1)) = q_i(\epsilon_i(t)) \qquad (59)$$
Combining (> Eq. 58) and (> Eq. 59) leads to

$$p_i(\mathbf{s}_i) = \prod_{t=0}^{N-1} q_i(\epsilon_i(t)) = \prod_{t=0}^{N-1} q_i\!\left(s_i'(t)\right) \qquad (60)$$

where $s_i'(t) = s_i(t) - s_i(t-1)$, which is the first-order approximation of the differentiation.

Taking the statistical independence of the latent variables and (> Eq. 60) into account, the joint density $p(\mathbf{s}_1, \ldots, \mathbf{s}_n)$ can be written as

$$p(\mathbf{s}_1, \ldots, \mathbf{s}_n) = \prod_{i=1}^{n} p_i(\mathbf{s}_i) = \prod_{t=0}^{N-1} \prod_{i=1}^{n} q_i\!\left(s_i'(t)\right) \qquad (61)$$

The factorial model given in (> Eq. 61) will be used as an optimization criterion in deriving the differential ICA algorithm.
6.2 Algorithm
Denote a set of observation data by

$$\mathbf{x} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \qquad (62)$$

where

$$\mathbf{x}_i = [x_i(0), \ldots, x_i(N-1)]^{\top} \qquad (63)$$

Then the normalized log-likelihood is given by

$$\frac{1}{N} \log p(\mathbf{x}\,|\,A) = -\log|\det A| + \frac{1}{N} \log p(\mathbf{s}_1, \ldots, \mathbf{s}_n) = -\log|\det A| + \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i\!\left(s_i'(t)\right) \qquad (64)$$

The inverse of $A$ is denoted by $W = A^{-1}$. The estimate of the latent variables is denoted by $\mathbf{y}(t) = W\mathbf{x}(t)$. With these variables, the objective function (i.e., the negative normalized log-likelihood) is given by

$$\mathcal{J}_{\mathrm{di}} = -\frac{1}{N} \log p(\mathbf{x}\,|\,A) = -\log|\det W| - \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \qquad (65)$$

where $\mathbf{s}_i$ is replaced by its estimate $\mathbf{y}_i$ and $y_i'(t) = y_i(t) - y_i(t-1)$ (the first-order approximation of the differentiation).

For online learning, the sample average is replaced by the instantaneous value. Hence the online version of the objective function (> Eq. 65) is given by

$$\mathcal{J}_{\mathrm{di}} = -\log|\det W| - \sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \qquad (66)$$
Note that objective function (>Eq. 66) is slightly different from (> Eq. 35) used in the
conventional ICA based on the minimization of mutual information or the maximum
likelihood estimation.
We derive a natural gradient learning algorithm that finds a minimum of (> Eq. 66). To this end, we follow the approach discussed in Amari et al. (1997), Amari (1998), and Choi et al. (2000). The total differential $d\mathcal{J}_{\mathrm{di}}(W)$ due to the change $dW$ is calculated as

$$d\mathcal{J}_{\mathrm{di}} = \mathcal{J}_{\mathrm{di}}(W + dW) - \mathcal{J}_{\mathrm{di}}(W) = d\{-\log|\det W|\} + d\left\{ -\sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \right\} \qquad (67)$$

Define

$$\varphi_i(y_i') = -\frac{d \log q_i(y_i')}{dy_i'} \qquad (68)$$

and construct a vector $\varphi(\mathbf{y}') = [\varphi_1(y_1'), \ldots, \varphi_n(y_n')]^{\top}$. With this definition, we have

$$d\left\{ -\sum_{i=1}^{n} \log q_i\!\left(y_i'(t)\right) \right\} = \sum_{i=1}^{n} \varphi_i\!\left(y_i'(t)\right) dy_i'(t) = \varphi^{\top}(\mathbf{y}'(t))\, d\mathbf{y}'(t) \qquad (69)$$

One can easily see that

$$d\{-\log|\det W|\} = -\mathrm{tr}\{dW\, W^{-1}\} \qquad (70)$$

Define a modified differential matrix $dV$ by

$$dV = dW\, W^{-1} \qquad (71)$$

Then, with this modified differential matrix, the total differential $d\mathcal{J}_{\mathrm{di}}(W)$ is computed as

$$d\mathcal{J}_{\mathrm{di}} = -\mathrm{tr}\{dV\} + \varphi^{\top}(\mathbf{y}'(t))\, dV\, \mathbf{y}'(t) \qquad (72)$$

A gradient descent learning algorithm for updating $V$ is given by

$$V(t+1) = V(t) - \eta_t \frac{d\mathcal{J}_{\mathrm{di}}}{dV} = \eta_t \left\{ I - \varphi(\mathbf{y}'(t))\, \mathbf{y}'^{\top}(t) \right\} \qquad (73)$$

Hence, it follows from the relation (> Eq. 71) that the updating rule for $W$ has the form

$$W(t+1) = W(t) + \eta_t \left\{ I - \varphi(\mathbf{y}'(t))\, \mathbf{y}'^{\top}(t) \right\} W(t) \qquad (74)$$
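A sketch of the resulting online update (> Eq. 74) is given below, assuming $\varphi(u) = \tanh(u)$ as the score of the (unspecified) innovation density $q_i$; the variable names are our own.

# Differential ICA (Eq. 74) operating on first-order differences of the outputs.
import numpy as np

def differential_ica(X, eta=0.005, n_epochs=50):
    """X: n x N matrix of mixtures with temporal structure. Returns W."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        for t in range(1, N):
            dy = W @ (X[:, t] - X[:, t - 1])   # y'(t) = y(t) - y(t-1)
            phi = np.tanh(dy)
            W += eta * (np.eye(n) - np.outer(phi, dy)) @ W   # Eq. 74
    return W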
7 Nonstationary Source Separation
So far, we assumed that sources are stationary random processes where the statistics do not
vary over time. In this section, we show how the natural gradient ICA algorithm is modified to
handle nonstationary sources. As in Matsuoka et al. (1995), the following assumptions are
made in this section.
AS1 The mixing matrix $A$ has full column rank.
AS2 Source signals $\{s_i(t)\}$ are statistically independent with zero mean. This implies that the covariance matrix of the source signal vector, $R_s(t) = E\{\mathbf{s}(t)\mathbf{s}^{\top}(t)\}$, is a diagonal matrix, that is,

$$R_s(t) = \mathrm{diag}\{r_1(t), \ldots, r_n(t)\} \qquad (75)$$

where $r_i(t) = E\{s_i^2(t)\}$ and $E$ denotes the statistical expectation operator.
AS3 The ratios $r_i(t)/r_j(t)$ ($i, j = 1, \ldots, n$ and $i \neq j$) are not constant with time.
It should be pointed out that the first two assumptions (AS1, AS2) are common in most existing approaches to source separation; however, the third assumption (AS3) is critical in this chapter. For nonstationary sources, the third assumption is satisfied, and it allows one to separate linear mixtures of sources via second-order statistics (SOS).
For stationary source separation, the typical cost function is based on the mutual infor-
mation, which requires knowledge of the underlying distributions of the sources. Since the
probability distributions of the sources are not known in advance, most ICA algorithms rely
on hypothesized distributions (e.g., see Choi et al. 2000 and references therein). Higher-order
statistics (HOS) should be incorporated either explicitly or implicitly.
For nonstationary sources, Matsuoka et al. have shown that the decomposition (> Eq. 6) is achieved if the cross-correlations $E\{y_i(t) y_j(t)\}$ ($i, j = 1, \ldots, n$, $i \neq j$) are zero at any time instant $t$, provided that the assumptions (AS1)–(AS3) are satisfied. To eliminate the cross-correlations, the following cost function was proposed in Matsuoka et al. (1995),

$$\mathcal{J}(W) = \frac{1}{2} \left\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} - \log \det\!\left( E\{\mathbf{y}(t)\mathbf{y}^{\top}(t)\} \right) \right\} \qquad (76)$$
where $\det(\cdot)$ denotes the determinant of a matrix. The cost function given in (> Eq. 76) is a nonnegative function, which takes its minima if and only if $E\{y_i(t) y_j(t)\} = 0$ for $i, j = 1, \ldots, n$, $i \neq j$. This is a direct consequence of Hadamard's inequality, which is summarized below.

Theorem 2 (Hadamard's inequality) Suppose $K = [k_{ij}]$ is a nonnegative definite symmetric $n \times n$ matrix. Then,

$$\det(K) \leq \prod_{i=1}^{n} k_{ii} \qquad (77)$$

with equality iff $k_{ij} = 0$ for $i \neq j$.

Taking the logarithm of both sides of (> Eq. 77) gives

$$\sum_{i=1}^{n} \log k_{ii} - \log \det(K) \geq 0 \qquad (78)$$

Replacing the matrix $K$ by $E\{\mathbf{y}(t)\mathbf{y}^{\top}(t)\}$, one can easily see that the cost function (> Eq. 76) attains its minima iff $E\{y_i(t) y_j(t)\} = 0$ for $i, j = 1, \ldots, n$ and $i \neq j$.
We compute

$$d\left\{ \log \det\!\left( E\{\mathbf{y}(t)\mathbf{y}^{\top}(t)\} \right) \right\} = 2\, d\{\log \det W\} + d\{\log \det C(t)\} = 2\, \mathrm{tr}\{W^{-1} dW\} + d\{\log \det C(t)\} \qquad (79)$$

where $C(t) = E\{\mathbf{x}(t)\mathbf{x}^{\top}(t)\}$. Define a modified differential matrix $dV$ as

$$dV = W^{-1}\, dW \qquad (80)$$

Then, we have

$$d\left\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} \right\} = 2\, E\{\mathbf{y}^{\top}(t)\, \Lambda^{-1}(t)\, dV\, \mathbf{y}(t)\} \qquad (81)$$

where $\Lambda(t) = \mathrm{diag}\{E\{y_1^2(t)\}, \ldots, E\{y_n^2(t)\}\}$. Proceeding as in > Sect. 4, we can derive the learning algorithm for $W$ of the form

$$\Delta W(t) = \eta_t \left\{ I - \Lambda^{-1}(t)\, \mathbf{y}(t)\mathbf{y}^{\top}(t) \right\} W(t) = \eta_t\, \Lambda^{-1}(t) \left\{ \Lambda(t) - \mathbf{y}(t)\mathbf{y}^{\top}(t) \right\} W(t) \qquad (82)$$
8 Spatial, Temporal, and Spatiotemporal ICA
ICA decomposition, $X = AS$, inherently has a duality. Considering the data matrix $X \in \mathbb{R}^{m \times N}$, where each of its rows is assumed to be a time course of an attribute, ICA decomposition produces $n$ independent time courses. On the other hand, regarding the data matrix in the form of $X^{\top}$, ICA decomposition leads to $n$ independent patterns (for instance, images in fMRI or arrays in DNA microarray data).

The standard ICA (where $X$ is considered) is treated as temporal ICA (tICA). Its dual decomposition (regarding $X^{\top}$) is known as spatial ICA (sICA). Combining these two ideas leads to spatiotemporal ICA (stICA). These variations of ICA were first investigated in Stone et al. (2000). Spatial ICA and spatiotemporal ICA were shown to be useful in fMRI image analysis (Stone et al. 2000) and gene expression data analysis (Liebermeister 2002; Kim and Choi 2005).
Suppose that the singular value decomposition (SVD) of $X$ is given by

$$X = U D V^{\top} = \left(U D^{1/2}\right)\left(V D^{1/2}\right)^{\top} = \widetilde{U}\widetilde{V}^{\top} \qquad (83)$$

where $U \in \mathbb{R}^{m \times n}$, $D \in \mathbb{R}^{n \times n}$, and $V \in \mathbb{R}^{N \times n}$ for $n \leq \min(m, N)$.
8.1 Temporal ICA
Temporal ICA finds a set of independent time courses and a corresponding set of dual unconstrained spatial patterns. It embodies the assumption that each row vector in $\widetilde{V}^{\top}$ consists of a linear combination of $n$ independent sequences, that is, $\widetilde{V}^{\top} = \widetilde{A}_T S_T$, where $S_T \in \mathbb{R}^{n \times N}$ has a set of $n$ independent temporal sequences of length $N$, and $\widetilde{A}_T \in \mathbb{R}^{n \times n}$ is an associated mixing matrix.

Unmixing by $Y_T = W_T \widetilde{V}^{\top}$, where $W_T = P\widetilde{A}_T^{-1}$, leads one to recover the $n$ dual patterns $A_T$ associated with the $n$ independent time courses by calculating $A_T = \widetilde{U} W_T^{-1}$, which is a consequence of $\widetilde{X} = A_T Y_T = \widetilde{U}\widetilde{V}^{\top} = \widetilde{U} W_T^{-1} Y_T$.
8.2 Spatial ICA
Spatial ICA seeks a set of independent spatial patterns $S_S$ and a corresponding set of dual unconstrained time courses $A_S$. It embodies the assumption that each row vector in $\widetilde{U}^{\top}$ is composed of a linear combination of $n$ independent spatial patterns, that is, $\widetilde{U}^{\top} = \widetilde{A}_S S_S$, where $S_S \in \mathbb{R}^{n \times m}$ contains a set of $n$ independent $m$-dimensional patterns and $\widetilde{A}_S \in \mathbb{R}^{n \times n}$ is an encoding variable matrix (mixing matrix).

Define $Y_S = W_S \widetilde{U}^{\top}$, where $W_S$ is a permuted version of $\widetilde{A}_S^{-1}$. With this definition, the $n$ dual time courses $A_S \in \mathbb{R}^{N \times n}$ associated with the $n$ independent patterns are computed as $A_S = \widetilde{V} W_S^{-1}$, since $\widetilde{X}^{\top} = A_S Y_S = \widetilde{V}\widetilde{U}^{\top} = \widetilde{V} W_S^{-1} Y_S$. Each column vector of $A_S$ corresponds to a temporal mode.
8.3 Spatiotemporal ICA
In a linear decomposition, sICA enforces independence constraints over space to find a set of independent spatial patterns, whereas tICA embodies independence constraints over time to seek a set of independent time courses. Spatiotemporal ICA finds a linear decomposition by maximizing the degree of independence over space as well as over time, without necessarily producing independence in either space or time. In fact, it allows a trade-off between the independence of arrays and the independence of time courses.
Given $\widetilde{X} = \widetilde{U}\widetilde{V}^{\top}$, stICA finds the following decomposition:

$$\widetilde{X} = S_S^{\top} \Lambda S_T \qquad (84)$$

where $S_S \in \mathbb{R}^{n \times m}$ contains a set of $n$ independent $m$-dimensional patterns, $S_T \in \mathbb{R}^{n \times N}$ has a set of $n$ independent temporal sequences of length $N$, and $\Lambda$ is a diagonal scaling matrix. There exist two $n \times n$ mixing matrices, $W_S$ and $W_T$, such that $S_S = W_S \widetilde{U}^{\top}$ and $S_T = W_T \widetilde{V}^{\top}$. The following relation

$$\widetilde{X} = S_S^{\top} \Lambda S_T = \widetilde{U} W_S^{\top} \Lambda W_T \widetilde{V}^{\top} = \widetilde{U}\widetilde{V}^{\top} \qquad (85)$$

implies that $W_S^{\top} \Lambda W_T = I$, which leads to

$$W_T = W_S^{-\top} \Lambda^{-1} \qquad (86)$$

The linear transforms $W_S$ and $W_T$ are found by jointly optimizing objective functions associated with sICA and tICA. That is, the objective function for stICA has the form

$$\mathcal{J}_{\mathrm{stICA}} = \alpha\, \mathcal{J}_{\mathrm{sICA}} + (1 - \alpha)\, \mathcal{J}_{\mathrm{tICA}} \qquad (87)$$

where $\mathcal{J}_{\mathrm{sICA}}$ and $\mathcal{J}_{\mathrm{tICA}}$ could be Infomax criteria or log-likelihood functions and $\alpha$ defines the relative weighting for spatial independence and temporal independence. More details on stICA can be found in Stone et al. (2002).
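The shared SVD preprocessing (> Eq. 83) behind tICA, sICA, and stICA can be sketched as follows; reusing the natural gradient routine sketched in > Sect. 4 as the ICA subroutine is our own choice for illustration.

# Reduced SVD factors U_tilde, V_tilde such that X ≈ U_tilde @ V_tilde.T (Eq. 83).
import numpy as np

def svd_factors(X, n):
    """Return U_tilde (m x n) and V_tilde (N x n)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    D_half = np.sqrt(d[:n])
    return U[:, :n] * D_half, Vt[:n, :].T * D_half

# Temporal ICA: apply ICA to the rows of V_tilde.T (time courses),
#   e.g. W_T = natural_gradient_ica(V_tilde.T); S_T = W_T @ V_tilde.T
# Spatial ICA:  apply ICA to the rows of U_tilde.T (spatial patterns).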
9 Algebraic Methods for BSS
Up to now, online ICA algorithms in a framework of unsupervised learning have been
introduced. In this section, several algebraic methods are explained for BSS where matrix
decomposition plays a critical role.
9.1 Fundamental Principle of Algebraic BSS
Algebraic methods for BSS often make use of the eigen-decomposition of correlation matrices
or cumulant matrices. Exemplary algebraic methods for BSS include FOBI (Cardoso 1989),
AMUSE (Tong et al. 1990), JADE (Cardoso and Souloumiac 1993), SOBI (Belouchrani et al.
1997), and SEONS (Choi et al. 2002). Some of these methods (FOBI and AMUSE) are based
on simultaneous diagonalization of two symmetric matrices. Methods such as JADE, SOBI,
and SEONS make use of joint approximate diagonalization of multiple matrices (more than
two). The following theorem provides a fundamental principle for algebraic BSS, justifying
why simultaneous diagonalization of two symmetric data matrices (one of them is assumed to
be positive definite) provides a solution to BSS.
Theorem 3 Let $\Lambda_1, D_1 \in \mathbb{R}^{n \times n}$ be diagonal matrices with positive diagonal entries and $\Lambda_2, D_2 \in \mathbb{R}^{n \times n}$ be diagonal matrices with nonzero diagonal entries. Suppose that $G \in \mathbb{R}^{n \times n}$ satisfies the following decompositions:

$$D_1 = G \Lambda_1 G^{\top} \qquad (88)$$

$$D_2 = G \Lambda_2 G^{\top} \qquad (89)$$

Then the matrix $G$ is a generalized permutation matrix, i.e., $G = P\Lambda$, if $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ have distinct diagonal entries.
Proof It follows from (> Eq. 88) that there exists an orthogonal matrix $Q$ such that

$$G \Lambda_1^{\frac{1}{2}} = D_1^{\frac{1}{2}} Q \qquad (90)$$

Hence,

$$G = D_1^{\frac{1}{2}}\, Q\, \Lambda_1^{-\frac{1}{2}} \qquad (91)$$

Substituting (> Eq. 91) into (> Eq. 89) gives

$$D_1^{-1} D_2 = Q\, \Lambda_1^{-1} \Lambda_2\, Q^{\top} \qquad (92)$$

Since the right-hand side of (> Eq. 92) is the eigen-decomposition of the left-hand side of (> Eq. 92), the diagonal elements of $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ are the same. From the assumption that these diagonal elements are distinct, the orthogonal matrix $Q$ must have the form $Q = PC$, where $C$ is a diagonal matrix whose diagonal elements are either $+1$ or $-1$. Hence, we have

$$G = D_1^{\frac{1}{2}} P C \Lambda_1^{-\frac{1}{2}} = P P^{\top} D_1^{\frac{1}{2}} P C \Lambda_1^{-\frac{1}{2}} = P\Lambda \qquad (93)$$

where

$$\Lambda = P^{\top} D_1^{\frac{1}{2}} P C \Lambda_1^{-\frac{1}{2}}$$

which completes the proof.
9.2 AMUSE
As an example of > Theorem 3, we briefly explain AMUSE (Tong et al. 1990), where a BSS solution is determined by simultaneously diagonalizing the equal-time correlation matrix of $\mathbf{x}(t)$ and a time-delayed correlation matrix of $\mathbf{x}(t)$.

It is assumed that sources $\{s_i(t)\}$ (entries of $\mathbf{s}(t)$) are uncorrelated stochastic processes with zero mean, that is,

$$E\{s_i(t)\, s_j(t - \tau)\} = \delta_{ij}\, \gamma_i(\tau) \qquad (94)$$

where $\delta_{ij}$ is the Kronecker delta and the $\gamma_i(\tau)$ are distinct for $i = 1, \ldots, n$, given $\tau$. In other words, the equal-time correlation matrix of the source, $R_{ss}(0) = E\{\mathbf{s}(t)\mathbf{s}^{\top}(t)\}$, is a diagonal matrix with distinct diagonal entries. Moreover, a time-delayed correlation matrix of the source, $R_{ss}(\tau) = E\{\mathbf{s}(t)\mathbf{s}^{\top}(t - \tau)\}$, is diagonal as well, with distinct nonzero diagonal entries.

It follows from (> Eq. 2) that the correlation matrices of the observation vector $\mathbf{x}(t)$ satisfy

$$R_{xx}(0) = A R_{ss}(0) A^{\top} \qquad (95)$$

$$R_{xx}(\tau) = A R_{ss}(\tau) A^{\top} \qquad (96)$$

for some nonzero time-lag $\tau$, and both $R_{ss}(0)$ and $R_{ss}(\tau)$ are diagonal matrices since the sources are assumed to be spatially uncorrelated.

Invoking > Theorem 3, one can easily see that the inverse of the mixing matrix, $A^{-1}$, can be identified up to a rescaled and permuted version by the simultaneous diagonalization of $R_{xx}(0)$ and $R_{xx}(\tau)$, provided that $R_{ss}^{-1}(0) R_{ss}(\tau)$ has distinct diagonal elements. In other words, a linear transformation $W$ is determined such that $R_{yy}(0)$ and $R_{yy}(\tau)$ of the output $\mathbf{y}(t) = W\mathbf{x}(t)$ are simultaneously diagonalized:

$$R_{yy}(0) = (WA) R_{ss}(0) (WA)^{\top}$$

$$R_{yy}(\tau) = (WA) R_{ss}(\tau) (WA)^{\top}$$

It follows from > Theorem 3 that $WA$ becomes the transparent transformation.
9.3 Simultaneous Diagonalization
We explain how two symmetric matrices are simultaneously diagonalized by a linear transformation. More details on simultaneous diagonalization can be found in Fukunaga (1990). Simultaneous diagonalization consists of two steps (whitening followed by a unitary transformation):

1. First, the matrix $R_{xx}(0)$ is whitened by

$$\mathbf{z}(t) = D_1^{-\frac{1}{2}} U_1^{\top} \mathbf{x}(t) \qquad (97)$$

where $D_1$ and $U_1$ are the eigenvalue and eigenvector matrices of $R_{xx}(0)$,

$$R_{xx}(0) = U_1 D_1 U_1^{\top} \qquad (98)$$

Then, we have

$$R_{zz}(0) = D_1^{-\frac{1}{2}} U_1^{\top} R_{xx}(0) U_1 D_1^{-\frac{1}{2}} = I_m$$

$$R_{zz}(\tau) = D_1^{-\frac{1}{2}} U_1^{\top} R_{xx}(\tau) U_1 D_1^{-\frac{1}{2}}$$

2. Second, a unitary transformation is applied to diagonalize the matrix $R_{zz}(\tau)$. The eigen-decomposition of $R_{zz}(\tau)$ has the form

$$R_{zz}(\tau) = U_2 D_2 U_2^{\top} \qquad (99)$$

Then, $\mathbf{y}(t) = U_2^{\top} \mathbf{z}(t)$ satisfies

$$R_{yy}(0) = U_2^{\top} R_{zz}(0) U_2 = I_m$$

$$R_{yy}(\tau) = U_2^{\top} R_{zz}(\tau) U_2 = D_2$$

Thus, both matrices $R_{xx}(0)$ and $R_{xx}(\tau)$ are simultaneously diagonalized by the linear transform $W = U_2^{\top} D_1^{-\frac{1}{2}} U_1^{\top}$. It follows from > Theorem 3 that $W = U_2^{\top} D_1^{-\frac{1}{2}} U_1^{\top}$ is a valid demixing matrix if all the diagonal elements of $D_2$ are distinct.
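An AMUSE-style sketch of this two-step procedure is given below, assuming zero-mean observations and a user-chosen lag τ; symmetrizing the lagged correlation matrix is a standard practical safeguard rather than part of the derivation, and the names are our own.

# Two-step simultaneous diagonalization of Rxx(0) and Rxx(tau).
import numpy as np

def amuse(X, tau=1):
    """X: n x N zero-mean mixtures. Returns W = U2^T D1^{-1/2} U1^T."""
    n, N = X.shape
    Rxx0 = X @ X.T / N
    d1, U1 = np.linalg.eigh(Rxx0)
    white = np.diag(d1 ** -0.5) @ U1.T           # D1^{-1/2} U1^T (whitening)
    Z = white @ X
    Rzz_tau = Z[:, tau:] @ Z[:, :-tau].T / (N - tau)
    Rzz_tau = 0.5 * (Rzz_tau + Rzz_tau.T)        # symmetrize the lagged correlation
    _, U2 = np.linalg.eigh(Rzz_tau)
    return U2.T @ white                          # valid if the eigenvalues are distinct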
9.4 Generalized Eigenvalue Problem
The simultaneous diagonalization of two symmetric matrices can be carried out without going through the two-step procedure. From the discussion in > Sect. 9.3, we have

$$W R_{xx}(0) W^{\top} = I_n \qquad (100)$$

$$W R_{xx}(\tau) W^{\top} = D_2 \qquad (101)$$

The linear transformation $W$ which satisfies (> Eq. 100) and (> Eq. 101) is the eigenvector matrix of $R_{xx}^{-1}(0) R_{xx}(\tau)$ (Fukunaga 1990). In other words, the matrix $W$ is the generalized eigenvector matrix of the pencil $R_{xx}(\tau) - \lambda R_{xx}(0)$ (Molgedey and Schuster 1994).
Recently, Chang et al. (2000) proposed the matrix pencil method for BSS, where they exploited $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ for $\tau_1 \neq \tau_2 \neq 0$. Since the noise vector was assumed to be temporally white, the two matrices $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ are not theoretically affected by the noise vector, that is,

$$R_{xx}(\tau_1) = A R_{ss}(\tau_1) A^{\top} \qquad (102)$$

$$R_{xx}(\tau_2) = A R_{ss}(\tau_2) A^{\top} \qquad (103)$$

Thus, it is obvious that we can find an estimate of the demixing matrix that is not sensitive to white noise. A similar idea was also exploited in Choi and Cichocki (2000a, b).

In general, the generalized eigenvalue decomposition requires a symmetric-definite pencil (one matrix is symmetric and the other is symmetric and positive definite). However, $R_{xx}(\tau_2) - \lambda R_{xx}(\tau_1)$ is not symmetric-definite, which might cause a numerical instability problem that results in complex-valued eigenvectors.
The set of all matrices of the form $R_1 - \lambda R_2$ with $\lambda \in \mathbb{R}$ is said to be a pencil. Frequently we encounter the case where $R_1$ is symmetric and $R_2$ is symmetric and positive definite. Pencils of this variety are referred to as symmetric-definite pencils (Golub and Loan 1993).

Theorem 4 (Golub and Loan 1993, p. 468) If $R_1 - \lambda R_2$ is symmetric-definite, then there exists a nonsingular matrix $U = [\mathbf{u}_1, \ldots, \mathbf{u}_n]$ such that

$$U^{\top} R_1 U = \mathrm{diag}\{\gamma_1(\tau_1), \ldots, \gamma_n(\tau_1)\} \qquad (104)$$

$$U^{\top} R_2 U = \mathrm{diag}\{\gamma_1(\tau_2), \ldots, \gamma_n(\tau_2)\} \qquad (105)$$

Moreover, $R_1 \mathbf{u}_i = \lambda_i R_2 \mathbf{u}_i$ for $i = 1, \ldots, n$, and $\lambda_i = \frac{\gamma_i(\tau_1)}{\gamma_i(\tau_2)}$.

It is apparent from > Theorem 4 that $R_1$ should be symmetric and $R_2$ should be symmetric and positive definite so that the generalized eigenvector matrix $U$ is a valid solution, provided that the $\{\lambda_i\}$ are distinct.
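Equivalently, the one-shot solution via the generalized eigenvalue problem $R_{xx}(\tau)\mathbf{u} = \lambda R_{xx}(0)\mathbf{u}$ can be sketched with SciPy's symmetric-definite solver; the lag choice and the symmetrization below are our own practical choices.

# Blind separation via a generalized eigenvalue (matrix pencil) problem.
import numpy as np
from scipy.linalg import eigh

def geval_bss(X, tau=1):
    """X: n x N zero-mean mixtures. Rows of the returned W are generalized eigenvectors."""
    n, N = X.shape
    R0 = X @ X.T / N
    Rtau = X[:, tau:] @ X[:, :-tau].T / (N - tau)
    Rtau = 0.5 * (Rtau + Rtau.T)
    _, U = eigh(Rtau, R0)         # solves Rtau u = lambda R0 u (R0 positive definite)
    return U.T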
10 Software
A variety of ICA software is available. ICA Central (URL: http://www.tsi.enst.fr/icacentral/) was created in 1999 to promote research on ICA and BSS by means of public mailing lists, a repository of data sets, a repository of ICA/BSS algorithms, and so on. ICA Central might be the first place where you can find data sets and ICA algorithms. In addition, here are some widely used software packages.
• ICALAB Toolboxes (http://www.bsp.brain.riken.go.jp/ICALAB/): ICALAB is an ICA Matlab software toolbox developed in the Laboratory for Advanced Brain Signal Processing in the RIKEN Brain Science Institute, Japan. It consists of two independent packages, ICALAB for signal processing and ICALAB for image processing, and each package contains a variety of algorithms.
• FastICA (http://www.cis.hut.fi/projects/ica/fastica/): This is the FastICA Matlab package that implements fast fixed-point algorithms for non-Gaussianity maximization (Hyvarinen et al. 2001). It was developed at the Helsinki University of Technology, Finland, and implementations for other environments (R, C++, Python) are also available.
• Infomax ICA (http://www.cnl.salk.edu/~tewon/ica_cnl.html): Matlab and C codes for Bell and Sejnowski's Infomax algorithm (Bell and Sejnowski 1995) and extended Infomax (Lee 1998), where a parametric density model is incorporated into Infomax to handle both super-Gaussian and sub-Gaussian sources.
• EEGLAB (http://sccn.ucsd.edu/eeglab/): EEGLAB is an interactive Matlab toolbox for processing continuous and event-related EEG, MEG, and other electrophysiological data using ICA, time/frequency analysis, artifact rejection, and several modes of data visualization.
• ICA: DTU Toolbox (http://isp.imm.dtu.dk/toolbox/ica/): ICA: DTU Toolbox is a collection of ICA algorithms that includes: (1) icaML, an efficient implementation of Infomax; (2) icaMF, an iterative algorithm that offers a variety of possible source priors and mixing matrix constraints (e.g., positivity) and can also handle over- and under-complete mixing; and (3) icaMS, a "one shot" fast algorithm that requires time correlation between samples.
11 Further Issues
• Overcomplete representation: Overcomplete representation allows the latent space dimension n to be greater than the data dimension m in the linear model (> Eq. 1). Sparseness constraints on latent variables are necessary to learn a fruitful representation (Lewicki and Sejnowski 2000).
• Bayesian ICA: Bayesian ICA incorporates uncertainty and prior distributions of latent variables in the model (> Eq. 1). Independent factor analysis (Attias 1999) is pioneering work along this direction. The EM algorithm for ICA was developed in Welling and Weber (2001), and a full Bayesian ICA (also known as ensemble learning) was developed in Miskin and MacKay (2001).
• Kernel ICA: Kernel methods were introduced to consider statistical independence in a reproducing kernel Hilbert space (Bach and Jordan 2002), yielding kernel ICA.
• Nonnegative ICA: Nonnegativity constraints were imposed on latent variables, yielding nonnegative ICA (Plumbley 2003). Rectified Gaussian priors can also be used in Bayesian ICA to handle nonnegative latent variables.
• Sparseness: Sparseness is another important characteristic of sources, besides independence. Sparse component analysis is studied in Li et al. (2006).
• Beyond ICA: Independent subspace analysis (Hyvarinen and Hoyer 2000) and tree-dependent component analysis (Bach and Jordan 2003) generalize ICA, allowing intra-dependence structures in feature subspaces or clusters.
12 Summary
ICA has been successfully applied in various applications of machine learning, pattern recognition, and signal processing. A brief overview of ICA has been presented, starting from fundamental principles for learning a linear latent variable model for parsimonious representation. Natural gradient ICA algorithms were derived in the framework of maximum likelihood estimation, mutual information minimization, Infomax, and negentropy maximization. We have explained flexible ICA, where a generalized Gaussian density is adopted such that a flexible nonlinear function is incorporated into the natural gradient ICA algorithm. Equivariant nonstationary source separation was presented in the framework of the natural gradient as well. Differential learning was also adopted to incorporate the temporal structure of sources. We also presented a core idea and various methods for algebraic source separation. Various software packages for ICA were introduced, for easy application of ICA. Further issues were also briefly mentioned so that readers can follow the status of ICA.
References
Amari S (1998) Natural gradient works efficiently in
learning. Neural Comput 10(2):251–276
Amari S, Cardoso JF (1997) Blind source separation:
semiparametric statistical approach. IEEE Trans Sig-
nal Process 45:2692–2700
Amari S, Chen TP, Cichocki A (1997) Stability analysis of
learning algorithms for blind source separation.
Neural Networks 10(8):1345–1351
Amari S, Cichocki A (1998) Adaptive blind signal pro-
cessing – neural network approaches. Proc IEEE
(Special Issue on Blind Identification and Estima-
tion) 86(10):2026–2048
Amari S, Cichocki A, Yang HH (1996) A new learning
algorithm for blind signal separation. In: Touretzky
DS, Mozer MC, Hasselmo ME (eds) Advances in
neural information processing systems (NIPS),
vol 8. MIT Press, Cambridge, pp 757–763
Attias H (1999) Independent factor analysis. Neural
Comput 11:803–851
Bach F, Jordan MI (2002) Kernel independent compo-
nent analysis. JMLR 3:1–48
Bach FR, Jordan MI (2003) Beyond independent com-
ponents: trees and clusters. JMLR 4:1205–1233
Bell A, Sejnowski T (1995) An information maximisation
approach to blind separation and blind deconvolu-
tion. Neural Comput 7:1129–1159
Belouchrani A, Abed-Meraim K, Cardoso JF, Moulines E
(1997) A blind source separation technique using
second order statistics. IEEE Trans Signal Process
45:434–444
Cardoso JF (1989) Source separation using higher-order moments. In: Proceedings of the IEEE international
conference on acoustics, speech, and signal proces-
sing (ICASSP), Paris, France, 23–26 May 1989
Cardoso JF (1997) Infomax and maximum likelihood for
source separation. IEEE Signal Process Lett 4(4):
112–114
Cardoso JF, Laheld BH (1996) Equivariant adaptive
source separation. IEEE Trans Signal Process 44
(12):3017–3030
Cardoso JF, Souloumiac A (1993) Blind beamforming for
non Gaussian signals. IEE Proc‐F 140(6):362–370
Chang C, Ding Z, Yau SF, Chan FHY (2000) A matrix‐pencil approach to blind separation of colored non-
stationary signals. IEEE Trans Signal Process 48(3):
900–907
Choi S (2002) Adaptive differential decorrelation: a nat-
ural gradient algorithm. In: Proceedings of the in-
ternational conference on artificial neural networks
(ICANN), Madrid, Spain. Lecture notes in comput-
er science, vol 2415. Springer, Berlin, pp 1168–1173
Choi S (2003) Differential learning and random walk
model. In: Proceedings of the IEEE international
conference on acoustics, speech, and signal proces-
sing (ICASSP), IEEE, Hong Kong, pp 724–727
Choi S, Cichocki A (2000a) Blind separation of nonsta-
tionary and temporally correlated sources from
noisy mixtures. In: Proceedings of IEEE workshop
on neural networks for signal processing, IEEE,
Sydney, Australia. pp 405–414
Choi S, Cichocki A (2000b) Blind separation of nonsta-
tionary sources in noisy mixtures. Electron Lett
36(9):848–849
Choi S, Cichocki A, Amari S (2000) Flexible independent
component analysis. J VLSI Signal Process 26(1/2):
25–38
Choi S, Cichocki A, Belouchrani A (2002) Second order
nonstationary source separation. J VLSI Signal Pro-
cess 32:93–104
Choi S, Cichocki A, Park HM, Lee SY (2005) Blind
source separation and independent component
analysis: a review. Neural Inf Process Lett Rev 6(1):
1–57
Cichocki A, Amari S (2002) Adaptive blind signal and
image processing: learning algorithms and applica-
tions. Wiley, Chichester
Cichocki A, Unbehauen R (1996) Robust neural net-
works with on‐line learning for blind identification
and blind separation of sources. IEEE Trans Circ
Syst Fund Theor Appl 43:894–906
Comon P (1994) Independent component analysis, a
new concept? Signal Process 36(3):287–314
Fukunaga K (1990) An introduction to statistical pattern
recognition. Academic, New York
Golub GH, Loan CFV (1993) Matrix computations,
2nd edn. Johns Hopkins, Baltimore
Haykin S (2000) Unsupervised adaptive filtering: blind
source separation. Prentice‐Hall
Hyvarinen A (1999) Survey on independent component
analysis. Neural Comput Surv 2:94–128
Hyvarinen A, Hoyer P (2000) Emergence of phase- and shift-invariant features by decomposition of natural
images into independent feature subspaces. Neural
Comput 12(7):1705–1720
Hyvarinen A, Karhunen J, Oja E (2001) Independent
component analysis. Wiley, New York
Hyvarinen A, Oja E (1997) A fast fixed-point algorithm for independent component analysis. Neural Com-
put 9:1483–1492
Jutten C, Herault J (1991) Blind separation of sources,
part I: an adaptive algorithm based on neuromi-
metic architecture. Signal Process 24:1–10
Karhunen J (1996) Neural approaches to independent
component analysis and source separation. In: Pro-
ceedings of the European symposium on artificial
neural networks (ESANN), Bruges, Belgium,
pp 249–266
Kim S, Choi S (2005) Independent arrays or independent
time courses for gene expression data. In: Proceed-
ings of the IEEE international symposium on circuits
and systems (ISCAS), Kobe, Japan, 23–26 May 2005
Kosko B (1986) Differential Hebbian learning. In: Pro-
ceedings of American Institute of Physics: neural
networks for computing, Snowbird. American Insti-
tute of Physics, Woodbury, pp 277–282
Lee TW (1998) Independent component analysis: theory
and applications. Kluwer
Lee TW, Girolami M, Sejnowski T (1999) Independent
component analysis using an extended infomax
algorithm for mixed sub‐Gaussian and super‐Gaussian sources. Neural Comput 11(2):609–633
Lewicki MS, Sejnowski T (2000) Learning overcomplete
representation. Neural Comput 12(2):337–365
Li Y, Cichocki A, Amari S (2006) Blind estimation of
channel parameters and source components for
EEG signals: a sparse factorization approach. IEEE
Trans Neural Networ 17(2):419–431
Liebermeister W (2002) Linear modes of gene expression
determined by independent component analysis.
Bioinformatics 18(1):51–60
MacKay DJC (1996) Maximum likelihood and covariant
algorithms for independent component analysis.
Technical Report Draft 3.7, University of
Cambridge, Cavendish Laboratory
Matsuoka K, Ohya M, Kawamoto M (1995) A neural net
for blind separation of nonstationary signals. Neural
Networks 8(3):411–419
Miskin JW, MacKay DJC (2001) Ensemble learning for
blind source separation. In: Roberts S, Everson R
(eds) Independent component analysis: principles
and practice. Cambridge University Press,
Cambridge, UK, pp 209–233
Molgedey L, Schuster HG (1994) Separation of a mixture
of independent signals using time delayed correla-
tions. Phys Rev Lett 72:3634–3637
Oja E (1995) The nonlinear PCA learning rule and signal
separation – mathematical analysis. Technical Re-
port A26, Helsinki University of Technology, Labo-
ratory of Computer and Information Science
Pearlmutter B, Parra L (1997) Maximum likelihood
blind source separation: a context-sensitive generalization of ICA. In: Mozer MC, Jordan MI, Petsche
T (eds) Advances in neural information processing
systems (NIPS), vol 9. MIT Press, Cambridge,
pp 613–619
Pham DT (1996) Blind separation of instantaneous
mixtures of sources via an independent component
analysis. IEEE Trans Signal Process 44(11):
2768–2779
Plumbley MD (2003) Algorithms for nonnegative inde-
pendent component analysis. IEEE Trans Neural
Network 14(3):534–543
Stone JV (2004) Independent component analysis: a
tutorial introduction. MIT Press, Cambridge
Stone JV, Porrill J, Porter NR, Wilkinson IW (2002)
Spatiotemporal independent component analysis
of event‐related fMRI data using skewed probability
density functions. NeuroImage 15(2):407–421
Tong L, Soon VC, Huang YF, Liu R (1990) AMUSE: a
new blind identification algorithm. In: Proceed-
ings of the IEEE international symposium on cir-
cuits and systems (ISCAS), IEEE, New Orleans,
pp 1784–1787
Welling M, Weber M (2001) A constrained EM algorithm
for independent component analysis. Neural Com-
put 13:677–689