

Blind Signal Separation: Statistical Principles

JEAN-FRANÇOIS CARDOSO, MEMBER, IEEE

Invited Paper

Blind signal separation (BSS) and independent component analysis (ICA) are emerging techniques of array processing and data analysis that aim to recover unobserved signals or “sources” from observed mixtures (typically, the output of an array of sensors), exploiting only the assumption of mutual independence between the signals. The weakness of the assumptions makes it a powerful approach, but it requires us to venture beyond familiar second-order statistics. The objectives of this paper are to review some of the approaches that have been recently developed to address this exciting problem, to illustrate how they stem from basic principles, and to show how they relate to each other.

Keywords—Adaptive arrays, adaptive signal processing, array signal processing, asymptotic stability, blind source separation, higher order statistics, independent component analysis, maximum likelihood estimation, minimum entropy methods, signal analysis, signal processing antennas, signal representations, signal restoration, signal separation, source separation, unsupervised learning.

I. INTRODUCTION

Blind signal separation (BSS) consists of recovering unobserved signals or “sources” from several observed mixtures. Typically the observations are obtained at the output of a set of sensors, where each sensor receives a different combination of the source signals. The adjective “blind” stresses the fact that 1) the source signals are not observed and 2) no information is available about the mixture. This is a sound approach when modeling the transfer from the sources to the sensors is too difficult; it is unavoidable when no prior information is available about the transfer. The lack of prior knowledge about the mixture is compensated by a statistically strong but often physically plausible assumption of independence between the source signals. The so-called blindness should not be understood negatively; the weakness of the prior information is precisely the strength of the BSS model, making it a versatile tool for exploiting the spatial diversity provided by an array of sensors. Promising applications can already be found in the processing of communications signals, e.g., [23], [63],

Manuscript received June 26, 1997; revised July 7, 1998. The author is with C.N.R.S. (Centre National de la Recherche Scientifique) and E.N.S.T. (École Nationale Supérieure des Télécommunications), 75634 Paris Cedex 13, France.

Publisher Item Identifier S 0018-9219(98)06978-3.

[67], [5], in biomedical signals¹ like ECG [30] and EEG [50], [46], in monitoring [37], [35], or as an alternative to principal component analysis (e.g., see [46], [9], [52], and [6]).

The simplest BSS model assumes the existence of n independent signals s_1(t), …, s_n(t) and the observation of as many mixtures x_1(t), …, x_n(t), these mixtures being linear and instantaneous, i.e.,

x_i(t) = \sum_{j=1}^{n} a_{ij} s_j(t)   for each i = 1, …, n.

This is represented compactly by the mixing equation

x(t) = A s(t)    (1)

where s(t) = [s_1(t), …, s_n(t)]^T is an n×1 column vector collecting the source signals, vector x(t) similarly collects the n observed signals, and the square n×n “mixing matrix” A contains the mixture coefficients. Here, as in the following, ^T denotes transposition. The BSS problem consists in recovering the source vector s(t) using only the observed data x(t), the assumption of independence between the entries of the input vector s(t), and possibly some prior information about the probability distribution of the inputs. It can be formulated as the computation of an n×n “separating matrix” B whose output

y(t) = B x(t)    (2)

is an estimate y(t) of the vector s(t) of the source signals.

Fig. 2 shows an example of adaptive separation of (real) digital communications signals. A two-sensor array collects complex-valued noisy mixtures of two source signals which both have a constant modulus envelope. Successful separation upon adaptation is evidenced by the restoration of the constant modulus at each output.

In Fig. 2, the underlying BSS algorithm optimizes a cost function composed of two penalty terms: one for correlation between outputs and one for deviation of the modulus from a constant value. This example introduces several points to be developed below.
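As a rough illustration (this is only a sketch of such a two-term cost, not the exact algorithm used to produce Fig. 2; the weighting of the penalties and the unit target modulus are assumptions), the following Python/NumPy fragment evaluates a decorrelation-plus-constant-modulus cost for a candidate separating matrix B.

    import numpy as np

    def cm_decorrelation_cost(B, X, target_modulus=1.0, weight=1.0):
        """Two-term cost: output decorrelation + constant-modulus deviation.

        B : (n, n) candidate separating matrix
        X : (n, T) array of observed (possibly complex-valued) mixtures
        """
        Y = B @ X                                     # outputs y(t) = B x(t)
        T = X.shape[1]
        C = (Y @ Y.conj().T) / T                      # sample correlation matrix of outputs
        off_diag = C - np.diag(np.diag(C))
        decorr_penalty = np.sum(np.abs(off_diag) ** 2)                  # cross-correlations
        cm_penalty = np.mean((np.abs(Y) ** 2 - target_modulus ** 2) ** 2)  # modulus deviation
        return decorr_penalty + weight * cm_penalty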

— A penalty term involving only pairwise decorrelation (second-order statistics) would not lead to separation.

¹ See http://www.cnl.salk.edu/~tewon/ica_cnl.html, the ICA page of the CNL group, for several biomedical applications.


Fig. 1. Mixing and separating. Unobserved signals s(t); observations x(t); estimated source signals y(t).

Fig. 2. Outputs y1(t) (top row) and y2(t) (bottom row) when using the separating matrix obtained after adaptation based on 0, 50, 100, 150, 200 samples of a 2×2 mixture of constant modulus signals. Each subplot is in the complex plane. The clustering around circles shows the restoration of the constant modulus property.

Source separation must go beyond second-order statistics (see Section II).

— Source separation can be obtained by optimizing a contrast function, i.e., a scalar measure of some distributional property of the output y. The constant modulus property is very specific; more general contrast functions are based on other measures, such as entropy, mutual independence, high-order decorrelations, divergence between the joint distribution of y and some model, etc. Contrast functions are discussed in Section III, where it is seen how they relate to each other and can be derived from the maximum likelihood principle.

— Fast adaptation is possible, even with simple algorithms (see Sections IV and V), and blind identification can be accurate even with a small number of samples (see Section VI on performance analysis).

The basic BSS model can be extended in several directions. Considering, for instance, more sensors than sources, noisy observations, and complex signals and mixtures, one obtains the standard narrow band array processing/beamforming model. Another extension is to consider convolutive mixtures: this results in a multichannel blind deconvolution problem. These extensions are of practical importance, but this paper is restricted to the simplest model (i.e., real signals, as many sensors as sources, nonconvolutive mixtures, and noise-free observations) because it captures the essence of the BSS problem and because our objective is to present the basic statistical ideas, focusing on principles. Nonetheless, some references to papers addressing more general models are provided in the last section.

The paper is organized as follows: Section II discusses blind identifiability; Sections III and IV present contrast functions and estimating functions, starting from information-theoretic ideas and moving to suboptimal high-order approximations; adaptive algorithms are described in Section V; Section VI addresses some performance issues.

II. CAN IT BE DONE? MODELING AND IDENTIFIABILITY

When is source separation possible? To what extent can the source signals be recovered? What are the properties of the source signals allowing for partial or complete blind recovery? These issues are addressed in this section.

A. The BSS Model

Source separation exploits primarily spatial diversity, that is, the fact that different sensors receive different mixtures of the sources. Spectral diversity, if it exists, could also be exploited, but the approach of source separation is essentially spatial, meaning that it looks for structure across the sensors, not across time. The consequence of ignoring any time structure is that the information contained in the data is represented exhaustively by the sample distribution of the observed vector x (as depicted graphically in Fig. 3). Thus, BSS becomes the problem of identifying the probability distribution of a vector x = A s, given a sample distribution. In this perspective, the statistical model has two components: the mixing matrix A and the probability distribution of the source vector s.

— Mixing matrix. The mixing matrix A is the parameter of interest. Its columns are assumed to be linearly independent so that it is invertible (see [13] for the discussion of a more general case).

There is something special about having an invertible matrix as the unknown parameter, because matrices represent linear transformations. Indeed, model (1) is a particular instance of a transformation model. Furthermore, the set of all invertible n×n matrices forms a multiplicative group. This simple fact has a profound impact on source separation because it allows for the design of algorithms with uniform performance, i.e., whose behavior is completely independent of the particular mixture (Sections V-A and VI-C).

— Source distribution. The probability distribution of each source is a “nuisance parameter,” meaning that we are not primarily interested in it even though knowing or estimating these distributions is necessary to estimate efficiently the parameter of interest. Even if we say nothing about the distribution of each source, we say a lot about their joint distribution by the key assumption of mutual source independence. If each source s_i is assumed to have a probability density function (pdf) denoted p_i(·), the independence assumption has a simple mathematical expression. The (joint) pdf of the source vector s is

p(s) = \prod_{i=1}^{n} p_i(s_i)    (3)
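To make the model concrete, here is a minimal NumPy sketch that draws independent sources from arbitrary marginal densities, so that the joint pdf factors as in (3), and forms the observations x(t) = A s(t) of (1). The particular densities and the mixing matrix are illustrative assumptions, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 2, 1000

    # Independent sources: the joint pdf factors as in (3).
    s = np.vstack([
        rng.uniform(-np.sqrt(3), np.sqrt(3), T),   # uniform, zero mean, unit variance
        rng.laplace(0.0, 1.0 / np.sqrt(2), T),     # Laplacian, zero mean, unit variance
    ])

    A = np.array([[1.0, 0.6],      # an arbitrary invertible mixing matrix (assumption)
                  [0.4, 1.0]])
    x = A @ s                       # observed mixtures, model (1)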


Fig. 3. Sample distributions of (x1, x2) when x = A s for five different transformation matrices A and three pairs of distributions for (s1, s2). From left to right: the identity transform, permutation of the sources, sign change, a π/3 rotation, and a generic linear transform.

i.e., it is the product of the densities of all the sources (the marginal densities). Source separation techniques differ widely by the (explicit or implicit) assumptions made on the individual distributions of the sources. There is a whole range of options.

1) The source distributions are known in advance.
2) Some features are known (moments, heavy tails, bounded support, etc.).
3) They belong to a parametric family.
4) No distribution model is available.

A priori, the stronger the assumption, the narrower the applicability. However, well-designed approaches are in fact surprisingly robust even to gross errors in modeling the source distributions, as shown below. For ease of exposition, zero-mean sources are assumed throughout, i.e.,

E[s_i] = 0,   1 ≤ i ≤ n.    (4)

B. Blind Identifiability

The issue of blind identifiability is to understand to which extent matrix A is determined from the sole distribution of the observed vector x = A s. The answer depends on the distribution of s and on what is known about it.

A square matrix C is said to be nonmixing if it has one and only one nonzero entry in each row and each column. If C is nonmixing, then y = C s is a copy of s, i.e., its entries are identical to those of s up to permutations and changes of scales and signs. Source separation is achieved if such a copy is obtained. When the distribution of s is unknown, one cannot expect to do any better than signal copy, but the situation is a bit different if some prior information about the distribution of s is available. If the sources have distinct distributions, a possible permutation can be detected; if the scale of a given source is known, the amplitude of the corresponding column of A can be estimated, etc.
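A small sketch of this “copy up to permutation, scale, and sign” criterion: given a mixing matrix and a candidate separating matrix (the function name and the tolerance are assumptions), it checks whether the global matrix C = B A has exactly one significant entry per row and per column.

    import numpy as np

    def is_nonmixing(B, A, tol=0.1):
        """Check that the global matrix C = B A is a 'copy' operator:
        one dominant entry per row and per column (permutation, scale, sign)."""
        C = B @ A
        # Normalize each row by its largest absolute entry.
        C = C / np.max(np.abs(C), axis=1, keepdims=True)
        significant = np.abs(C) > tol
        return (significant.sum(axis=0) == 1).all() and (significant.sum(axis=1) == 1).all()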

Some intuition about identifiability can be gained by considering simple examples of 2×2 mixing. Each row of Fig. 3 shows (sample) distributions of a pair (s_1, s_2) of independent variables after various linear transforms. The columns successively show the identity transform, a permutation of the sources, a sign change, and the effect of a rotation and of a nondescript linear transform.

Visual inspection of the transformed distribution compared to the original one gives a feeling of how well the transform matrix A can be identified based on the observation of a mixture. The first row of Fig. 3 shows a case where the second column of A can be identified only up to sign, because s_2 is symmetrically distributed about the origin (and therefore has the same distribution as −s_2). The second row shows a more severe indetermination. Here, s_1 and s_2 have the same symmetric distribution, and the transform can be determined only up to arbitrary changes of sign and a permutation. The last row shows the most severe case. Here, s_1 and s_2 are normally distributed with equal variance, so that their joint distribution is invariant under rotation.

These simple examples suggest that A can be blindly identified (possibly up to some indeterminations induced by the symmetries in the distribution of the source vector) in the case of known source distributions. However, this knowledge is not necessary; the eye certainly can capture the distortion in the last columns of Fig. 3, even without reference to the undistorted shapes in the first column. This is because the graphical “signature of independence” (the pdf shape in the first column) clearly appears as distorted in the last column. This intuition is supported by the following statement adapted from [25] after a theorem of Darmois (see also [13]).

For a vector s of independent entries with at most one Gaussian entry and for any invertible matrix C, if the entries of y = C s are independent, then y is a copy of s (C is nonmixing).

Thus, unless a linear transform is nonmixing, it turns a vector of independent entries (at most one being Gaussian) into a vector whose entries are not independent. This is a key result because it means that blind signal separation can be achieved by restoring statistical independence. This is not only a theoretical result about blind identifiability, but it also suggests that BSS algorithms could be devised by maximizing the independence between the outputs of a separating matrix. Section III shows that the maximum likelihood principle does support this idea and leads to a specific measure of independence.

1) Independence and Decorrelation: Blind separation can be based on independence, but independence cannot be reduced to the simple decorrelation conditions that E[y_i y_j] = 0 for all pairs i ≠ j. This is readily seen from the fact that there are, by symmetry, only n(n−1)/2 such conditions (one for each pair of sources), while there are n² unknown parameters.

Second-order information (decorrelation), however, can be used to reduce the BSS problem to a simpler form. Assume, for simplicity, that the source signals have unit variance, so that their covariance matrix is the identity matrix: E[s s^T] = I. Such a vector is said to be spatially white. Let W be a whitening matrix for x, that is, a matrix such that z = W x is spatially white.


The composite transform U = W A is necessarily a rotation matrix, because it relates the two spatially white vectors s and z = W x. Therefore, whitening or “sphering” the data reduces the mixture to a rotation matrix. It means that a separating matrix can be found as a product B = V W, where W is a whitening matrix and V is a rotation matrix. Note that any further rotation of z = W x preserves spatial whiteness, so that two equivalent approaches to exploiting source decorrelation are 1) to find B as B = V W with W a spatial whitener and V a rotation, or 2) to find B directly under the whiteness constraint E[y y^T] = I. For further reference, we write the whiteness constraint as

E[H_2(y)] = 0,   where   H_2(y) = y y^T − I.    (5)

Spatial whiteness imposes n(n+1)/2 constraints, leaving n(n−1)/2 unknown (rotation) parameters to be determined by other than second-order information. Second-order information is able to do “about half of the BSS job.”
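A minimal NumPy sketch of this whitening step (the eigenvalue-based inverse square root is one of several possible choices; the variable names are assumptions): it computes W from the sample covariance of x, so that z = W x is approximately spatially white.

    import numpy as np

    def whitening_matrix(X):
        """Return W such that z = W x has sample covariance close to the identity.

        X : (n, T) array of zero-mean observations.
        """
        T = X.shape[1]
        Rx = (X @ X.T) / T                                # sample covariance of x
        eigval, eigvec = np.linalg.eigh(Rx)
        W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # inverse square root of Rx
        return W

    # z = whitening_matrix(x) @ x   # (z @ z.T) / T is then close to the identity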

The prewhitening approach is sensible from an algorithmic point of view, but it is not necessarily statistically efficient (see Section VI-B). Actually, enforcing the whiteness constraint amounts to believing that second-order statistics are “infinitely more reliable” than any other kind of statistics. This is, of course, untrue.

C. Likelihood

This section examines the likelihood of source separation models in a simple, graphical way. The likelihood in a given model is the probability of a data set as a function of the parameters of the model. The simple model for vector x discussed in Section II-A is parameterized by the pair (A, r), which is made of the mixing matrix A and the density r for the source vector s. The density of x for a given pair (A, r) is classically given by

p(x; A, r) = |det A|^{-1} r(A^{-1} x).    (6)

If T samples x(1), …, x(T) of x are modeled as independent, then p(x(1), …, x(T); A, r) = \prod_{t=1}^{T} p(x(t); A, r). Thus, the normalized (i.e., divided by T) log likelihood of x(1), …, x(T) for the parameter pair (A, r) is

(1/T) \sum_{t=1}^{T} log p(x(t); A, r) = (1/T) \sum_{t=1}^{T} log r(A^{-1} x(t)) − log |det A|.    (7)

Figs. 5–7 show the likelihood landscape when A is varied while r is kept fixed. For each figure, T independent realizations of s are drawn according to some pdf and are mixed with a 2×2 matrix to produce samples of x. Therefore, this data set

follows exactly model (1) with a true mixing matrix A_0 and a true source distribution. The figures show the log likelihood when A is varied around its true value A_0 while the model density r is kept fixed. These figures illustrate the impact of the choice of a particular model density.
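A direct NumPy transcription of (7), for a batch of observations and a hypothesized factorized source density; the particular heavy-tailed model density used below is only a placeholder assumption.

    import numpy as np

    def normalized_log_likelihood(A, X, log_source_pdf):
        """Normalized log likelihood (7) of the data X under mixing matrix A.

        A              : (n, n) candidate mixing matrix
        X              : (n, T) observed samples
        log_source_pdf : maps an (n, T) array of source values to the (T,)
                         array of log r(s) under the hypothesized model
        """
        S_hat = np.linalg.solve(A, X)          # A^{-1} x(t) for all t
        return np.mean(log_source_pdf(S_hat)) - np.log(np.abs(np.linalg.det(A)))

    # Example hypothesized model: independent sources with density r_i(s) = 1 / (pi * cosh(s)).
    def log_sech_model(S):
        return np.sum(-np.log(np.cosh(S)) - np.log(np.pi), axis=0)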

In each of these figures, the matrix parameter A is varied in two directions in matrix space according to

Fig. 4. Decorrelation leaves an unknown rotation.

Fig. 5. Log likelihood with a slightly misspecified model for the source distribution. Maximum is reached close to the true value.

Fig. 6. Log likelihood with a Gaussian model for the source distribution. There is no contrast in the skew-symmetric direction.

Fig. 7. Log likelihood with a widely misspecified model for the source distribution; maximum is reached for a mixing system.

, where is the matrix

(8)


Fig. 8. How the maximum likelihood estimator is misled: (a) true distribution, (b) hypothesized distribution, and (c) estimated distribution.

This is just a convenient way to generate a neighborhood of the identity matrix. For small λ_s and λ_a,

(9)

Therefore, λ_s and λ_a are called the symmetric and skew-symmetric parameters, respectively. Each one controls a particular deviation away from the identity.

In Fig. 5, the true source distributions are uniform on [−1, 1], but the model takes each source density to be a mixture of two normal distributions with the same variance but different means (as in the second column of Fig. 11). True and hypothesized sample distributions of the sources are displayed in the upper left and right corners of the plot. Even though an incorrect model is used for the source distribution, the figure shows that the likelihood is maximal around λ_s = λ_a = 0, i.e., the most likely mixing matrix given the data and the model is close to the true matrix A_0.

In Fig. 6, the true sources are almost binary (see upper left corner), but a Gaussian model is used for the source densities. The figure shows that the likelihood does not depend on the skew-symmetric parameter λ_a, again evidencing the insufficiency of Gaussian modeling.

In Fig. 7, the sources are modeled as in Fig. 5, but the true (and identical) source distributions are now mixtures of normal distributions with the same mean but different variances (as in the second column of Fig. 11). A disaster happens: the likelihood is no longer maximum in the vicinity of the true mixing matrix. Actually, if the value of A maximizing the likelihood is used to estimate the source signals, one obtains maximally mixed sources. This is explained in Section III-A and Fig. 8.

The main points of this informal study are the necessity of non-Gaussian modeling (Fig. 6), the possibility of using only an approximate model of the sources (Fig. 5), and the existence of a limit to the misspecification of the source model (Fig. 7). How wrong can the source distribution model be? This is quantified in Section VI-A.

III. CONTRAST FUNCTIONS

This section introduces contrast functions, which are objective functions for source separation. The maximum likelihood principle is used as a starting point, suggesting several information-theoretic objective functions (Section III-A), which are then shown to be related to another class of objective functions based on high-order correlations (Section III-B).

Minimum contrast estimation is a general technique of statistical inference [57] which encompasses several techniques like maximum likelihood or least squares. It is relevant for blind deconvolution (see [36] and [11]) and has been introduced in the related BSS problem by Comon in [25]. In both instances, a contrast function is a real function of a probability distribution. To deal with such functions, a special notation is useful: for y a given random variable, f(y) generically denotes a function of y, while φ[y] denotes a function of the distribution of y. For instance, the mean value of y is denoted E[y].

Contrast functions for source separation (or simply “contrasts,” for short) are generically denoted φ[y]. They are real-valued functions of the distribution of the output y = B x, and they serve as objectives: they must be designed in such a way that source separation is achieved when they reach their minimum value. In other words, a valid contrast function should, for any matrix C, satisfy φ[C s] ≥ φ[s], with equality only when C s is a copy of the source signals. Since the mixture can be reduced to a rotation matrix by enforcing the whiteness constraint E[y y^T] = I (Section II-B), one can also consider orthogonal contrast functions. These are denoted as φ°[y] and must be minimized under the whiteness constraint.

A. Information Theoretic Contrasts

The maximum likelihood (ML) principle leads to several contrasts which are expressed via the Kullback divergence. The Kullback divergence between two probability density functions f and g on R^n is defined as

K(f | g) = \int f(u) log [ f(u) / g(u) ] du    (10)

whenever the integral exists [27]. The divergence between the distributions of two random vectors u and v is concisely denoted K(u | v). An important property of K is that K(u | v) ≥ 0, with equality if and only if u and v have the same distribution. Even though K is not a distance (it is not symmetric), it should be understood as a statistical way of quantifying the closeness of two distributions.
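As a numerical illustration (a naive one-dimensional discretization, not an estimator proposed in the paper), the divergence (10) between two densities can be approximated on a grid:

    import numpy as np

    def kullback_divergence(f, g, grid):
        """Approximate K(f | g) = \int f log(f/g) du by a Riemann sum on 'grid'."""
        fu, gu = f(grid), g(grid)
        du = grid[1] - grid[0]
        mask = fu > 0                       # the convention 0 log 0 = 0
        return np.sum(fu[mask] * np.log(fu[mask] / gu[mask])) * du

    grid = np.linspace(-10, 10, 4001)
    gauss = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    laplace = lambda u: 0.5 * np.exp(-np.abs(u))
    # kullback_divergence(laplace, gauss, grid) > 0, and it is not symmetric in its arguments.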

1) Matching Distributions: Likelihood and Infomax: The likelihood landscapes displayed in Figs. 5–7 assume a particular pdf r for the source vector. Denoting by s̃ a random vector with distribution r, simple calculus shows that the normalized log likelihood (7) converges, for a large number of samples, to

−K(y | s̃) + constant,   where y = A^{-1} x.    (11)

Therefore, Figs. 5–7 approximately display (up to a constant term) minus the Kullback divergence between the distribution of y = A^{-1} x and the hypothesized distribution of the sources. This shows that the maximum likelihood principle is associated with a contrast function

φ_ML[y] = K(y | s̃).    (12)


and the normalized log likelihood can be seen, via (11), as an estimate of −φ_ML[y] (up to a constant). The ML principle thus says something very simple when applied to the BSS problem, that is, find the matrix A such that the distribution of y = A^{-1} x is as close as possible (in the Kullback divergence) to the hypothesized distribution of the sources.

The instability problem illustrated by Fig. 7 may now be understood as follows: in this figure, the likelihood is maximum when the global transform is a rotation, because the true source distribution is closer to the hypothesized source distribution after such a rotation. As Fig. 8 shows, after the rotation the areas of highest density of y correspond to the points of highest probability of the hypothesized source model.

A different approach to deriving the contrast function (12) is very popular among the neural network community. Denote by G_i the distribution function associated with the hypothesized source density r_i,

G_i(s) = \int_{-\infty}^{s} r_i(u) du    (13)

so that G_i' = r_i, and denote G(y) = [G_1(y_1), …, G_n(y_n)]^T. An interpretation of the infomax principle (see [8], [14], and references therein) suggests the contrast function

φ_IM[y] = −H[G(y)]    (14)

where H[·] denotes the Shannon entropy (for a random vector u with density p, this is H[u] = −\int p(u) log p(u) du, with the convention 0 log 0 = 0). This idea can be understood as follows: on one hand, G_i(y_i) is uniformly distributed on [0, 1] if y_i has pdf r_i; on the other hand, the uniform distribution has the highest entropy among all distributions on [0, 1] [27]. Therefore, G(y) has the highest entropy when each output y_i is distributed according to the hypothesized density r_i. The infomax idea, however, yields the same contrast as the likelihood: φ_IM[y] = φ_ML[y]. The connection between maximum likelihood and infomax was noted by several authors (see [56], [18], [49]).

2) Matching the Structure via Mutual Information: The simple likelihood approach described above is based on a fixed hypothesis about the distribution of the sources. This becomes a problem if the hypothesized source distributions differ too much from the true ones, as illustrated by Figs. 7 and 8. This remark suggests that the observed data should be modeled by adjusting both the unknown system and the distributions of the sources. In other words, one should minimize the divergence K(y | s̃) with respect to B (via the distribution of y = B x) and with respect to the model distribution of s̃. The last minimization problem has a simple and intuitive theoretical solution. Denote by ȳ a random vector with 1) independent entries and 2) each entry distributed as the corresponding entry of y. A classic property (e.g., see [27]) of the Kullback divergence is that

K(y | s̃) = K(y | ȳ) + K(ȳ | s̃)    (15)

for any vector s̃ with independent entries. Since K(y | ȳ) does not depend on s̃, (15) shows that K(y | s̃) is minimized in s̃ by minimizing its second term, i.e., this is simply achieved by taking s̃ = ȳ, for which K(ȳ | s̃) = 0, so that the minimum is K(y | ȳ). Having minimized the likelihood contrast with respect to the source distribution, our program is completed if we minimize the latter with respect to B, i.e., if we minimize the contrast function

φ_MI[y] = K(y | ȳ).    (16)

The Kullback divergence K(y | ȳ) between a distribution and the closest distribution with independent entries is traditionally called the mutual information (between the entries of y). It satisfies φ_MI[y] ≥ 0, with equality if and only if y is distributed as ȳ. By definition of ȳ, this happens when the entries of y are independent. In other words, φ_MI[y] measures the independence between the entries of y. Thus, mutual information appears as the quantitative measure of independence associated to the maximum likelihood principle.

Note further that K(ȳ | s̃) = \sum_i K(y_i | s̃_i) (because both ȳ and s̃ have independent entries). Therefore

K(y | s̃) = K(y | ȳ) + \sum_i K(y_i | s̃_i)    (17)

so that the decomposition (15) or (17) of the global distribution-matching criterion K(y | s̃) should be understood as

Total mismatch = Deviation from independence + Marginal mismatch.

Therefore, maximizing the likelihood with fixed assumptions about the distributions of the sources amounts to minimizing a sum of two terms: the first term is the true objective (mutual information as a measure of independence), while the second term measures how far the (marginal) distributions of the outputs are from the assumed distributions.
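As a rough illustration of (16), and not an estimator advocated in the paper, the mutual information between two outputs can be approximated from a two-dimensional histogram; the bin count is an arbitrary assumption.

    import numpy as np

    def mutual_information_2d(y1, y2, bins=30):
        """Histogram-based estimate of the mutual information between y1 and y2."""
        pxy, _, _ = np.histogram2d(y1, y2, bins=bins)
        pxy = pxy / pxy.sum()                        # joint probabilities
        px = pxy.sum(axis=1, keepdims=True)          # marginal of y1
        py = pxy.sum(axis=0, keepdims=True)          # marginal of y2
        mask = pxy > 0
        return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))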

3) Orthogonal Contrasts: If the mixing matrix has been reduced to a rotation matrix by whitening, as explained in Section II-B, contrast functions like φ_ML or φ_MI can still be used. The latter takes an interesting alternative form under the whiteness constraint E[y y^T] = I. In this case, one can show that φ_MI[y] is, up to a constant term, equal to the sum of the Shannon entropies of each output. Thus, under the whiteness constraint, minimizing the mutual information between the entries of y is equivalent to minimizing the sum of the entropies of the entries of y, and we define

φ°_MI[y] = \sum_{i=1}^{n} H[y_i].    (18)

There is a simple interpretation: mixing the entries of s tends to increase their entropies. It seems natural to find separated source signals as those with minimum marginal entropies. It is also interesting to notice that minus H[y_i] is (up to a constant) the Kullback divergence between the distribution of y_i and the zero-mean unit-variance normal distribution. Therefore, minimizing the sum of the marginal


Fig. 9. Gaussianization by mixing. Histograms of y1 (top row) and y2 (bottom row) when y is rotated by απ/4 for α = 0, 1/4, 1/2, 3/4, 1. Each subplot also shows the estimated kurtosis k1 and k2 [defined at (21)] decreasing (in absolute value) upon mixing.

entropies is also equivalent to driving the marginal distributions of y as far away as possible from normality. Again, the interpretation is that mixing tends to Gaussianize the marginal distributions, so that a separating technique should go in the opposite direction. Fig. 9 is a visual illustration of the tendency towards normality by mixing. The first column shows histograms for two independent variables s_1 and s_2 with a bimodal distribution and (superimposed on it as a solid line) the best Gaussian approximation. The following columns show the histograms after rotations by steps of π/16, going from 0 to π/4, where mixing is maximal. The tendency towards normality is very apparent.

The entropic form (18) of the mutual information was used as a starting point by Comon [25]; it remains a valid contrast under the weaker constraint that B is a volume-preserving transformation [55].

4) Discussion: The canonical contrast for source separation is the mutual information φ_MI, because it expresses the key property of source independence and nothing else. It does not include any explicit or implicit assumption about the distributions of the sources. On the other hand, if the source distributions are known, φ_ML is more appropriate because it directly expresses the fit between data and model. Also, φ_ML is easier to minimize because its gradient is easily estimated [see (31)], while estimating the gradient of φ_MI is computationally demanding [59]. Even when the source distributions are unknown, one may use φ_ML with hypothesized source distributions which only need to be close enough to the true distributions (Section II-C provides a qualitative explanation, Section VI-A provides a quantitative statement, and Section V-B discusses adapting the model distributions). Another approach is to approximate the Kullback-based contrasts using high-order statistics, as examined next.

B. High-Order Approximations

High-order statistics can be used to define contrast functions which are simple approximations to those derived from the ML approach. High-order information is most simply expressed by using cumulants. The discussion being limited to cumulants of orders two and four, only the following definitions are needed. For zero-mean random variables a, b, c, d, second-order cumulants are identical to second-order moments, Cum(a, b) = E[a b], and fourth-order cumulants are

Cum(a, b, c, d) = E[a b c d] − E[a b] E[c d] − E[a c] E[b d] − E[a d] E[b c].    (19)

Whenever the random variables can be split into two mutually independent groups, their cumulant is zero. Therefore, independence beyond second-order decorrelation can be tested easily using high-order cumulants.

For simplicity, the following notation for the cumulants of the elements of a given vector y is used throughout: C_{ij}[y] = Cum(y_i, y_j) and C_{ijkl}[y] = Cum(y_i, y_j, y_k, y_l). Since the source vector has independent entries, all its cross-cumulants vanish

C_{ij}[s] = σ_i^2 δ_{ij},   C_{ijkl}[s] = k_i δ_{ijkl}    (20)

where δ is the Kronecker symbol, and where we have defined the variance σ_i^2 and the kurtosis k_i of the ith source as the second- and fourth-order autocumulants of s_i

σ_i^2 = Cum(s_i, s_i) = E[s_i^2],   k_i = Cum(s_i, s_i, s_i, s_i) = E[s_i^4] − 3 (E[s_i^2])^2.    (21)
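A small NumPy helper for the sample versions of these quantities (a plain plug-in estimator; the paper does not prescribe a particular estimator):

    import numpy as np

    def sample_kurtosis(y):
        """Fourth-order autocumulant (21) of a zero-mean signal: k = E[y^4] - 3 (E[y^2])^2."""
        y = y - np.mean(y)
        return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

    def sample_cross_cumulant4(a, b, c, d):
        """Fourth-order cross-cumulant (19) of zero-mean signals."""
        E = lambda u: np.mean(u)
        return (E(a * b * c * d) - E(a * b) * E(c * d)
                - E(a * c) * E(b * d) - E(a * d) * E(b * c))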

The likelihood contrast φ_ML[y] is a measure of mismatch between the output distribution and a model source distribution. A cruder measure can be defined from the quadratic mismatch between the cumulants of y and those of the source model

φ_2[y] = \sum_{ij} (C_{ij}[y] − C_{ij}[s])^2,   φ_4[y] = \sum_{ijkl} (C_{ijkl}[y] − C_{ijkl}[s])^2.

Are φ_2 and φ_4 contrast functions, as introduced in the beginning of this section? Clearly φ_2 is not a contrast, because it only expresses the decorrelation between the entries of y. On the contrary, one can show that φ_4 is a contrast if all the sources have known nonzero kurtosis. Even though fourth-order information is by itself sufficient to solve the BSS problem, it is interesting to use φ_2 and φ_4 in conjunction because they jointly provide an approximation to the likelihood contrast; if the sources are symmetrically distributed with distributions that are close to normal, then

φ_ML[y] ≈ (1/12) φ_2[y] + (1/48) φ_4[y].    (22)

Room is lacking to discuss the validity of this approximation (which stems from an Edgeworth expansion, see Section V-B). The point, however, is not to determine how closely (22) approximates φ_ML[y], but rather to follow the suggestion that second- and fourth-order information could be used jointly.


1) Orthogonal Contrasts: We consider cumulant-based orthogonal contrasts. The orthogonal approach, which enforces whiteness, i.e., E[y y^T] = I, corresponds to replacing the weight of the second-order term in (22) by an infinite weight (optimal weighting is considered in [19]; see also Section V-B), or equivalently to minimizing φ_4[y] under the whiteness constraint. Simple algebra shows that if E[y y^T] = I, then φ_4[y] is equal (up to a constant additive term) to

φ°_4[y] = −\sum_{i} k_i k_i(y)    (23)

where we have defined k_i(y) = C_{iiii}[y], the kurtosis of the ith output. Since this contrast function is the expectation of a function of y, it is particularly simple to estimate by a sample average.

Recall that the contrast function φ_ML defined in (12) depends on a source model, i.e., it is defined using a hypothetical density for the source distribution. Similarly, the fourth-order approximation requires a hypothesis about the sources, but it is only a fourth-order hypothesis in the sense that only the kurtosis k_i of each source must be specified in definition (23). In the same manner that minimizing φ_ML over the source distribution yields the mutual information contrast φ_MI, minimizing φ°_4 (which approximates φ_ML) over the kurtosis of each source yields an approximation to the mutual information which is

φ°_ICA[y] = \sum_{ijkl \neq iiii} C_{ijkl}[y]^2 = constant − \sum_i k_i(y)^2.    (24)

This was first obtained by Comon [26] (along a slightly different route) and by Gaeta et al. [40] by approximating the likelihood by a Gram–Charlier expansion. This contrast is similar to φ_MI also in that its first form involves only terms measuring the (fourth-order) independence between the entries of y. Its second form stems from the fact that the total sum \sum_{ijkl} C_{ijkl}[y]^2 is constant if the whiteness constraint holds (e.g., see [25]). It is also similar to (23) when k_i(y) ≈ k_i, which is indeed the case close to separation.

One benefit of considering fourth-order orthogonal contrasts like φ°_ICA is that they can be optimized by the Jacobi technique: the unknown rotation (Section II-B) can be found as a sequence of 2×2 rotations applied in sequence to all pairs (i, j) of outputs, with the optimal angle at each step often being available in closed form. Comon [25] has such a formula for φ°_ICA in the case of real signals.
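A schematic Jacobi-style optimization of the second form of (24), maximizing the sum of squared output kurtoses over a sequence of plane rotations on whitened data. Here the best angle for each pair is found by a simple grid search rather than a closed-form formula, which is only a didactic simplification; the sweep and angle counts are assumptions.

    import numpy as np

    def _kurt(u):
        u = u - u.mean()
        return np.mean(u ** 4) - 3.0 * np.mean(u ** 2) ** 2

    def jacobi_rotation_ica(Z, n_sweeps=5, n_angles=180):
        """Rotate whitened data Z (n, T) so as to maximize sum_i k_i(y)^2, pair by pair."""
        n = Z.shape[0]
        Y = Z.copy()
        for _ in range(n_sweeps):
            for i in range(n - 1):
                for j in range(i + 1, n):
                    best_angle, best_score = 0.0, -np.inf
                    for a in np.linspace(0, np.pi / 2, n_angles, endpoint=False):
                        c, s = np.cos(a), np.sin(a)
                        yi = c * Y[i] + s * Y[j]
                        yj = -s * Y[i] + c * Y[j]
                        score = _kurt(yi) ** 2 + _kurt(yj) ** 2
                        if score > best_score:
                            best_angle, best_score = a, score
                    c, s = np.cos(best_angle), np.sin(best_angle)
                    Y[i], Y[j] = c * Y[i] + s * Y[j], -s * Y[i] + c * Y[j]
        return Y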

Independence can also be tested on a smaller subset of cross-cumulants, with

φ°_JADE[y] = \sum_{ijkl \neq iikl} C_{ijkl}[y]^2.    (25)

The motivation for using this specific subset is that φ°_JADE also is a joint diagonalization criterion, implying that it can be optimized by the Jacobi technique, for which the rotation angles can be found in closed form, even in the complex case [22]. A similar technique is described in [31].

Simpler contrasts can be used if the kurtoses of the sources are known. For instance, (23) suggests, for negative

Fig. 10. Variation of orthogonal contrast functions (solid: φ°_ICA; dash-dots: φ°_JADE; dashes: φ°_m; dots: φ°_4) when two sources with kurtosis k1 and k2 are rotated between −π/2 and π/2. Left: (k1, k2) = (−1, −1); center: (k1, k2) = (−1, 0); right: (k1, k2) = (−0.5, 1).

kurtosis sources, a very simple contrast

φ°_m[y] = \sum_i E[y_i^4]    (26)

(see [21], [53], [47], and [44]). Actually, the condition that k_i + k_j < 0 for all pairs of sources is sufficient for the stationary points of this orthogonal contrast function to be locally stable (see Section VI-A).

Some properties of the fourth-order contrasts discussed above are illustrated by Fig. 10, which displays the variation of some orthogonal contrast functions in the two-source case. A 2×1 source vector with kurtosis (k_1, k_2) is rotated into y by an angle θ. On the left panel, the sources have identical kurtosis (k_1, k_2) = (−1, −1); all four contrasts are minimized at integer multiples of π/2. On the center panel, one source is Gaussian, (k_1, k_2) = (−1, 0); the contrasts show smaller variations, except for φ°_4. Note that φ°_4 “knows” that one source has zero kurtosis, thus distinguishing between even and odd multiples of π/2. On the right panel, (k_1, k_2) = (−0.5, 1), so that k_1 + k_2 > 0, which violates the condition for φ°_m to be a contrast: its minima become maxima and vice versa. This is the same phenomenon illustrated by Fig. 7.

IV. ESTIMATING FUNCTIONS

By design, all valid contrast functions reach their minima at a separating point when the model holds; in this sense, no one is better than another. In practice, however, contrasts are only estimated from a finite data set. Sample-based contrasts depend not on the distribution of y but on its sample distribution. Estimation from a finite data set introduces stochastic errors depending on the available samples and also on the contrast function. Thus, a statistical characterization of the minima of sample-based contrast functions is needed and will provide a basis for comparing contrast functions. For this purpose, the notion of estimating function is introduced. It is also closely related to gradient algorithms for BSS (Section V-A).

A. Relative Gradient

The variation of a contrast function under a linear transform of y may be expressed by defining a “relative gradient.” This specific notion builds on the fact that the parameter of interest is a square matrix.


Fig. 11. Some densities and their associated scores.

Definition: An infinitesimal transform of y is y → (I + 𝓔) y, where 𝓔 is a “small” n×n matrix. If φ is smooth enough, φ[(I + 𝓔) y] can be expanded as

φ[(I + 𝓔) y] = φ[y] + \sum_{ij} 𝓔_{ij} G_{ij} + o(𝓔)    (27)

with G_{ij} being the partial derivative of φ[(I + 𝓔) y] with respect to 𝓔_{ij} at 𝓔 = 0. These coefficients form an n×n matrix, denoted as ∇φ[y] and called the “relative gradient” [21] of φ at y. In matrix form, expansion (27) reads

φ[(I + 𝓔) y] = φ[y] + ⟨𝓔 | ∇φ[y]⟩ + o(𝓔)    (28)

where ⟨A | B⟩ = trace(A^T B) is the Euclidean scalar product between matrices.

Note that the relative gradient is defined without explicit reference to the possible dependence of y on B, as in y = B x. Thus, it actually characterizes the first-order variation of the contrast function itself. It is of course possible to relate ∇φ[y] to a regular gradient with respect to B if y = B x. Elementary calculus yields

∇φ[y] = (∂φ[B x] / ∂B) B^T.    (29)

The notion of “natural gradient” was independently introduced by Amari [1]. In general, it is distinct from the relative gradient in that the latter is defined in any continuous group of transformations while the former is defined in any smooth statistical model. However, for the BSS model, which as a statistical transformation model combines both features, the two ideas yield the same class of algorithms (Section V-A).

1) Score Functions: The source densities used in (3) and (7) to define the likelihood of a BSS model enter into the estimating function via their log-derivatives, the so-called score functions, defined for a density r_i as

ψ_i(s) = −r_i'(s) / r_i(s) = −(d/ds) log r_i(s).    (30)

Fig. 11 displays some densities and their associated score functions. Note that the score for the most basic distribution is the most basic function: if s is a zero-mean unit-variance Gaussian variable, then the associated score is ψ(s) = s. Actually, Gaussian densities

are precisely those densities associated with linear score functions. Thus, the necessity of non-Gaussian modeling (recall Section II) translates into the necessity of considering nonlinear score functions.
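For concreteness, here are the scores (30) of a few unit-variance model densities; these particular choices (Gaussian, Laplacian, sech-squared) are illustrative, not prescribed by the paper.

    import numpy as np

    # Score psi(s) = -(d/ds) log r(s) for a few unit-variance model densities.

    def score_gaussian(s):
        return s                               # r(s) ∝ exp(-s^2/2): linear score

    def score_laplace(s):
        return np.sqrt(2.0) * np.sign(s)       # r(s) ∝ exp(-sqrt(2)|s|)

    def score_sech2(s):
        # r(s) ∝ sech^2(c s) with c = pi/(2 sqrt(3)) has unit variance;
        # a smooth "heavy-tailed" model giving a tanh-shaped score.
        c = np.pi / (2.0 * np.sqrt(3.0))
        return 2.0 * c * np.tanh(c * s)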

2) Relative Gradient of the Likelihood Contrast: At the core of the BSS contrast functions is φ_ML, which is associated with the likelihood given the source densities r_1, …, r_n. Its relative gradient is found in [61] to be

∇φ_ML[y] = E[H(y)]    (31)

where H(y) is

H(y) = ψ(y) y^T − I    (32)

with the entry-wise nonlinear function

ψ(y) = [ψ_1(y_1), …, ψ_n(y_n)]^T    (33)

collecting the score functions related to each source. This is a remarkably simple result; this relative gradient merely is the expected value of a function of y.

3) Interpretation: The ML contrast function is minimum at points where its (relative) gradient cancels, i.e., by (31), at those points which are solutions of the matrix equation E[H(y)] = 0, or equivalently E[ψ(y) y^T] = I. This is interpreted by examining the (i, j)th entry of this matrix equation. For i = j, we find E[ψ_i(y_i) y_i] = 1, which depends only on y_i and determines the scale of the ith source estimate. For i ≠ j, the (i, j)th entry reads E[ψ_i(y_i) y_j] = 0, meaning that the jth output should be uncorrelated to a nonlinear version of the ith output. Because ψ_i and ψ_j are nonlinear functions, the conditions for the pair (i, j) and the pair (j, i) are (in general) not equivalent. Note that if the source signals are modeled as zero-mean unit-variance normal variables, then ψ_i(y_i) = y_i for all i, and H(y) = y y^T − I = H_2(y) [recalling (5)]. Then φ_ML is minimum at points where E[y y^T] = I; we only obtain the whiteness condition. Again, this is not sufficient to determine a separating solution; score functions must be nonlinear (the source model must be non-Gaussian).

The idea of using nonlinear functions to obtain a sufficient set of independence conditions can be traced back to the seminal paper of Herault and Jutten [43] (see [45] for an English translation), but the choice of the nonlinear functions was somewhat ad hoc. Fety [38] gave an interpretation of the nonlinear functions as amplifiers for the signals of interest; Bar-Ness also produced early work using nonlinear functions [7]. However, the ML principle makes it clear that the nonlinear functions are related via (30) to a non-Gaussian model of the source distributions.

B. Estimating Functions

An estimating function for the BSS problem is a matrix-valued function H: R^n → R^{n×n}. It is associated with an estimating equation

(1/T) \sum_{t=1}^{T} H(y(t)) = 0    (34)


thus called because, being matrix-valued, (34) specifies a priori as many constraints as unknown parameters in the BSS problem. Many BSS estimates can be characterized via an estimating function [17], [2].

A simple instance of an estimating function is H_2(y) = y y^T − I, used in (5) to express the decorrelation between the entries of y. Equation (34) with H = H_2 is equivalent to (1/T) \sum_t y(t) y(t)^T = I; it expresses the empirical whiteness of a batch of T samples of y, as opposed to the actual whiteness, i.e., E[y y^T] = I. The estimating function H_2, however, is not appropriate for BSS, since whitening (or decorrelation) is not sufficient to determine a separating matrix.

The simplest example of an estimating function for BSS is obtained in the ML approach. The gradient of the likelihood (7) may be shown [61] to cancel at points which are characterized exactly by (34) with the estimating function H(y) = ψ(y) y^T − I, as defined in (32). In other words, maximum likelihood estimates correspond exactly to the solution of an estimating equation. This equation is nothing but the sample counterpart of E[H(y)] = 0, which characterizes the stationary points of φ_ML[y]. Recall that the latter is obtained [at (11) and (12)] as a limit of the log likelihood.

Because the value of an estimating function is a square matrix, it can be decomposed into a symmetric part (equal to its transpose) and a skew-symmetric part (opposite of its transpose). This decomposition is simply

H(y) = (H(y) + H(y)^T)/2 + (H(y) − H(y)^T)/2.    (35)

If the optimization of some regular contrast function corresponds to an estimating function H(y), it is found that the optimization of the same contrast under the whiteness constraint corresponds to an estimating function H°(y) given by

H°(y) = (H(y) − H(y)^T)/2 + H_2(y).    (36)

Thus, the symmetric part of H(y) is replaced by H_2(y) = y y^T − I [introduced at (5)], whose effect is to enforce the whiteness constraint. In particular, maximum likelihood estimates under the whiteness constraint are (again) solutions of (34) with the estimating function

H°(y) = (ψ(y) y^T − y ψ(y)^T)/2 + y y^T − I.    (37)

Other orthogonal contrast functions are associated to similar estimating functions. For instance, the simple fourth-order contrasts φ°_4 and φ°_m [(23) and (26), respectively] yield estimating equations in the form (37) with nonlinear functions given respectively by

ψ_i(y_i) = −k_i y_i^3   and   ψ_i(y_i) = y_i^3.    (38)

Recall that using the contrast function (26) assumes sources with negative kurtosis k_i. Thus, the two functions in (38) agree on the sign to be given to a cubic distortion (as was expected).

Some contrast functions, like φ°_ICA and φ°_JADE, when estimated from T samples, are minimized at points which cannot be represented exactly as the solution of (34) for a fixed estimating function. However, one can often find, as in [36], an asymptotic estimating function in the sense that the solution of the associated estimating equation is very close to the minimizer of the estimated contrast. For instance, the contrasts φ°_ICA and φ°_JADE are asymptotically associated to the same estimating function. This implies that minimizing φ°_ICA or φ°_JADE with cumulants estimated from T samples yields estimates which are equivalent (i.e., they differ by a term which is smaller than the estimation error) for large enough T.

Which functions are appropriate as estimating functions? One could think of using any H such that E[H(s)] = 0 as an estimating function, because the estimating equation (34) would just be the sample counterpart of E[H(y)] = 0 and would a priori provide as many scalar equations as unknown parameters. However, the ML principle suggests the specific forms (32) and (37), with the nonlinear functions in ψ being (approximations of) the score functions for the probability densities of the signals to be separated.

V. ADAPTIVE ALGORITHMS

A simple generic technique for optimizing an objective function is gradient descent. In most optimization problems, its simplicity comes at the expense of performance; more sophisticated techniques, such as “Newton-like” algorithms using second derivatives in addition to the gradient, can often significantly speed up convergence. For the BSS problem, however, it turns out that a simple gradient descent offers “Newton-like” performance (see below). This surprising and fortunate result is obtained by descending along the relative gradient defined in Section IV-A.

A. Relative Gradient Techniques

1) Relative Gradient Descent: We first describe a generic relative gradient descent. Generally, the steepest descent technique of minimization consists in moving by a small step in a direction opposite that of the gradient of the objective function. The relative gradient of a contrast is defined (Section IV-A) with respect to a relative variation of y, by which y is changed into (I + 𝓔) y. The resulting variation of φ[y] is (at first order) the scalar product ⟨𝓔 | ∇φ[y]⟩ between the relative variation 𝓔 and the relative gradient ∇φ[y], as in (27) or (28). Aligning the direction of change opposite to the gradient is to take 𝓔 = −μ ∇φ[y] for a small positive step μ. Thus, one step of a relative gradient descent can be formally described as

y ← (I − μ ∇φ[y]) y.    (39)

According to (28), the resulting variation of φ[y] is −μ ⟨∇φ[y] | ∇φ[y]⟩, which is negative for positive μ. The formal description (39) can be turned into off-line and on-line algorithms, as described next.

2) Off-Line Relative Gradient Descent: Consider the separation of a batch of T samples x(1), …, x(T) based on the minimization of a contrast function with relative


gradient ∇φ[y] = E[H(y)]. One looks for a linear transform of the data satisfying the corresponding estimating equation (1/T) \sum_t H(y(t)) = 0. The relative gradient descent to solve it is as follows. Set y(t) = x(t) for t = 1, …, T and iterate through the following two steps

Ĥ = (1/T) \sum_{t=1}^{T} H(y(t))    (40)

y(t) ← (I − μ Ĥ) y(t),   t = 1, …, T.    (41)

The first step computes an estimate Ĥ of the relative gradient for the current values of the data; the second step updates the data in the (relative) direction opposite to the relative gradient, as in (39). The algorithm stops when Ĥ ≈ 0, i.e., when the estimating equation is solved. It is amusing to note that this implementation does not need to maintain a separating matrix; rather, it directly operates on the data set itself, with the source signals emerging during the iterations.
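A compact NumPy sketch of the batch iteration (40)-(41), using the ML estimating function H(y) = ψ(y) yᵀ − I of (32) with a cubic nonlinearity, which (38) associates with negative-kurtosis sources; the step size, stopping rule, and nonlinearity choice are assumptions, and convergence is not guaranteed for arbitrary data.

    import numpy as np

    def batch_relative_gradient_bss(X, psi=lambda y: y ** 3, mu=0.1,
                                    max_iter=500, tol=1e-6):
        """Off-line relative gradient descent (40)-(41), acting directly on the data.

        X   : (n, T) observed mixtures
        psi : entrywise nonlinearity (model score function)
        Returns the transformed data Y, whose rows are the estimated sources.
        """
        Y = X.copy()
        n, T = Y.shape
        for _ in range(max_iter):
            H = (psi(Y) @ Y.T) / T - np.eye(n)     # step (40): sample estimate of E[H(y)]
            Y = (np.eye(n) - mu * H) @ Y           # step (41): relative update of the data
            if np.linalg.norm(H) < tol:            # stop when the estimating equation holds
                break
        return Y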

3) On-Line Relative Gradient Descent: On-line algorithms update a separating matrix B_t upon reception of a new sample x(t). The (relative) linear transform y → (I + 𝓔) y corresponds to changing B_t into (I + 𝓔) B_t. In the on-line mode, one uses the stochastic gradient technique, where the gradient E[H(y)] is replaced by its instantaneous value H(y(t)). Hence the stochastic relative gradient rule

B_{t+1} = (I − μ_t H(y(t))) B_t,   y(t) = B_t x(t)    (42)

where μ_t is a sequence of positive learning steps.
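The corresponding on-line sketch of rule (42), again with the estimating function of (32); the cubic nonlinearity and the constant learning step are assumptions.

    import numpy as np

    def online_relative_gradient_bss(X, psi=lambda y: y ** 3, mu=0.01):
        """Stochastic relative gradient rule (42): update B on each new sample x(t)."""
        n, T = X.shape
        B = np.eye(n)
        for t in range(T):
            y = B @ X[:, t]                              # current output y(t) = B_t x(t)
            H = np.outer(psi(y), y) - np.eye(n)          # instantaneous value of H(y)
            B = (np.eye(n) - mu * H) @ B                 # relative update of B
        return B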

4) Uniform Performance of Relative Gradient Descent: A striking feature of the BSS model is that the hardness (in a statistical sense, as discussed in Section VI-C) of separating mixed sources does not depend on the particular value of the mixing matrix A; the problem is uniformly hard in the mixture. Very significantly, the device of relative updating produces algorithms which also behave uniformly well in the mixture. Right-multiplying the updating rule (42) by matrix A and using y(t) = B_t x(t) = B_t A s(t), one readily finds that the trajectory of the global system C_t = B_t A, which combines mixing and unmixing matrices, is governed by

C_{t+1} = (I − μ_t H(C_t s(t))) C_t.    (43)

This trajectory is expressed here as a sole function of the global system C_t. The only effect of the mixing matrix itself is to determine (together with B_0) the initial value C_0 = B_0 A of the global system. This is a very desirable property, since it means that the algorithms can be studied and optimized without reference to the actual mixture to be inverted. This is true for any estimating function H. However, uniformly good performance can only be expected if H is adjusted correctly to the distribution of the source signals, e.g., by deriving it from a contrast function. Algorithms based on an estimating function in the form (32) are described in [5] for the on-line version and in [61] for an off-line version; those based on form (37) are studied in detail in [21]. The uniform performance property was also obtained in [24].

5) Regular Gradient Algorithms: It is interesting to compare the relative gradient algorithm to the algorithm obtained by a regular gradient, that is, by applying a gradient rule to the entries of B for the minimization of φ[B x]. By (29), this is

B_{t+1} = B_t − μ_t H(y(t)) B_t^{-T}.    (44)

Not only is this form more costly because it requires (in general) the inversion of B_t at each step, but it lacks the uniform performance property. The trajectory of the global system then depends on the particular mixture to be inverted.

B. Adapting to the Sources

The iterative and adaptive algorithms described above require the specification of an estimating function, for which two forms, (32) and (37), are suggested by the theory. These forms, in turn, depend on nonlinear functions ψ_1, …, ψ_n which, ideally, should be the score functions associated to the distributions of the sources (Section IV-B). When the source distributions are unknown, one may try to estimate them from the data (for instance, using some parametric model, as in [56]) or to directly estimate good nonlinear functions.

The first idea is to use Edgeworth expansions (e.g., see [51]), which provide approximations to probability densities in the vicinity of a Gaussian density. The simplest nontrivial Edgeworth approximation of a symmetric pdf r(s) in the vicinity of the standard normal density n(s) is

r(s) ≈ n(s) [1 + (k/24) (s^4 − 6 s^2 + 3)]

where k is the kurtosis of s. The corresponding approximate score function is

ψ(s) ≈ s − (k/6) (s^3 − 3 s).    (45)

Thus, the Edgeworth expansion suggests that, in a linear-cubic approximation of the score function, the coefficient of the cubic part should be proportional to minus the kurtosis k_i of the ith source. Asymptotic analysis shows that such a choice at least guarantees the local stability (Section VI-A). There are other possibilities for deriving score functions by a density expansion, e.g., see [68] for a different proposal involving odd and even terms in y.
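A sketch of this linear-cubic score, using the Edgeworth-based form written above (the exact constant follows from that reconstruction and should be treated as an assumption); in practice the kurtosis would be estimated from the current outputs.

    import numpy as np

    def edgeworth_score(y, kurtosis):
        """Linear-cubic score approximation, psi(y) ≈ y - (k/6) (y^3 - 3 y)."""
        return y - (kurtosis / 6.0) * (y ** 3 - 3.0 * y)

    # Example: estimate the kurtosis of an output and evaluate its approximate score.
    rng = np.random.default_rng(0)
    y = rng.uniform(-np.sqrt(3), np.sqrt(3), 10000)          # a negative-kurtosis output
    k_hat = np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2     # sample kurtosis, as in (21)
    psi_y = edgeworth_score(y, k_hat)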

A more direct approach than pdf expansion is proposed by Pham [61], who considers approximating the score ψ_i by a linear combination

ψ̂_i(y_i) = \sum_{k=1}^{K} w_k f_k(y_i)    (46)

of a fixed set f_1(·), …, f_K(·) of arbitrary basis functions. Rather surprisingly, the set of coefficients w_1, …, w_K minimizing the mean square error between the true score and its approximation can be found without knowing the true score. The best mean square approximation involves only the expectation operator. It is

w = (E[f(y_i) f(y_i)^T])^{-1} E[f'(y_i)]    (47)


Fig. 12. Top row: three distributions and the values of a non-Gaussianity measure (see Section VI-B). Bottom row: the score function (solid) and its linear-cubic approximations, based on Edgeworth expansion (dash-dots) and optimal (dashes).

where f(·) = [f_1(·), …, f_K(·)]^T is the column vector of basis functions and f'(·) is the column vector of their derivatives. This is a nice result because the expression of w can be estimated simply by replacing in (47) the expectations by sample averages and the values of y_i by the estimated source signals.
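A direct transcription of (47) for the basis {y, y^3}; this particular basis and the plug-in use of sample averages follow the text, while the variable names are assumptions.

    import numpy as np

    def optimal_score_coefficients(y):
        """Best mean-square linear-cubic fit of the score of y, via (47).

        Basis f(y) = [y, y^3], derivatives f'(y) = [1, 3 y^2].
        Returns w such that psi_hat(y) = w[0] * y + w[1] * y**3.
        """
        F = np.vstack([y, y ** 3])                     # basis functions on the data
        dF = np.vstack([np.ones_like(y), 3 * y ** 2])  # their derivatives
        Eff = (F @ F.T) / y.size                       # sample E[f(y) f(y)^T]
        Edf = dF.mean(axis=1)                          # sample E[f'(y)]
        return np.linalg.solve(Eff, Edf)

    # Sanity check: for a Gaussian y, the coefficients are close to (1, 0), i.e., psi(y) ≈ y.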

The two approaches of Edgeworth expansion and meansquare fit, respectively leading to the approximations (45)and (47), are compared in Fig. 12.

Three pdf's are displayed in the top row; the bottom row shows the corresponding score function (solid line), the linear–cubic approximation (45) (dash-dotted line), and the Pham approximation (dashed line) obtained from (47) with the basis $\{s, s^3\}$. Both approximations are similar in the first example, where the pdf is close to Gaussian; in the second case, the optimal approximation fits the true score much better in the area of highest probability. Neither approximation is good in the third example, for the simple reason that the true score there cannot be well approximated by a linear–cubic function. However, the two approximations fit the score well enough to guarantee the stability of the gradient algorithms (see Section VI-A).

VI. PERFORMANCE ISSUES

This section is concerned with the performance of BSS algorithms and presents some asymptotic analysis results. It has been stressed repeatedly that it is not necessary to know the source distributions (or, equivalently, the associated score functions) with great accuracy in order to obtain consistent BSS algorithms. There is, however, a limit to the misspecification of the source distributions, as illustrated by Fig. 7. This is elucidated in Section VI-A, which gives explicit stability limits. Even if a hypothesized distribution is good enough to preserve stability, one may expect a loss of estimation accuracy due to misspecification when a finite number of samples is available; this is quantified in Section VI-B, which also describes the ultimate achievable separation performance. Section VI-C discusses the general property of equivariance, which governs the performance of BSS algorithms.

Fig. 13. Stability domains in the $(\kappa_i, \kappa_j)$ plane.

A. Local Stability

A stationary (or equilibrium) point of the learning rule (42) is characterized by $\mathrm{E}\,H(y) = 0$, i.e., the mean value of the update is zero. We have seen that separating matrices (with the proper scale) are equilibrium points; we are now interested in finding when they are locally stable, that is, when a small deviation from the equilibrium is pulled back to the separating point. In other words, we want the separating matrix to be a (local) attractor for the learning rule (42). In the limit of small learning steps, there exists a simple criterion for testing local stability, which depends on the derivative of the mean update with respect to the separating matrix. For both the symmetric form and the asymmetric form, the stability conditions can be worked out exactly. They are found to depend only on the following nonlinear moments:

$\kappa_i = \mathrm{E}\{g_i'(s_i)\}\,\mathrm{E}\{s_i^2\} - \mathrm{E}\{g_i(s_i)\,s_i\}, \quad i = 1,\dots,N$    (48)

where each $s_i$ is rescaled according to the scale convention attached to the estimating function, that is, $\mathrm{E}\{s_i^2\} = 1$ for the symmetric form (37), or $\mathrm{E}\{g_i(s_i)\,s_i\} = 1$ for the asymmetric form (32).

Leaving aside the issue of stability with respect to scale, the stability conditions for the symmetric form (37) are [21]

$\kappa_i + \kappa_j > 0$   for $i \neq j$    (49)

and for the asymmetric form (32), the conditions are [3] that $1 + \kappa_i > 0$ for all $i$ and that

$(1 + \kappa_i)(1 + \kappa_j) > 1$   for $i \neq j$.    (50)

Therefore, stability appears to depend on pairwise conditions. The stability domains for a given pair of sources are displayed in Fig. 13 in the $(\kappa_i, \kappa_j)$ plane.

Note that the stability domain is larger for the symmetric form (37). This is a consequence of letting the second-order information (the whiteness constraint) do "half the job" (see Section II-B).

Some comments are in order. First, it appears that in both cases a sufficient stability condition is $\kappa_i > 0$ for all the sources. Thus, regarding stability, tuning the nonlinear functions $g_i$ to the source distributions should be understood as making the $\kappa_i$'s positive. Second, one can show



that if $s_i$ is Gaussian, then $\kappa_i = 0$ for any function $g_i$. Therefore, the stability conditions can never be met if there is more than one Gaussian source, in agreement with the identifiability statements of Section II. Third, it can also be shown that if $g_i$ is taken to be the score function for the true density of $s_i$, then $\kappa_i \ge 0$, with equality only if $s_i$ is Gaussian.

Section II-C illustrated the fact that the hypothesized source distributions should be close enough to the true distributions for the likelihood to still show a maximum around a separating point. The definition of the $\kappa_i$'s provides a quantitative measure of how wrong the hypothesis can be: the hypothesized distributions should not allow the $\kappa_i$'s to become negative.

We also note that it is not necessary for all the $\kappa_i$'s to be positive; if $\kappa_i < 0$ for one source at most, this can be compensated if the moments $\kappa_j$ are large enough for all $j \neq i$. As seen from the stability domains (Fig. 13), one source at most can have an arbitrarily negative $\kappa_i$ if the symmetric form is used, while the stability of the asymmetric form requires that $\kappa_i > -1$.

We have considered linear–cubic score functions in Sections IV and V. If $g_i(s) = a s + b s^3$ for two constants $a$ and $b$, then $\kappa_i$ has the sign of $-b\,k_i$, where, as above, $k_i$ denotes the kurtosis. Note that the linear part of $g_i$ does not affect the stability and that stability is guaranteed if the coefficient of the cubic part has a sign opposite to the sign of the kurtosis. Quite naturally, the functions in (38) and (45) come up with the right sign. Therefore, if one wishes to use cubic nonlinearities, it is sufficient to know the sign of the kurtosis of each source to make separating matrices stable. For score functions other than cubic, stability depends on the sign of $\kappa_i$, not on the sign of the kurtosis.
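The following sketch (ours) makes this stability check concrete. It assumes the moment form reconstructed in (48) and the pairwise conditions as read in (49) and (50); the cubic nonlinearity and the two source distributions are arbitrary choices.

```python
import numpy as np

def kappa(s, g, gprime):
    """Nonlinear moment of (48):  E{g'(s)} E{s^2} - E{g(s) s}  (assumed form)."""
    s = np.asarray(s, dtype=float)
    return np.mean(gprime(s)) * np.mean(s ** 2) - np.mean(g(s) * s)

def stable_pair(ki, kj, symmetric=True):
    """Pairwise stability conditions (49)/(50) as read here (assumed forms)."""
    if symmetric:                       # whiteness-constrained form (37)
        return ki + kj > 0
    return ki > -1 and kj > -1 and (1 + ki) * (1 + kj) > 1   # form (32)

g = lambda s: s ** 3                    # cubic nonlinearity
gp = lambda s: 3 * s ** 2

rng = np.random.default_rng(3)
# unit-variance sub-Gaussian (uniform) and super-Gaussian (Laplace) sources
s_uni = rng.uniform(-np.sqrt(3), np.sqrt(3), 100_000)
s_lap = rng.laplace(scale=1 / np.sqrt(2), size=100_000)

k_uni, k_lap = kappa(s_uni, g, gp), kappa(s_lap, g, gp)
print(f"kappa(uniform) = {k_uni:.2f}, kappa(laplace) = {k_lap:.2f}")
print("symmetric form stable for this pair:", stable_pair(k_uni, k_lap))
```

With a pure cubic, the uniform (negative-kurtosis) source yields a positive $\kappa$ while the Laplace (positive-kurtosis) source yields a negative one, so the pairwise condition fails, illustrating the sign rule stated above.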

B. Accuracy of Estimating Equations

This section characterizes the accuracy of the signal separation obtained by solving (34) with $T$ independent realizations of the observations.

If a matrix $B$ is used for separation, the $i$th entry of $y = Bx$ contains the signal of interest $s_i$ at power $(BA)_{ii}^2\,\mathrm{E}\{s_i^2\}$ and the $j$th interfering signal at power $(BA)_{ij}^2\,\mathrm{E}\{s_j^2\}$. Therefore, for a given matrix $B$, the quantity

$\mathrm{ISR}_{ij}(B) = \dfrac{(BA)_{ij}^2\,\mathrm{E}\{s_j^2\}}{(BA)_{ii}^2\,\mathrm{E}\{s_i^2\}}$    (51)

measures the interference-to-signal ratio (ISR) provided by $B$ in rejecting the $j$th source in the estimate of the $i$th source.
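For instance, under the unit-variance convention, the ISR matrix can be read off the global system $C = BA$, as in the sketch below (ours; the matrices are arbitrary examples).

```python
import numpy as np

def isr_matrix(B, A):
    """Interference-to-signal ratios of (51) for the global system C = B A:
    ISR_ij = C_ij^2 / C_ii^2, assuming unit-variance sources."""
    C = B @ A
    d = np.diag(C) ** 2
    return (C ** 2) / d[:, None] - np.eye(C.shape[0])   # zero the diagonal

A = np.array([[1.0, 0.6], [0.4, 1.0]])
B = np.linalg.inv(A) + 0.01 * np.ones((2, 2))           # slightly off separator
print(np.round(isr_matrix(B, A), 5))
```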

Let $\hat{B}_T$ denote the separating matrix obtained via a particular algorithm using $T$ samples. In general, the estimation error in regular statistical models decreases as $1/\sqrt{T}$, so that the limit

$\mathrm{ISR}_{ij} = \lim_{T \to \infty} T\, \mathrm{E}\{\mathrm{ISR}_{ij}(\hat{B}_T)\}$    (52)

usually exists and provides an asymptotic measure of the separation performance of a given off-line BSS technique. When the symmetric form (37) or the asymmetric form (32) is used in the estimating equation, the asymptotic ISR depends on the moments $\kappa_i$ in (48) and on a second set of nonlinear moments

(53)

For simplicity, we consider identically distributed signals and identical nonlinear functions, so that $\kappa_i = \kappa$ and the moments of (53) take a common value for all $i$. With a symmetric estimating function, one finds

(54)

Note that $\mathrm{ISR}^{\mathrm{sym}}$ is lower bounded by 1/4 regardless of the value of $\kappa$. This is a general property of orthogonal BSS techniques [15] and is the price to pay for blindly trusting second-order statistics, i.e., for whitening. Thus, rejection rates obtained under the whiteness constraint cannot be (asymptotically) better than $1/(4T)$.

For an asymmetric estimating function, the ISR does not take such a simple form unless the common score is obtained by Pham's method (Section V-B). One then finds, for $\mathrm{ISR}^{\mathrm{asym}}$ and $\mathrm{ISR}^{\mathrm{sym}}$,

(55)

where the last equation stems from (54), because Pham's method guarantees the corresponding moment identity. These expressions show that both $\mathrm{ISR}^{\mathrm{asym}}$ and $\mathrm{ISR}^{\mathrm{sym}}$ are minimized by maximizing $\kappa$; not surprisingly, $\kappa$ can be shown to reach its maximum value precisely when $g_i = \psi_i$, where $\psi_i$ is the score function corresponding to the true density of the sources

(56)

Note that the solution of (34) with $g_i = \psi_i$ then is the ML estimator based on the true model. It follows that the corresponding expression in (55) also is the asymptotic Cramér–Rao bound for source separation, i.e., the best ISR achievable with $T$ independent samples (see [58], [69], and [61]).

Since the achievable performance depends on the magnitude of $\kappa$, this moment characterizes the hardness of the BSS problem with respect to the source distribution. Not surprisingly, we can relate it to the non-Gaussianity of the sources as follows. As above, we denote by $\psi_i$ the score function for the (true) distribution of $s_i$, and we denote by $\psi_i^G$ the score function for the Gaussian distribution with the same variance as $s_i$ (this is just $\psi_i^G(s) = s/\mathrm{E}\{s_i^2\}$). A "large" non-Gaussianity translates into a large difference between $\psi_i$ and $\psi_i^G$. As we just saw, from the asymptotic point of view the relevant measure of non-Gaussianity is $\kappa_i$; indeed, with $g_i = \psi_i$ one finds

$\kappa_i = \mathrm{E}\{(\psi_i(s_i) - \psi_i^G(s_i))^2\}\,\mathrm{E}\{s_i^2\}$.    (57)

See Fig. 12 for the values of $\kappa$ in three examples. For close-to-Gaussian sources, $\kappa$ is (arbitrarily) small. In this case, according to (55), the best achievable rejection rates



are essentially the same for both the symmetric and the asymmetric forms. This gives an idea of the minimum number of samples required to achieve a given separation quality. The other extreme is for sources which are far away from normality: the moment $\kappa$ is not bounded above. In particular, it tends to infinity when the source distributions tend to have a discrete or a bounded support. In the case of discrete sources, deterministic (error-free) blind identification is possible with a finite number of samples. In the case of sources with bounded support, the MSE of blind identification decreases at a much faster rate than the $1/T$ rate obtained for finite values of $\kappa$ (see in particular [41]).
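As a concrete illustration of this hardness measure (ours), the sketch below estimates $\kappa$ for a unit-variance Laplace source using its true score $\psi(s) = \sqrt{2}\,\mathrm{sign}(s)$ and the non-Gaussianity form assumed in (57); under that assumed definition the value is 1.

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.laplace(scale=1 / np.sqrt(2), size=200_000)   # unit-variance Laplace source
psi = np.sqrt(2) * np.sign(s)                         # score of the Laplace density
psi_gauss = s                                         # score of N(0, 1)

# non-Gaussianity / hardness moment, using the form assumed in (57)
kappa_max = np.mean((psi - psi_gauss) ** 2) * np.mean(s ** 2)
print(f"estimated kappa for a Laplace source: {kappa_max:.3f}  (about 1 under these assumptions)")
```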

C. Equivariance and Uniform Performance

At first thought, the hardness of the BSS problem seems to depend on the distributions of the source signals and on the mixing matrix, with harder problems occurring when sources are nearly Gaussian and when the mixing matrix is poorly conditioned. This is not correct, however: the BSS problem is uniformly hard in the mixing matrix. Let us summarize the instances where this property appeared: the ultimate separation performance depends only on $\kappa$ [(55)]; the asymptotic performance indices in (54) and (55) depend only on some statistical moments; the stability of the adaptive algorithms (42) also depends only on the values of the $\kappa_i$'s; even better, the trajectory (43) of the global system does not depend on the mixing matrix $A$, whose sole effect is to determine the initial point.

Therefore, not only does the problem appear to be uniformly hard in the mixing matrix, but there exist estimation techniques with a statistical behavior (regarding signal separation) which is independent of the particular value of the system to be inverted. This is a very desirable property: such algorithms can be studied and tuned independently of the particular mixture to be inverted, and their performance can also be predicted independently of the mixture [16]. This is an instance of "equivariance," a property holding more generally in transformation models.

There is a simple prescription for designing algorithms with uniform performance: adjust the separating matrix freely (i.e., without constraint) according to a rule expressed only in terms of the output $y$. To understand why this "output only" prescription ensures uniform performance, consider using a particular estimating function $H$ to separate a mixture of $T$ samples $x_1,\dots,x_T$. If the source signals are mixed by a given matrix $A$, then a solution of (34) is a matrix $\hat{B} = \hat{C} A^{-1}$, where the matrix $\hat{C}$ solves the same estimating equation written directly on the source samples. The matrix $\hat{C}$ does not depend on $A$, so that the global system $\hat{B} A = \hat{C}$ is itself independent of $A$, and the estimated signals $\hat{y}_t = \hat{C} s_t$ are obtained regardless of $A$. In particular, the recovered signals are exactly identical to those that would be obtained with $A = I$, i.e., when there is no mixing at all. This argument, based on estimating equations, extends to the minimizers of contrast functions, since the latter are defined as functions of the distribution of the output (the argument also applies to orthogonal contrast functions because the whiteness constraint is expressed only in terms of $y$). The argument also justifies the specific definition of the relative gradient: a device was needed to express the first-order variations of a contrast function in terms of a variation of the output itself, i.e., without reference to the mixing matrix. Finally, it must be stressed that the argument does not involve asymptotics; rather, equivariance is observed exactly for any finite value of $T$.
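This exactness can be checked numerically. The sketch below (ours) runs the same relative-gradient rule on the same source samples mixed by two different matrices, starting from the same initial global system, and verifies that the trajectories of $C = BA$ coincide up to rounding; the estimating function $H(y) = y^3 y^T - I$ is an arbitrary example.

```python
import numpy as np

def H(y):
    """Illustrative asymmetric estimating function H(y) = g(y) y^T - I, g(y)=y^3."""
    return np.outer(y ** 3, y) - np.eye(y.size)

def run(B, A, sources, mu=0.01):
    """Relative-gradient adaptation on observations x_t = A s_t;
    returns the final global system C = B A."""
    for s in sources:
        y = B @ (A @ s)
        B = (np.eye(2) - mu * H(y)) @ B
    return B @ A

rng = np.random.default_rng(5)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2000, 2))   # same source samples
A1 = np.array([[1.0, 0.7], [0.3, 1.0]])
A2 = np.array([[2.0, -0.5], [0.8, 1.5]])
C0 = np.eye(2)                                             # same initial global system

C1 = run(C0 @ np.linalg.inv(A1), A1, S)
C2 = run(C0 @ np.linalg.inv(A2), A2, S)
print("max |C1 - C2| =", np.abs(C1 - C2).max())            # ~ machine precision
```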

Not all BSS algorithms are equivariant. For instance, Jutten and Hérault's original algorithm imposes constraints on the separating matrix, resulting in a greatly complicated analysis (and behavior) (see [34], [41], and [48]). Other instances of nonequivariant techniques are to be found in most of the algebraic approaches (see Section VII) based on the structure of the cumulants of the observed vector $x$. Precisely because the identification is based on $x$ and not on $y$, such approaches are not generally equivariant, unless they can be shown to be equivalent to the optimization of a contrast function of $y$.

A word of caution is necessary before concluding: equivariance holds exactly in the noise-free model which we have considered so far. In practice, there is always some kind of noise which must be taken into account. Assume that a better model is $x = As + n$, where $n$ represents an additive noise. This can be rewritten as $x = A(s + A^{-1} n)$. As long as $A^{-1} n$ can be neglected with respect to $s$, this is a noise-free situation. This shows the limit of equivariance: a poorly conditioned matrix $A$ has a large inverse, which amplifies the effect of the noise. More precisely, we can expect equivariance in the high-SNR domain, i.e., when the covariance matrix of $As$ remains "larger" than the covariance matrix of $n$.

VII. CONCLUSIONS

Due to limited space, the focus of this paper is on principles, and many interesting issues have been left out, such as a discussion of the connections between BSS and blind deconvolution, convergence rates of adaptive algorithms, design of consistent estimators based on noisy observations, detection of the number of sources, etc. Before concluding, we briefly mention some other points.

A. Algebraic Approaches

The fourth-order cumulants of the observations have a very regular structure in the BSS model:

$\mathrm{Cum}(x_i, x_j, x_k, x_l) = \sum_{u=1}^{N} k_u\, A_{iu} A_{ju} A_{ku} A_{lu}$    (58)

where $k_u$ denotes the fourth-order cumulant (kurtosis) of the $u$th source. Given sample estimates of the cumulants, the equation set (58) (or some subset of it) can be solved in $A$ in a least squares sense. This is a cumulant matching approach [70], [42], which does not yield equivariant estimates. Optimal matching, though, can be shown to correspond to a contrast function [21]. However, the specific form of (58) also calls for algebraic approaches. Simple algorithms can be based on the eigenstructure of cumulant matrices built from these cumulants [62], [64]. An exciting direction of



research is to investigate higher order decompositions that would generalize matrix factorizations like the SVD or EVD to fourth-order cumulants [32], [29], [14], [26], [20].
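To make the structure of (58) concrete, the sketch below (ours, assuming zero-mean independent sources) compares an empirical fourth-order cross-cumulant of $x = As$ with the prediction $\sum_u k_u A_{iu} A_{ju} A_{ku} A_{lu}$.

```python
import numpy as np

def cum4(x, i, j, k, l):
    """Sample fourth-order cumulant of zero-mean data x (rows = samples)."""
    xi, xj, xk, xl = x[:, i], x[:, j], x[:, k], x[:, l]
    m = lambda a, b: np.mean(a * b)
    return (np.mean(xi * xj * xk * xl)
            - m(xi, xj) * m(xk, xl) - m(xi, xk) * m(xj, xl) - m(xi, xl) * m(xj, xk))

rng = np.random.default_rng(6)
T = 200_000
S = np.column_stack([rng.uniform(-np.sqrt(3), np.sqrt(3), T),      # kurtosis about -1.2
                     rng.laplace(scale=1 / np.sqrt(2), size=T)])   # kurtosis about  3
S -= S.mean(axis=0)
A = np.array([[1.0, 0.5], [-0.3, 1.0]])
X = S @ A.T                                                        # x_t = A s_t

kurt = np.mean(S ** 4, axis=0) - 3.0                               # sample source kurtoses
i, j, k, l = 0, 1, 0, 1
predicted = np.sum(kurt * A[i] * A[j] * A[k] * A[l])               # structure of (58)
print(f"empirical {cum4(X, i, j, k, l):.4f}  vs  predicted {predicted:.4f}")
```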

B. Using Temporal Correlation

The approaches to BSS described above exploit only properties of the marginal distribution of the observations. If the source signals are temporally correlated, time structures can also be exploited. It is possible to achieve separation if all the source signals have distinct spectra, even if each source signal is a Gaussian process [66]. Simple algebraic techniques can be devised (see [65] and [10]); the Whittle approximation of the likelihood is investigated in [60]. Cyclostationarity properties, when they exist, can also be exploited [12].
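A minimal AMUSE-like sketch (ours; cf. [65]): whiten the observations, then eigendecompose a symmetrized time-lagged covariance of the whitened data. It succeeds when the sources have distinct lag-1 autocorrelations, even for Gaussian sources.

```python
import numpy as np

def amuse(X, lag=1):
    """AMUSE-style separation sketch: whiten, then eigendecompose a
    symmetrized lagged covariance of the whitened data (cf. [65])."""
    X = X - X.mean(axis=0)
    R0 = X.T @ X / len(X)
    d, E = np.linalg.eigh(R0)
    W = E @ np.diag(d ** -0.5) @ E.T                 # whitening matrix
    Z = X @ W.T
    R1 = Z[:-lag].T @ Z[lag:] / (len(Z) - lag)
    _, U = np.linalg.eigh((R1 + R1.T) / 2)           # symmetrized lag covariance
    return U.T @ W                                   # separating matrix estimate

rng = np.random.default_rng(7)
T = 20_000
# two Gaussian AR(1) sources with different spectra
s1, s2 = np.zeros(T), np.zeros(T)
e = rng.standard_normal((T, 2))
for t in range(1, T):
    s1[t] = 0.9 * s1[t - 1] + e[t, 0]
    s2[t] = -0.5 * s2[t - 1] + e[t, 1]
S = np.column_stack([s1, s2])
A = np.array([[1.0, 0.8], [0.4, 1.0]])
B = amuse(S @ A.T)
print("global system B A (should be close to a scaled permutation):")
print(np.round(B @ A, 2))
```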

C. Deterministic Identification

As indicated in Section VI-B, sources with discrete support allow for deterministic identification (infinite Fisher information). Specific contrast functions can be devised [41] to take advantage of discreteness. There is a rich domain of application with digital communication signals, which code information with discrete symbols and for which deterministic identification is possible. See the review by van der Veen [67] and the papers on CMA in this issue.

D. Open Problems and Perspectives

1) Learning Source Distributions: In the BSS problem, the source distributions are a nuisance parameter. For a large enough sample size, it is possible to estimate the distributions and still obtain the same asymptotic performance as if the distributions were known in advance [2]; the design of practical algorithms achieving such source adaptivity is still an open question.

2) Dealing with Noise: BSS techniques remaining consistent in the presence of additive noise have not been described here. For additive Gaussian noise, such techniques may resort to higher order cumulants or to noise modeling. It is not clear, however, that it is worth combating the noise. As a matter of fact, one may argue that taking noise effects into account is unnecessary at high SNR and futile at low SNR (because the BSS problem becomes too difficult anyway). Therefore, we believe it is still an open question to determine which application domains would really benefit from noise modeling.

3) Global Convergence: Some cumulant-based contrast functions can be proven to be free of spurious local minima in the two-source case (e.g., see [28]) or in a deflation approach (successive extraction of the source signals) [33], [34]. There is, however, a lack of general understanding of the global shape of contrast functions in the general case.

4) Multidimensional Independent Components: An interesting and original variation of the basic independent component analysis (ICA) model would be to decompose a random vector into a sum of independent components, with the requirement that the components be linearly independent but not necessarily one-dimensional. In the BSS model, this would be equivalent to grouping the source signals in subsets, with independence between the subsets but not within the subsets. This more general decomposition could be called multidimensional independent component analysis (MICA).

5) Convolutive Mixtures: The most challenging open problem in BSS probably is the extension to convolutive mixtures. This is a very active area of research, mainly motivated by applications in the audio-frequency domain, where BSS is often termed the "cocktail-party problem." The convolutive problem is significantly harder than the instantaneous problem: even input/output (i.e., nonblind) identification is very challenging because of the large number of parameters usually necessary to describe audio channels.

6) When the Model Does Not Hold: The introduction mentioned successful applications of BSS to biomedical signals. When examining these data, it is very striking to realize that the extracted source signals seem to be very far from obeying the simple BSS model. The fact that BSS still yields apparently meaningful (to the experts) results is worthy of consideration. A partial explanation stems from basing separation on contrast functions: even if the model does not hold (there are no independent source signals and no system to be inverted), the algorithms still try to produce outputs which are as independent as possible. This does not tell the whole story, though, because for many data sets a stochastic description does not seem appropriate. We believe it will be a very interesting challenge to understand the behavior of BSS algorithms when applied "outside of the model."

ACKNOWLEDGMENT

The author is indebted to the anonymous reviewers whose constructive comments helped in improving the first version of this paper.

REFERENCES

[1] S.-I. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, pp. 251–276, 1998.

[2] S.-I. Amari and J.-F. Cardoso, "Blind source separation—Semiparametric statistical approach," IEEE Trans. Signal Processing, vol. 45, pp. 2692–2700, Nov. 1997.

[3] S.-I. Amari, T.-P. Chen, and A. Cichocki, "Stability analysis of adaptive blind source separation," Neural Networks, vol. 10, no. 8, pp. 1345–1351, 1997.

[4] S.-I. Amari, A. Cichocki, and H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, vol. 8. Denver, CO: MIT Press, 1996, pp. 757–763.

[5] K. Anand, G. Mathew, and V. Reddy, "Blind separation of multiple co-channel BPSK signals arriving at an antenna array," IEEE Signal Processing Lett., vol. 2, pp. 176–178, Sept. 1995.

[6] A. Back and A. Weigend, "A first application of independent component analysis to extracting structure from stock returns," Int. J. Neural Syst., vol. 8, no. 4, Aug. 1997.

[7] Y. Bar-Ness, "Bootstrapping adaptive interference cancelers: Some practical limitations," in Proc. Globecom, Nov. 1982, pp. 1251–1255.

[8] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1004–1034, 1995.

[9] ———, "Edges are the 'independent components' of natural scenes," in Advances in Neural Information Processing Systems, vol. 9. Denver, CO: MIT Press, 1996.



[10] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique based on second-order statistics," IEEE Trans. Signal Processing, vol. 45, pp. 434–444, Feb. 1997.

[11] A. Benveniste, M. Goursat, and G. Ruget, "Robust identification of a nonminimum phase system, blind adjustment of a linear equalizer in data communication," IEEE Trans. Automat. Contr., vol. 25, pp. 385–399, Mar. 1980.

[12] B. G. Agee, S. V. Schell, and W. A. Gardner, "Spectral self-coherence restoral: A new approach to blind adaptive signal extraction using antenna arrays," Proc. IEEE, vol. 78, pp. 753–766, Apr. 1990.

[13] X.-R. Cao and R.-W. Liu, "General approach to blind source separation," IEEE Trans. Signal Processing, vol. 44, pp. 562–571, Mar. 1996.

[14] J.-F. Cardoso, "Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors," in Proc. ICASSP, 1991, pp. 3109–3112.

[15] ———, "On the performance of orthogonal source separation algorithms," in Proc. EUSIPCO, Edinburgh, Sept. 1994, pp. 776–779.

[16] ———, "The equivariant approach to source separation," in Proc. NOLTA, 1995, pp. 55–60.

[17] ———, "Estimating equations for source separation," in Proc. ICASSP'97, pp. 3449–3452.

[18] ———, "Infomax and maximum likelihood for source separation," IEEE Signal Processing Lett., vol. 4, pp. 112–114, Apr. 1997.

[19] J.-F. Cardoso, S. Bose, and B. Friedlander, "On optimal source separation based on second and fourth order cumulants," in Proc. IEEE Workshop on SSAP, Corfu, Greece, 1996.

[20] J.-F. Cardoso and P. Comon, "Tensor based independent component analysis," in Proc. EUSIPCO, 1990, pp. 673–676.

[21] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation," IEEE Trans. Signal Processing, vol. 44, pp. 3017–3030, Dec. 1996.

[22] J.-F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," Proc. Inst. Elect. Eng., vol. 140, pt. F, no. 6, pp. 362–370, Dec. 1993.

[23] E. Chaumette, P. Comon, and D. Muller, "ICA-based technique for radiating sources estimation: Application to airport surveillance," Proc. Inst. Elect. Eng., vol. 140, pt. F, no. 6, pp. 395–401, Dec. 1993.

[24] A. Cichocki, R. Unbehauen, and E. Rummert, "Robust learning algorithm for blind separation of signals," Electron. Lett., vol. 30, no. 17, pp. 1386–1387, 1994.

[25] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, no. 3, pp. 287–314, Apr. 1994.

[26] P. Comon and B. Mourrain, "Decomposition of quantics in sums of powers of linear forms," Signal Processing, vol. 53, no. 2, pp. 93–107, Sept. 1996.

[27] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[28] A. Dapena, L. Castedo, and C. Escudero, "An unconstrained single stage criterion for blind source separation," in Proc. ICASSP, vol. 5, 1996, pp. 2706–2709.

[29] L. De Lathauwer, B. De Moor, and J. Vandewalle, "Blind source separation by higher order singular value decomposition," in Proc. EUSIPCO, vol. 1, 1994, pp. 175–178.

[30] ———, "Fetal electrocardiogram extraction by source subspace separation," in Proc. HOS'95, Aiguablava, Spain, June 1995, pp. 134–138.

[31] ———, "Blind source separation by simultaneous third-order tensor diagonalization," in Proc. EUSIPCO, Trieste, Italy, 1996, pp. 2089–2092.

[32] ———, "Independent component analysis based on higher-order statistics only," in Proc. IEEE SSAP Workshop, Corfu, Greece, 1996, pp. 356–359.

[33] N. Delfosse and P. Loubaton, "Adaptive blind separation of independent sources: A deflation approach," Signal Processing, vol. 45, no. 1, pp. 59–83, 1995.

[34] Y. Deville, "A unified stability analysis of the Hérault–Jutten source separation neural network," Signal Processing, vol. 51, no. 3, pp. 229–233, June 1996.

[35] Y. Deville and L. Andry, "Application of blind source separation techniques to multi-tag contactless identification system," IEICE Trans. Fundamentals Electron., Commun., Comput. Sci., vol. E79-A, no. 10, pp. 1694–1699, 1996.

[36] D. Donoho, "On minimum entropy deconvolution," in Applied Time-Series Analysis II. New York: Academic, 1981, pp. 565–609.

[37] G. d'Urso, P. Prieur, and C. Vincent, "Blind identification methods applied to EDF civil works and power plants monitoring," in Proc. HOS'97, Banff, Canada, June 1997.

[38] L. Fety, "Méthodes de traitement d'antenne adaptées aux radiocommunications," Ph.D. dissertation, Telecom Paris, Paris, France, June 1988.

[39] J.-C. Fort, "Stability of the source separation algorithm of Jutten and Hérault," in Artificial Neural Networks, T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, Eds. Amsterdam, The Netherlands: Elsevier, 1991, pp. 937–941.

[40] M. Gaeta and J.-L. Lacoume, "Source separation without a priori knowledge: The maximum likelihood solution," in Proc. EUSIPCO, 1990, pp. 621–624.

[41] F. Gamboa and E. Gassiat, "Source separation when the input sources are discrete or have constant modulus," IEEE Trans. Signal Processing, vol. 45, pp. 3062–3072, Dec. 1997.

[42] G. Giannakis and S. Shamsunder, "Modeling of non-Gaussian array data using cumulants: DOA estimation of more sources with less sensors," Signal Processing, vol. 30, no. 3, pp. 279–297, July 1993.

[43] J. Hérault, C. Jutten, and B. Ans, "Détection de grandeurs primitives dans un message composite par une architecture de calcul neuromimétique en apprentissage non supervisé," in Proc. GRETSI, Nice, France, 1985, pp. 1017–1020.

[44] A. Hyvarinen and E. Oja, "A fast fixed-point algorithm for independent component analysis," Neural Computation, vol. 9, no. 7, pp. 1483–1492, 1997.

[45] C. Jutten and J. Herault, "Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, no. 1, pp. 1–10, July 1991.

[46] J. Karhunen, A. Hyvarinen, R. Vigario, J. Hurri, and E. Oja, "Applications of neural blind separation to signal and image processing," in Proc. ICASSP, vol. 1, 1997, pp. 131–134.

[47] J. Karhunen, E. Oja, L. Wang, R. Vigário, and J. Joutsensalo, "A class of neural networks for independent component analysis," IEEE Trans. Neural Networks, vol. 8, pp. 486–504, May 1997.

[48] O. Macchi and E. Moreau, "Self-adaptive source separation, Part I: Convergence analysis of a direct linear network controlled by the Hérault–Jutten algorithm," IEEE Trans. Signal Processing, vol. 45, pp. 918–926, Apr. 1997.

[49] D. J. C. MacKay, submitted for publication.

[50] S. Makeig, A. Bell, T.-P. Jung, and T. J. Sejnowski, "Independent component analysis of electroencephalographic data," in Advances in Neural Information Processing Systems, vol. 8. Denver, CO: MIT Press, 1995.

[51] P. McCullagh, Tensor Methods in Statistics (Monographs on Statistics and Applied Probability). London, U.K.: Chapman & Hall, 1987.

[52] J. Moody and L. Wu, "What is the 'true price'?—State space models for high frequency financial data," in Progress in Neural Information Processing. Proceedings of the International Conference on Neural Information Processing, vol. 2. New York: Springer-Verlag, 1996, pp. 697–704.

[53] E. Moreau and O. Macchi, "High order contrasts for self-adaptive source separation," Int. J. Adaptive Contr. Signal Processing, vol. 10, no. 1, pp. 19–46, Jan. 1996.

[54] J.-P. Nadal and N. Parga, "Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer," Network, vol. 5, pp. 565–581, 1994.

[55] D. Obradovic and G. Deco, "An information theory based learning paradigm for linear feature extraction," Neurocomputing, vol. 12, pp. 203–221, 1996.

[56] B. A. Pearlmutter and L. C. Parra, "A context-sensitive generalization of ICA," presented at Int. Conf. on Neural Information Processing, Hong Kong, 1996.

[57] J. Pfanzagl, "Asymptotic expansions related to minimum contrast estimators," Ann. Statistics, vol. 1, no. 6, pp. 993–1026, 1973.

[58] D.-T. Pham, "Blind separation of instantaneous mixture of sources via an independent component analysis," LMC IMAG, Grenoble, France, Res. Rep. RT 119, 1994.

[59] ———, "Blind separation of instantaneous mixture of sources via an independent component analysis," IEEE Trans. Signal Processing, vol. 44, pp. 2768–2779, Nov. 1996.



[60] D.-T. Pham and P. Garat, "Séparation aveugle de sources temporellement corrélées," in Proc. GRETSI, 1993, pp. 317–320.

[61] ———, "Blind separation of mixture of independent sources through a quasimaximum likelihood approach," IEEE Trans. Signal Processing, vol. 45, pp. 1712–1725, July 1997.

[62] V. C. Soon, L. Tong, Y. F. Huang, and R. Liu, "An extended fourth order blind identification algorithm in spatially correlated noise," in Proc. ICASSP, 1990, pp. 1365–1368.

[63] A. Swindlehurst, M. Goris, and B. Ottersten, "Some experiments with array data collected in actual urban and suburban environments," in IEEE Workshop on Signal Processing Advances in Wireless Communications, Apr. 1997, pp. 301–304.

[64] L. Tong, Y. Inouye, and R. Liu, "Waveform preserving blind estimation of multiple independent sources," IEEE Trans. Signal Processing, vol. 41, pp. 2461–2470, July 1993.

[65] L. Tong, V. Soon, Y. Huang, and R. Liu, "AMUSE: A new blind identification algorithm," in Proc. ISCAS, 1990.

[66] ———, "A necessary and sufficient condition for the blind identification of memoryless systems," in Proc. ISCAS, Singapore, vol. 1, 1991, pp. 1–4.

[67] A.-J. van der Veen, "Algebraic methods for deterministic blind beamforming," this issue, pp. 1987–2008.

[68] H. H. Yang and S. Amari, "Adaptive on-line learning algorithms for blind separation—Maximum entropy and minimum mutual information," Neural Computation, vol. 9, no. 7, pp. 1457–1482, Oct. 1997.

[69] D. Yellin and B. Friedlander, "Multi-channel system identification and deconvolution: Performance bounds," in Proc. IEEE SSAP Workshop, Corfu, Greece, 1996, pp. 582–585.

[70] N. Yuen and B. Friedlander, "Asymptotic performance analysis of blind signal copy using fourth-order cumulants," Int. J. Adaptive Contr. Signal Processing, vol. ?, Mar. 1996, pp. 239–265.

Jean-François Cardoso (Member, IEEE) received the Ph.D. degree in physics from the École Normale Supérieure de Saint-Cloud, France, in 1984.

He currently is with the C.N.R.S. (Centre National de la Recherche Scientifique), Paris, France, where he works in the Signal and Images department of Telecom Paris (E.N.S.T.). His research interests are in statistical signal processing with an emphasis on (blind) array processing, performance analysis, and connections to information theory.
