


LETTER   Communicated by Anthony Bell

High-Order Contrasts for Independent Component Analysis

Jean-François Cardoso
École Nationale Supérieure des Télécommunications, 75634 Paris Cedex 13, France

Neural Computation 11, 157–192 (1999)

This article considers high-order measures of independence for the independent component analysis problem and discusses the class of Jacobi algorithms for their optimization. Several implementations are discussed. We compare the proposed approaches with gradient-based techniques from the algorithmic point of view and also on a set of biomedical data.

1 Introduction

Given an n × 1 random vector X, independent component analysis (ICA) consists of finding a basis of R^n on which the coefficients of X are as independent as possible (in some appropriate sense). The change of basis can be represented by an n × n matrix B and the new coefficients given by the entries of vector Y = BX. When the observation vector X is modeled as a linear superposition of source signals, matrix B is understood as a separating matrix, and vector Y = BX is a vector of source signals. Two key issues of ICA are the definition of a measure of independence and the design of algorithms to find the change of basis (or separating matrix) B optimizing this measure.

Many recent contributions to the ICA problem in the neural network literature describe stochastic gradient algorithms involving, as an essential device in their learning rule, a nonlinear activation function. Other ideas for ICA, most of them found in the signal processing literature, exploit the algebraic structure of high-order moments of the observations. They are often regarded as being unreliable, inaccurate, slowly convergent, and utterly sensitive to outliers. As a matter of fact, it is fairly easy to devise an ICA method displaying all these flaws and working only on carefully generated synthetic data sets. This may be the reason that cumulant-based algebraic methods are largely ignored by the researchers of the neural network community involved in ICA. This article tries to correct this view by showing how high-order correlations can be efficiently exploited to reveal independent components.

This article describes several ICA algorithms that may be called Jacobi algorithms because they seek to maximize measures of independence by a technique akin to the Jacobi method of diagonalization. These measures of independence are based on fourth-order correlations between the entries of Y. As a benefit, these algorithms evade the curse of gradient descent:


they can move in macroscopic steps through the parameter space. They also have other benefits and drawbacks, which are discussed in the article and summarized in a final section. Before outlining the content of this article, we briefly review some gradient-based ICA methods and the notion of contrast function.

1.1 Gradient Techniques for ICA. Many online solutions for ICA that have been proposed recently have the merit of a simple implementation. Among these adaptive procedures, a specific class can be singled out: algorithms based on a multiplicative update of an estimate B(t) of B. These algorithms update a separating matrix B(t) on reception of a new sample x(t) according to the learning rule

y(t) = B(t) x(t),   B(t+1) = (I − μ_t H(y(t))) B(t),   (1.1)

where I denotes the n × n identity matrix, {μ_t} is a scalar sequence of positive learning steps, and H: R^n → R^{n×n} is a vector-to-matrix function. The stationary points of such algorithms are characterized by the condition that the update has zero mean, that is, by the condition,

EH(Y) = 0. (1.2)

The online scheme, in equation 1.1, can be (and often is) implemented in an off-line manner. Using T samples X(1), . . . , X(T), one goes through the following iterations, where the field H is averaged over all the data points:

1. Initialization. Set y(t) = x(t) for t = 1, . . . , T.

2. Estimate the average field: H̄ = (1/T) Σ_{t=1}^T H(y(t)).

3. Update. If H̄ is small enough, stop; else update each data point y(t) by y(t) ← (I − μH̄) y(t) and go to 2.

The algorithm stops for an (arbitrarily) small value of the average field: it solves the estimating equation,

(1/T) Σ_{t=1}^T H(y(t)) = 0,   (1.3)

which is the sample counterpart of the stationarity condition in equation 1.2.

Both the online and off-line schemes are gradient algorithms: the mapping H(·) can be obtained as the gradient (the relative gradient [Cardoso & Laheld, 1996] or Amari's natural gradient [1996]) of some contrast function, that is, a real-valued measure of how far the distribution of Y is from some ideal distribution, typically a distribution of independent components. In


particular, the gradient of the infomax/maximum likelihood (ML) contrast yields a function H(·) of the form

H(y) = ψ(y)y† − I, (1.4)

where ψ(y) is an n × 1 vector of component-wise nonlinear functions with ψ_i(·) taken to be minus the log derivative of the density of the ith component (see Amari, Cichocki, & Yang, 1996, for the online version and Pham & Garat, 1997, for a batch technique).
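As an illustration, the off-line iteration of equations 1.1 through 1.4 can be sketched in a few lines of NumPy. The nonlinearity ψ(y) = tanh(y) used below is an arbitrary stand-in for minus the log derivative of a hypothesized (super-gaussian) component density; it is not prescribed by the text.

```python
import numpy as np

def relative_gradient_ica(x, mu=0.1, tol=1e-6, max_iter=500):
    """Off-line relative-gradient ICA iteration (equations 1.1-1.3).

    x : (n, T) array, one column per sample x(t).
    psi(y) = tanh(y) is an illustrative choice of nonlinearity.
    """
    y = x.copy()
    n, T = y.shape
    for _ in range(max_iter):
        psi = np.tanh(y)
        # average field: H_bar = (1/T) sum_t [ psi(y(t)) y(t)^T - I ]   (equation 1.4)
        H_bar = psi @ y.T / T - np.eye(n)
        if np.abs(H_bar).max() < tol:        # estimating equation 1.3 approximately solved
            break
        y = (np.eye(n) - mu * H_bar) @ y     # multiplicative update of the data, as in equation 1.1
    return y
```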

1.2 The Orthogonal Approach to ICA. In the search for independent components, one may decide, as in principal component analysis (PCA), to request exact decorrelation (second-order independence) of the components: matrix B should be such that Y = BX is "spatially white," that is, its covariance matrix is the identity matrix. The algorithms described in this article take this design option, which we call the orthogonal approach.

It must be stressed that components that are as independent as possible according to some measure of independence are not necessarily uncorrelated because exact independence cannot be achieved in most practical applications. Thus, if decorrelation is desired, it must be enforced explicitly; the algorithms described below optimize, under the whiteness constraint, approximations of the mutual information and of other contrast functions (possibly designed to take advantage of the whiteness constraint).

One practical reason for considering the orthogonal approach is that off-line contrast optimization may be simplified by a two-step procedure as follows. First, a whitening (or "sphering") matrix W is computed and applied to the data. Since the new data are spatially white and one is also looking for a white vector Y, the latter can be obtained only by an orthonormal transformation V of the whitened data because only orthonormal transforms can preserve the whiteness. Thus, in such a scheme, the separating matrix B is found as a product B = VW. This approach leads to interesting implementations because the whitening matrix can be obtained straightforwardly as any matrix square root of the inverse covariance matrix of X and the optimization of a contrast function with respect to an orthonormal matrix can also be implemented efficiently by the Jacobi technique described in section 4.
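A minimal sketch of the first (sphering) step is given below; the inverse symmetric square root of the sample covariance is used as W, which is one of the matrix square roots mentioned above.

```python
import numpy as np

def whiten(x):
    """Sphering step of the orthogonal approach: returns (z, W) with Z = WX spatially white."""
    xc = x - x.mean(axis=1, keepdims=True)
    R = xc @ xc.T / xc.shape[1]             # sample covariance of X
    d, E = np.linalg.eigh(R)                # R = E diag(d) E^T
    W = E @ np.diag(d ** -0.5) @ E.T        # W = R^(-1/2), a valid whitening matrix
    return W @ xc, W
```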

The orthonormal approach to ICA need not be implemented as a two-stage Jacobi-based procedure; it also exists as a one-stage gradient algorithm (see also Cardoso & Laheld, 1996). Assume that the relative/natural gradient of some contrast function leads to a particular function H(·) for the update rule, equation 1.1, with stationary points given by equation 1.2. Then the stationary points for the optimization of the same contrast function with respect to orthonormal transformations are characterized by E{H(Y) − H(Y)†} = 0, where the superscript † denotes transposition. On the other hand, for zero-mean variables, the whiteness constraint is EYY† = I, which we can also


write as EYY† − I = 0. Because EYY† − I is a symmetric matrix while E{H(Y) − H(Y)†} is a skew-symmetric matrix, the whiteness condition and the stationarity condition can be combined into a single one by just adding them. The resulting condition is E{YY† − I + H(Y) − H(Y)†} = 0. When it holds true, both the symmetric part and the skew-symmetric part cancel; the former expresses that Y is white, the latter that the contrast function is stationary with respect to all orthonormal transformations.

Thus, if the algorithm in equation 1.1 optimizes a given contrast function with H given by equation 1.4, then the same algorithm optimizes the same contrast function under the whiteness constraint with H given by

H(y) = yy† − I + ψ(y)y† − yψ(y)†. (1.5)

It is thus simple to implement orthogonal versions of gradient algorithms once a regular version is available.

1.3 Data-Based Versus Statistic-Based Techniques. Comon (1994) compares the data-based option and the statistic-based option for computing off-line an ICA of a batch x(1), . . . , x(T) of T samples; this article will also introduce a mixed strategy (see section 4.3). In the data-based option, successive linear transformations are applied to the data set until some criterion of independence is maximized. This is the iterative technique outlined above. Note that it is not necessary to update explicitly a separating matrix B in this scheme (although one may decide to do so in a particular implementation); the data themselves are updated until the average field (1/T) Σ_{t=1}^T H(y(t)) is small enough; the transform B is implicitly contained in the set of transformed data.

Another option is to summarize the data set into a smaller set of statistics computed once and for all from the data set; the algorithm then estimates a separating matrix as a function of these statistics without accessing the data. This option may be followed in cumulant-based algebraic techniques where the statistics are cumulants of X.

1.4 Outline of the Article. In section 2, the ICA problem is recast in the framework of (blind) identification, showing how entropic contrasts readily stem from the maximum likelihood (ML) principle. In section 3, high-order approximations to the entropic contrasts are given, and their algebraic structure is emphasized. Section 4 describes different flavors of Jacobi algorithms optimizing fourth-order contrast functions. A comparison between Jacobi techniques and a gradient-based algorithm is given in section 5 based on a real data set of electroencephalogram (EEG) recordings.


2 Contrast Functions and Maximum Likelihood Identification

Implicitly or explicitly, ICA tries to fit a model for the distribution of X that is a model of independent components: X = AS, where A is an invertible n × n matrix and S is an n × 1 vector with independent entries. Estimating the parameter A from samples of X yields a separating matrix B = A^{-1}. Even if the model X = AS is not expected to hold exactly for many real data sets, one can still use it to derive contrast functions. This section exhibits the contrast functions associated with the estimation of A by the ML principle (a more detailed exposition can be found in Cardoso, 1998). Blind separation based on ML was first considered by Gaeta and Lacoume (1990) (but the authors used cumulant approximations such as those described in section 3), Pham and Garat (1997), and Amari et al. (1996).

2.1 Likelihood. Assume that the probability distribution of each entry S_i of S has a density r_i(·).¹ Then the distribution P_S of the random vector S has a density r(·) of the form r(s) = ∏_{i=1}^n r_i(s_i), and the density of X for a given mixture A and a given probability density r(·) is:

p(x; A, r) = |det A|^{-1} r(A^{-1}x),   (2.1)

so that the (normalized) log-likelihood L_T(A, r) of T independent samples x(1), . . . , x(T) of X is

L_T(A, r) ≝ (1/T) Σ_{t=1}^T log p(x(t); A, r) = (1/T) Σ_{t=1}^T log r(A^{-1}x(t)) − log |det A|.   (2.2)
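For concreteness, the normalized log-likelihood of equation 2.2 can be evaluated as follows; the component densities r_i(s) = 1/(π cosh s) used here are an illustrative choice, not one made in the text.

```python
import numpy as np

def normalized_log_likelihood(A, x):
    """L_T(A, r) of equation 2.2 for an (n, T) sample array x and a candidate mixture A.

    The model densities are fixed here to r_i(s) = 1/(pi*cosh(s)) purely for illustration.
    """
    s = np.linalg.solve(A, x)                          # s(t) = A^{-1} x(t)
    log_r = -np.log(np.cosh(s)) - np.log(np.pi)        # log r_i(s_i(t))
    return log_r.sum(axis=0).mean() - np.log(abs(np.linalg.det(A)))
```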

Depending on the assumptions made about the densities r_1, . . . , r_n, several contrast functions can be derived from this log-likelihood.

2.2 Likelihood Contrast. Under mild assumptions, the normalized log-likelihood L_T(A, r), which is a sample average, converges for large T to its ensemble average by the law of large numbers:

L_T(A, r) = (1/T) Σ_{t=1}^T log r(A^{-1}x(t)) − log |det A|  →_{T→∞}  E log r(A^{-1}x) − log |det A|,   (2.3)

¹ All densities considered in this article are with respect to the Lebesgue measure on R or R^n.


which simple manipulations (Cardoso, 1997) show to be equal to −H(P_X) − K(P_Y|P_S). Here and in the following, H(·) and K(·|·), respectively, denote the differential entropy and the Kullback-Leibler divergence. Since H(P_X) does not depend on the model parameters, the limit for large T of −L_T(A, r) is, up to a constant, equal to

φ_ML(Y) ≝ K(P_Y|P_S).   (2.4)

Therefore, the principle of ML coincides with the minimization of a specific contrast function, which is nothing but the (Kullback) divergence K(P_Y|P_S) between the distribution P_Y of the output and a model distribution P_S. The classic entropic contrasts follow from this observation, depending on two options: (1) trying or not to estimate P_S from the data and (2) forcing or not the components to be uncorrelated.

2.3 Infomax. The technically simplest statistical assumption about P_S is to select fixed densities r_1, . . . , r_n for each component, possibly on the basis of prior knowledge. Then P_S is a fixed distributional assumption, and the minimization of φ_ML(Y) is performed only over P_Y via Y = BX. This can be rephrased: choose B such that Y = BX is as close as possible in distribution to the hypothesized model distribution P_S, the closeness in distribution being measured in the Kullback divergence. This is also the contrast function derived from the infomax principle by Bell and Sejnowski (1995). The connection between infomax and ML was noted in Cardoso (1997), MacKay (1996), and Pearlmutter and Parra (1996).

2.4 Mutual Information. The theoretically simplest statistical assumption about P_S is to assume no model at all. In this case, the Kullback mismatch K(P_Y|P_S) should be minimized not only by optimizing over B to change the distribution of Y = BX but also with respect to P_S. For each fixed B, that is, for each fixed distribution P_Y, the result of this minimization is theoretically very simple: the minimum is reached when P_S = P_Ȳ, where P_Ȳ denotes the distribution of independent components with each marginal distribution equal to the corresponding marginal distribution of Y. This stems from the property that

K(P_Y|P_S) = K(P_Y|P_Ȳ) + K(P_Ȳ|P_S)   (2.5)

for any distribution P_S with independent components (Cover & Thomas, 1991). Therefore, the minimum in P_S of K(P_Y|P_S) is reached by taking P_S = P_Ȳ since this choice ensures K(P_Ȳ|P_S) = 0. The value of φ_ML at this point then is

φ_MI(Y) ≝ min_{P_S} K(P_Y|P_S) = K(P_Y|P_Ȳ).   (2.6)


We use the index MI since this quantity is well known as the mutual information between the entries of Y. It was first proposed by Comon (1994), and it can be seen from the above as deriving from the ML principle when optimization is with respect to both the unknown system A and the distribution of S. This connection was also noted in Obradovic and Deco (1997), and the relation between infomax and mutual information is also discussed in Nadal and Parga (1994).

2.5 Minimum Marginal Entropy. An orthogonal contrast φ(Y) is, by definition, to be optimized under the constraint that Y is spatially white: orthogonal contrasts enforce decorrelation, that is, an exact "second-order" independence. Any regular contrast can be used under the whiteness constraint, but by taking the whiteness constraint into account, the contrast may be given a simpler expression. This is the case for some cumulant-based contrasts described in section 3. It is also the case for φ_MI(Y) because the mutual information can also be expressed as φ_MI(Y) = Σ_{i=1}^n H(P_{Y_i}) − H(P_Y); since the entropy H(P_Y) is constant under orthonormal transforms, it is equivalent to consider

φ_ME(Y) = Σ_{i=1}^n H(P_{Y_i})   (2.7)

to be optimized under the whiteness constraint EYY† = I. This contrast could be called orthogonal mutual information, or the marginal entropy contrast. The minimum entropy idea holds more generally under any volume-preserving transform (Obradovic & Deco, 1997).

2.6 Empirical Contrast Functions. Among all the above contrasts, only φ_ML or its orthogonal version are easily optimized by a gradient technique because the relative gradient of φ_ML simply is the matrix EH(Y) with H(·) defined in equation 1.4. Therefore, the relative gradient algorithm, equation 1.1, can be employed using either this function H(·) or its symmetrized form, equation 1.5, if one chooses to enforce decorrelation. However, this contrast is based on a prior guess P_S about the distribution of the components. If the guess is too far off, the algorithm will fail to discover independent components that might be present in the data. Unfortunately, evaluating the gradient of contrasts based on mutual information or minimum marginal entropy is more difficult because it does not reduce to the expectation of a simple function of Y; for instance, Pham (1996) minimizes explicitly the mutual information, but the algorithm involves a kernel estimation of the marginal distributions of Y. An intermediate approach is to consider a parametric estimation of these distributions, as in Moulines, Cardoso, and Gassiat (1997) or Pearlmutter and Parra (1996), for instance. Therefore, all these contrasts require that the distributions of components be known, or


approximated or estimated. As we shall see next, this is also what the cumulant approximations to contrast functions are implicitly doing.

3 Cumulants

This section presents higher-order approximations to entropic contrasts, some known and some novel. To keep the exposition simple, it is restricted to symmetric distributions (for which odd-order cumulants are identically zero) and to cumulants of orders 2 and 4. Recall that for random variables X_1, . . . , X_4, second-order cumulants are Cum(X_1, X_2) ≝ E X̃_1 X̃_2, where X̃_i ≝ X_i − EX_i, and the fourth-order cumulants are

Cum(X_1, X_2, X_3, X_4) = E X̃_1X̃_2X̃_3X̃_4 − E X̃_1X̃_2 E X̃_3X̃_4 − E X̃_1X̃_3 E X̃_2X̃_4 − E X̃_1X̃_4 E X̃_2X̃_3.   (3.1)

The variance and the kurtosis of a real random variable X are defined as

σ²(X) ≝ Cum(X, X) = E X̃²,
k(X) ≝ Cum(X, X, X, X) = E X̃⁴ − 3 (E X̃²)²,   (3.2)

that is, they are the second- and fourth-order autocumulants. A cumulant involving at least two different variables is called a cross-cumulant.
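Sample versions of these quantities are straightforward; the following sketch estimates the kurtosis of equation 3.2 and a general fourth-order cumulant of equation 3.1 from T observations.

```python
import numpy as np

def kurtosis(x):
    """Sample autocumulant k(X) = E X~^4 - 3 (E X~^2)^2, equation 3.2 (x: 1-d array of samples)."""
    xc = x - x.mean()
    return np.mean(xc**4) - 3.0 * np.mean(xc**2) ** 2

def cum4(x1, x2, x3, x4):
    """Sample fourth-order cumulant Cum(X1, X2, X3, X4), equation 3.1."""
    c = [xi - xi.mean() for xi in (x1, x2, x3, x4)]
    m = lambda *v: np.mean(np.prod(v, axis=0))        # moment of a product of centered variables
    return (m(*c)
            - m(c[0], c[1]) * m(c[2], c[3])
            - m(c[0], c[2]) * m(c[1], c[3])
            - m(c[0], c[3]) * m(c[1], c[2]))
```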

3.1 Cumulant-Based Approximations to Entropic Contrasts. Cumulants are useful in many ways. In this section, they show up because the probability density of a scalar random variable U close to the standard normal n(u) = (2π)^{-1/2} exp(−u²/2) can be approximated as

p(u) ≈ n(u) ( 1 + (σ²(U) − 1)/2 · h₂(u) + k(U)/4! · h₄(u) ),   (3.3)

where h₂(u) = u² − 1 and h₄(u) = u⁴ − 6u² + 3, respectively, are the second- and fourth-order Hermite polynomials. This expression is obtained by retaining the leading terms in an Edgeworth expansion (McCullagh, 1987). If U and V are two real random variables with distributions close to the standard normal, one can, at least formally, use expansion 3.3 to derive an approximation to K(P_U|P_V). This is

K(P_U|P_V) ≈ (1/4)(σ²(U) − σ²(V))² + (1/48)(k(U) − k(V))²,   (3.4)

which shows how the pair (σ², k) of cumulants of order 2 and 4 plays in some sense the role of a local coordinate system around n(u), with the quadratic


form 3.4 playing the role of a local metric. This result generalizes to multivariates, in which case we denote for conciseness R^U_{ij} = Cum(U_i, U_j) and Q^U_{ijkl} = Cum(U_i, U_j, U_k, U_l), and similarly for another random n-vector V with entries V_1, . . . , V_n. We give without proof the following approximation:

K(P_U|P_V) ≈ K_{24}(P_U|P_V) ≝ (1/4) Σ_{ij} (R^U_{ij} − R^V_{ij})² + (1/48) Σ_{ijkl} (Q^U_{ijkl} − Q^V_{ijkl})².   (3.5)

Expression 3.5 turns out to be the simplest possible multivariate generalization of equation 3.4 (the two terms in equation 3.5 are a double sum over all the n² pairs of indices and a quadruple sum over all the n⁴ quadruples of indices). Since the entropic contrasts listed above have all been derived from the Kullback divergence, cumulant approximations to all these contrasts can be obtained by replacing the Kullback mismatch K(P_U|P_V) by a cruder measure: its approximation as a cumulant mismatch, equation 3.5.

3.1.1 Approximation to the Likelihood Contrast. The infomax-ML contrast φ_ML(Y) = K(P_Y|P_S) for ICA (see equation 2.4) is readily approximated by using expression 3.5. The assumption P_S on the distribution of S is now replaced by an assumption about the cumulants of S. This amounts to very little: all the cross-cumulants of S being 0 thanks to the assumption of independent sources, it is needed only to specify the autocumulants σ²(S_i) and k(S_i). The cumulant approximation (see equation 3.5) to the infomax-ML contrast becomes:

φ_ML(Y) ≈ K_{24}(P_Y|P_S) = (1/4) Σ_{ij} (R^Y_{ij} − σ²(S_i)δ_{ij})² + (1/48) Σ_{ijkl} (Q^Y_{ijkl} − k(S_i)δ_{ijkl})²,   (3.6)

where the Kronecker symbol δ equals 1 with identical indices and 0 otherwise.

3.1.2 Approximation to the Mutual Information Contrast. The mutual information contrast φ_MI(Y) was obtained by minimizing K(P_Y|P_S) over all the distributions P_S with independent components. In the cumulant approximation, this is trivially done: the free parameters for P_S are σ²(S_i) and k(S_i). Each of these scalars enters in only one term of the sums in equation 3.6, so that the minimization is achieved for σ²(S_i) = R^Y_{ii} and k(S_i) = Q^Y_{iiii}. In other words, the construction of the best approximating distribution with independent marginals P_Ȳ, which appears in equation 2.5, boils down, in the cumulant approximation, to the estimation of the variance and kurtosis of each entry of Y. Fitting both σ²(S_i) and k(S_i) to R^Y_{ii} and Q^Y_{iiii}, respectively, has the effect of exactly cancelling the diagonal terms in equation 3.6, leaving only

φ_MI(Y) ≈ φ^{MI}_{24}(Y) ≝ (1/4) Σ_{ij≠ii} (R^Y_{ij})² + (1/48) Σ_{ijkl≠iiii} (Q^Y_{ijkl})²,   (3.7)

which is our cumulant approximation to the mutual information contrast in equation 2.6. The first term is understood as the sum over all the pairs of distinct indices; the second term is a sum over all quadruples of indices that are not all identical. It contains only off-diagonal terms, that is, cross-cumulants. Since cross-cumulants of independent variables identically vanish, it is not surprising to see the mutual information approximated by a sum of squared cross-cumulants.
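The approximation 3.7 is easy to evaluate numerically when n is small; the sketch below forms the full fourth-order cumulant tensor of Y (an O(n⁴T) computation, so purely illustrative) and sums the squared off-diagonal terms.

```python
import numpy as np

def phi_mi24(y):
    """Cumulant approximation to mutual information, equation 3.7 (y: (n, T) array)."""
    n, T = y.shape
    yc = y - y.mean(axis=1, keepdims=True)
    R = yc @ yc.T / T                                        # R^Y_ij
    M4 = np.einsum('it,jt,kt,lt->ijkl', yc, yc, yc, yc) / T
    Q = (M4 - np.einsum('ij,kl->ijkl', R, R)
            - np.einsum('ik,jl->ijkl', R, R)
            - np.einsum('il,jk->ijkl', R, R))                # Q^Y_ijkl, equation 3.1
    off2 = (R**2).sum() - (np.diag(R)**2).sum()              # sum over ij != ii
    off4 = (Q**2).sum() - sum(Q[i, i, i, i]**2 for i in range(n))  # sum over ijkl != iiii
    return off2 / 4.0 + off4 / 48.0
```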

3.1.3 Approximation to the Orthogonal Likelihood Contrast. The cumulant approximation to the orthogonal likelihood is fairly simple. The orthogonal approach consists of first enforcing the whiteness of Y, that is, R^Y_{ij} = δ_{ij} or R^Y = I. In other words, it consists of normalizing the components by assuming that σ²(S_i) = 1 and making sure the second-order mismatch is zero. This is equivalent to replacing the weight 1/4 in equation 3.6 by an infinite weight, hence reducing the problem to the minimization (under the whiteness constraint) of the fourth-order mismatch, or the second (quadruple) sum in equation 3.6. Thus, the orthogonal likelihood contrast is approximated by

φ^{OML}_{24}(Y) ≝ (1/48) Σ_{ijkl} (Q^Y_{ijkl} − k(S_i)δ_{ijkl})².   (3.8)

This contrast has an interesting alternate expression. Developing the squares gives

φ^{OML}_{24}(Y) = (1/48) Σ_{ijkl} (Q^Y_{ijkl})² + (1/48) Σ_{ijkl} k²(S_i)δ²_{ijkl} − (2/48) Σ_{ijkl} k(S_i)δ_{ijkl}Q^Y_{ijkl}.

The first sum above is constant under the whiteness constraint (this is readily checked using equation 3.13 for an orthonormal transform), and the second sum does not depend on Y; finally, the last sum contains only diagonal nonzero terms. It follows that:

φ^{OML}_{24}(Y) =c −(1/24) Σ_i k(S_i) Q^Y_{iiii} = −(1/24) Σ_i k(S_i) k(Y_i) =c −(1/24) Σ_i k(S_i) E Y_i⁴,   (3.9)


where =c denotes an equality up to a constant. An interpretation of the second equality is that the contrast is minimized by maximizing the scalar product between the vector [k(Y_1), . . . , k(Y_n)] of the kurtosis of the components and the corresponding vector of hypothesized kurtosis [k(S_1), . . . , k(S_n)]. The last equality stems from the definition in equation 3.2 of the kurtosis and the constancy of E Y_i² under the whiteness constraint. This last form is remarkable because it shows that for zero-mean observations, φ^{OML}_{24}(Y) =c E l(Y), where l(Y) = −(1/24) Σ_i k(S_i) Y_i⁴, so the contrast is just the expectation of a simple function of Y. We can expect simple techniques for its maximization.

3.1.4 Approximation to the Minimum Marginal Entropy Contrast. Under the whiteness constraint, the first sum in the approximation, equation 3.7, is zero (this is the whiteness constraint) so that the approximation to the mutual information φ_MI(Y) reduces to the last term:

φ_ME(Y) ≈ φ^{ME}_{24}(Y) ≝ (1/48) Σ_{ijkl≠iiii} (Q^Y_{ijkl})² =c −(1/48) Σ_i (Q^Y_{iiii})².   (3.10)

Again, the last equality up to a constant follows from the constancy of Σ_{ijkl} (Q^Y_{ijkl})² under the whiteness constraint. These approximations had already been obtained by Comon (1994) from an Edgeworth expansion. They say something simple: Edgeworth expansions suggest testing the independence between the entries of Y by summing up all the squared cross-cumulants.

In the course of this article, we will find two similar contrast functions. The JADE contrast,

φ_JADE(Y) ≝ Σ_{ijkl≠iikl} (Q^Y_{ijkl})²,   (3.11)

also is a sum of squared cross-cumulants (the notation indicates that the sum is over all the quadruples (ijkl) of indices with i ≠ j). Its interest is to be also a criterion of joint diagonality of cumulant matrices. The SHIBBS criterion,

φ_SH(Y) ≝ Σ_{ijkl≠iikk} (Q^Y_{ijkl})²,   (3.12)

is also introduced in section 4.3 as governing a similar but less memory-demanding algorithm. It also involves only cross-cumulants: those with indices (ijkl) such that i ≠ j or k ≠ l.

3.2 Cumulants and Algebraic Structures. Previous sections reviewed the use of cumulants in designing contrast functions. Another thread of ideas using cumulants stems from the method of moments. Such an approach is called for by the multilinearity of the cumulants. Under a linear


transform Y = BX, which also reads Y_i = Σ_p b_{ip} X_p, the cumulants of order 4 (for instance) transform as:

Cum(Y_i, Y_j, Y_k, Y_l) = Σ_{pqrs} b_{ip} b_{jq} b_{kr} b_{ls} Cum(X_p, X_q, X_r, X_s),   (3.13)

which can easily be exploited for our purposes since the ICA model is linear. Using this fact and the assumption of independence by which Cum(S_p, S_q, S_r, S_s) = k(S_p) δ(p, q, r, s), we readily obtain the simple algebraic structure of the cumulants of X = AS when S has independent entries,

Cum(X_i, X_j, X_k, X_l) = Σ_{u=1}^n k(S_u) a_{iu} a_{ju} a_{ku} a_{lu},   (3.14)

where a_{ij} denotes the (ij)th entry of matrix A. When estimates Cum(X_i, X_j, X_k, X_l) are available, one may try to solve equation 3.14 for the coefficients a_{ij} of A. This is tantamount to cumulant matching on the empirical cumulants of X. Because of the strong algebraic structure of equation 3.14, one may try to devise fourth-order factorizations akin to the familiar second-order singular value decomposition (SVD) or eigenvalue decomposition (EVD) (see Cardoso, 1992; Comon, 1997; De Lathauwer, De Moor, & Vandewalle, 1996). However, these approaches are generally not equivalent to the optimization of a contrast function, resulting in estimates that are generally not equivariant (Cardoso, 1995). This point is illustrated below; we introduce cumulant matrices whose simple structure offers straightforward identification techniques, but we stress, as one of their important drawbacks, their lack of equivariance. However, we conclude by showing how the algebraic point of view and the statistical (equivariant) point of view can be reconciled.

3.2.1 Cumulant Matrices. The algebraic nature of cumulants is tensorial (McCullagh, 1987), but since we will concern ourselves mainly with second- and fourth-order statistics, a matrix-based notation suffices for the purpose of our exposition; we only introduce the notion of cumulant matrix, defined as follows. Given a random n × 1 vector X and any n × n matrix M, we define the associated cumulant matrix Q_X(M) as the n × n matrix defined component-wise by

[Q_X(M)]_{ij} ≝ Σ_{k,l=1}^n Cum(X_i, X_j, X_k, X_l) M_{kl}.   (3.15)

If X is centered, the definition in equation 3.1 shows that

Q_X(M) = E{(X†MX) XX†} − R_X tr(M R_X) − R_X M R_X − R_X M† R_X,   (3.16)


where tr(·) denotes the trace and R_X denotes the covariance matrix of X, that is, [R_X]_{ij} = Cum(X_i, X_j). Equation 3.16 could have been chosen as an index-free definition of cumulant matrices. It shows that a given cumulant matrix can be computed or estimated at a cost similar to the estimation cost of a covariance matrix; there is no need to compute the whole set of fourth-order cumulants to obtain the value of Q_X(M) for a particular value of M. Actually, estimating a particular cumulant matrix is one way of collecting part of the fourth-order information in X; collecting the whole fourth-order information requires the estimation of O(n⁴) fourth-order cumulants.
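The index-free form 3.16 translates directly into a one-pass sample estimator, sketched below; no n⁴-element cumulant array is formed.

```python
import numpy as np

def cumulant_matrix(x, M):
    """Sample estimate of Q_X(M), equation 3.16 (x: (n, T) data array, M: (n, n) matrix)."""
    n, T = x.shape
    xc = x - x.mean(axis=1, keepdims=True)
    R = xc @ xc.T / T                                    # covariance R_X
    w = np.einsum('it,ij,jt->t', xc, M, xc)              # x(t)^T M x(t) for each sample
    E = (xc * w) @ xc.T / T                              # E{ (X^T M X) X X^T }
    return E - R * np.trace(M @ R) - R @ M @ R - R @ M.T @ R
```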

The structure of a cumulant matrix Q_X(M) in the ICA model is easily deduced from equation 3.14:

Q_X(M) = A Λ(M) A†,   Λ(M) = Diag(k(S_1) a_1†Ma_1, . . . , k(S_n) a_n†Ma_n),   (3.17)

where a_i denotes the ith column of A, that is, A = [a_1, . . . , a_n]. In this factorization, the (generally unknown) kurtosis values enter only in the diagonal matrix Λ(M), a fact implicitly exploited by the algebraic techniques described below.

3.3 Blind Identification Using Algebraic Structures. In section 3.1, contrast functions were derived from the ML principle assuming the model X = AS. In this section, we proceed similarly: we consider cumulant-based blind identification of A assuming X = AS, from which the structures 3.14 and 3.17 result.

Recall that the orthogonal approach can be implemented by first sphering the vector X explicitly. Let W be a whitening matrix, and denote Z ≝ WX the sphered vector. Without loss of generality, the model can be normalized by assuming that the entries of S have unit variance so that S is spatially white. Since Z = WX = WAS is also white by construction, the matrix U ≝ WA must be orthonormal: UU† = I. Therefore sphering yields the model Z = US with U orthonormal. Of course, this is still a model of independent components so that, similar to equation 3.17, we have for any matrix M the structure of the corresponding cumulant matrix of Z,

Q_Z(M) = U Λ(M) U†,   Λ(M) = Diag(k(S_1) u_1†Mu_1, . . . , k(S_n) u_n†Mu_n),   (3.18)

where u_i denotes the ith column of U. In a practical orthogonal statistic-based technique, one would first estimate a whitening matrix W, estimate some cumulants of Z = WX, compute an orthonormal estimate Û of U using these cumulants, and finally obtain an estimate Â of A as Â = W^{-1}Û or obtain a separating matrix as B = Û^{-1}W = Û†W.


3.3.1 Nonequivariant Blind Identification Procedures. We first present two blind identification procedures that exploit in a straightforward manner the structure 3.17; we explain why, in spite of attractive computational simplicity, they are not well behaved (not equivariant) and how they can be fixed for equivariance.

The first idea is not based on an orthogonal approach. Let M_1 and M_2 be two arbitrary n × n matrices, and define Q_1 ≝ Q_X(M_1) and Q_2 ≝ Q_X(M_2). According to equation 3.17, if X = AS we have Q_1 = AΛ_1A† and Q_2 = AΛ_2A† with Λ_1 and Λ_2 two diagonal matrices. Thus, G ≝ Q_1Q_2^{-1} = (AΛ_1A†)(AΛ_2A†)^{-1} = AΛA^{-1}, where Λ is the diagonal matrix Λ_1Λ_2^{-1}. It follows that GA = AΛ, meaning that the columns of A are the eigenvectors of G (possibly up to scale factors).

An extremely simple algorithm for blind identification of A follows: select two arbitrary matrices M_1 and M_2; compute sample estimates Q̂_1 and Q̂_2 using equation 3.16; find the columns of A as the eigenvectors of Q̂_1Q̂_2^{-1}. There is at least one problem with this idea: we have assumed invertible matrices throughout the derivation, and this may lead to instability. However, this specific problem may be fixed by sphering, as examined next.
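A literal transcription of this simple (and, as stressed below, nonequivariant) procedure follows; the two matrices M_1 and M_2 are drawn at random here, and small imaginary parts may appear in eigenvectors computed from sample statistics.

```python
import numpy as np

def two_matrix_identification(x, seed=0):
    """Estimate the mixing matrix A (up to scale and permutation) as eigenvectors of Q1 Q2^{-1}."""
    rng = np.random.default_rng(seed)
    n, T = x.shape
    xc = x - x.mean(axis=1, keepdims=True)
    R = xc @ xc.T / T

    def Q(M):                                            # sample version of equation 3.16
        w = np.einsum('it,ij,jt->t', xc, M, xc)
        return (xc * w) @ xc.T / T - R * np.trace(M @ R) - R @ M @ R - R @ M.T @ R

    Q1 = Q(rng.standard_normal((n, n)))
    Q2 = Q(rng.standard_normal((n, n)))
    _, A_hat = np.linalg.eig(Q1 @ np.linalg.inv(Q2))
    return A_hat.real           # columns estimate those of A; no equivariance guarantee (see text)
```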

Consider now the orthogonal approach as outlined above. Let M be some arbitrary matrix, and note that equation 3.18 is an eigendecomposition: the columns of U are the eigenvectors of Q_Z(M), which are orthonormal indeed because Q_Z(M) is symmetric. Thus, in the orthogonal approach, another immediate algorithm for blind identification is to estimate U as an (orthonormal) diagonalizer of an estimate of Q_Z(M). Thanks to sphering, problems associated with matrix inversion disappear, but a deeper problem associated with these simple algebraic ideas remains and must be addressed. Recall that the eigenvectors are uniquely determined² if and only if the eigenvalues are all distinct. Therefore, we need to make sure that the eigenvalues of Q_Z(M) are all distinct in order to preserve blind identifiability based on Q_Z(M). According to equation 3.18, these eigenvalues depend on the (sphered) system, which is unknown. Thus, it is not possible to determine a priori if a given matrix M corresponds to distinct eigenvalues of Q_Z(M). Of course, if M is randomly chosen, then the eigenvalues are distinct with probability 1, but we need more than this in practice because the algorithms use only sample estimates of the cumulant matrices. A small error in the sample estimate of Q_Z(M) can induce a large deviation of the eigenvectors if the eigenvalues are not well enough separated. Again, this is impossible to guarantee a priori because an appropriate selection of M requires prior knowledge about the unknown mixture.

² In fact, determined only up to permutations and signs that do not matter in an ICA context.

In summary, the diagonalization of a single cumulant matrix is computationally attractive and can be proved to be almost surely consistent, but it is not satisfactory because the nondegeneracy of the spectrum cannot be controlled. As a result, the estimation accuracy from a finite number of samples depends on the unknown system and is therefore unpredictable in practice; this lack of equivariance is hardly acceptable. One may also criticize these approaches on the ground that they rely on only a small part of the fourth-order information (summarized in an n × n cumulant matrix) rather than trying to exploit more cumulants (there are O(n⁴) fourth-order independent cumulant statistics). We examine next how these two problems can be alleviated by jointly processing several cumulant matrices.

3.3.2 Recovering Equivariance. Let M = {M_1, . . . , M_P} be a set of P matrices of size n × n and denote Q_i ≝ Q_Z(M_i) for 1 ≤ i ≤ P the associated cumulant matrices for the sphered data Z = US. Again, as above, for all i we have Q_i = UΛ_iU† with Λ_i a diagonal matrix given by equation 3.18. As a measure of nondiagonality of a matrix F, define Off(F) as the sum of the squares of the nondiagonal elements:

Off(F) ≝ Σ_{i≠j} (f_{ij})².   (3.19)

We have in particular Off(U†Q_iU) = Off(Λ_i) = 0 since Q_i = UΛ_iU† and U†U = I. For any matrix set M and any orthonormal matrix V, we define the following nonnegative joint diagonality criterion,

D_M(V) ≝ Σ_{M_i∈M} Off(V†Q_Z(M_i)V),   (3.20)

which measures how close to diagonality an orthonormal matrix V can simultaneously bring the cumulant matrices generated by M.

To each matrix set M is associated a blind identification algorithm as follows: (1) find a sphering matrix W to whiten the data X into Z = WX; (2) estimate the cumulant matrices Q_Z(M) for all M ∈ M by a sample version of equation 3.16; (3) minimize the joint diagonality criterion, equation 3.20, that is, make the cumulant matrices as diagonal as possible by an orthonormal transform V; (4) estimate A as Â = W^{-1}V, or its inverse as B = V†W, or the component vector as Y = V†Z = V†WX.

Such an approach seems to be able to alleviate the drawbacks mentioned above. Finding the orthonormal transform as the minimizer of the joint diagonality criterion of a set of cumulant matrices goes in the right direction because it involves a larger number of fourth-order statistics and because it decreases the likelihood of degenerate spectra. This argument can be made rigorous by considering a maximal set of cumulant matrices. By definition, this is a set obtained whenever M is an orthonormal basis for the linear space of n × n matrices. Such a basis contains n² matrices so that the corresponding cumulant matrices total


n² × n² = n⁴ entries, that is, as many as the whole fourth-order cumulant set. For any such maximal set (Cardoso & Souloumiac, 1993):

D_M(V) = φ_JADE(Y) with Y = V†Z,   (3.21)

where φ_JADE(Y) is the contrast function defined at equation 3.11. The joint diagonalization of a maximal set guarantees blind identifiability of A if k(S_i) = 0 for at most one entry S_i of S (Cardoso & Souloumiac, 1993). This is a necessary condition for any algorithm using only second- and fourth-order statistics (Comon, 1994).

A key point is made by relationship 3.21. We managed to turn an algebraic property (diagonality) of the cumulants of the (sphered) observations into a contrast function: a functional of the distribution of the output Y = V†Z. This fact guarantees that the resulting estimates are equivariant (Cardoso, 1995).

The price to pay with this technique for reconciling the algebraic approach with the naturally equivariant contrast-based approach is twofold: it entails the computation of a large (actually, maximal) set of cumulant matrices and the joint diagonalization of P = n² matrices, which is at least as costly as P times the diagonalization of a single matrix. However, the overall computational burden may be similar (see examples in section 5) to the cost of adaptive algorithms. This is because the cumulant matrices need to be estimated only once for a given data set and because there exists a reasonably efficient joint diagonalization algorithm (see section 4) that is not based on gradient-style optimization; it thus preserves the possibility of exploiting the underlying algebraic nature of the contrast function, equation 3.11. Several tricks for increasing efficiency are also discussed in section 4.

4 Jacobi Algorithms

This section describes algorithms for ICA sharing a common feature: a Jacobi optimization of an orthogonal contrast function, as opposed to optimization by gradient-like algorithms. The principle of Jacobi optimization is applied to a data-based algorithm, a statistic-based algorithm, and a mixed approach.

The Jacobi method is an iterative technique of optimization over the set of orthonormal matrices. The orthonormal transform is obtained as a sequence of plane rotations. Each plane rotation is a rotation applied to a pair of coordinates (hence the name: the rotation operates in a two-dimensional plane). If Y is an n × 1 vector, the (i, j)th plane rotation by an angle θ_ij changes the coordinates i and j of Y according to

[ Y_i ]      [  cos(θ_ij)   sin(θ_ij) ] [ Y_i ]
[ Y_j ]  ←   [ −sin(θ_ij)   cos(θ_ij) ] [ Y_j ] ,   (4.1)

while leaving the other coordinates unchanged. A sweep is one pass through


all the n(n − 1)/2 possible pairs of distinct indices. This idea is classic in numerical analysis (Golub & Van Loan, 1989); it can be considered in a wider context for the optimization of any function of an orthonormal matrix. Comon introduced the Jacobi technique for ICA (see Comon, 1994, for a data-based algorithm and an earlier reference in it for the Jacobi update of high-order cumulant tensors). Such a data-based Jacobi algorithm for ICA works through a sequence of Jacobi sweeps on the sphered data until a given orthogonal contrast φ(Y) is optimized. This can be summarized as:

1. Initialization. Compute a whitening matrix W and set Y = WX.

2. One sweep. For all n(n − 1)/2 pairs, that is, for 1 ≤ i < j ≤ n, do:

a. Compute the Givens angle θ_ij, optimizing φ(Y) when the pair (Y_i, Y_j) is rotated.

b. If |θ_ij| > θ_min, rotate the pair (Y_i, Y_j) according to equation 4.1.

3. If no pair has been rotated in the previous sweep, end. Otherwise go to 2 for another sweep.

Thus, the Jacobi approach considers a sequence of two-dimensional ICA problems. Of course, the updating step 2b on a pair (i, j) partially undoes the effect of previous optimizations on pairs containing either i or j. For this reason, it is necessary to go through several sweeps before optimization is completed. However, Jacobi algorithms are often very efficient and converge in a small number of sweeps (see the examples in section 5), and a key point is that each plane rotation depends on a single parameter, the Givens angle θ_ij, reducing the optimization subproblem at each step to a one-dimensional optimization problem.

An important benefit of basing ICA on fourth-order contrasts becomes apparent: because fourth-order contrasts are polynomial in the parameters, the Givens angles can often be found in closed form.

In the above scheme, θ_min is a small angle, which controls the accuracy of the optimization. In numerical analysis, it is determined according to machine precision. For a statistical problem such as ICA, θ_min should be selected in such a way that rotations by a smaller angle are not statistically significant.

In our experiments, we take θ_min to scale as 1/√T, typically: θ_min = 10⁻²/√T.

This scaling can be related to the existence of a performance bound in the orthogonal approach to ICA (Cardoso, 1994). This value does not seem to be critical, however, because we have found Jacobi algorithms to be very fast at finishing.

In the remainder of this section, we describe three possible implementations of these ideas. Each one corresponds to a different type of contrast function and to different options about updating. Section 4.1 describes a data-based algorithm optimizing φ^{OML}_{24}(Y); section 4.2 describes a statistic-based algorithm optimizing φ_JADE(Y); section 4.3 presents a mixed approach


optimizing φ_SH(Y); finally, section 4.4 discusses the relationships between these contrast functions.

4.1 A Data-Based Jacobi Algorithm: MaxKurt. We start by a Jacobi technique for optimizing the approximation, equation 3.9, to the orthogonal likelihood. For the sake of exposition, we consider a simplified version of φ^{OML}_{24}(Y) obtained by setting k(S_1) = k(S_2) = · · · = k(S_n) = k, in which case the minimization of the contrast function, equation 3.9, is equivalent to the minimization of

φ_MK(Y) ≝ −k Σ_i Q^Y_{iiii}.   (4.2)

This criterion is also studied by Moreau and Macchi (1996), who propose a two-stage adaptive procedure for its optimization; it also serves as a starting point for introducing the one-stage adaptive algorithm of Cardoso and Laheld (1996).

Denote G_ij(θ) the plane rotation matrix that rotates the pair (i, j) by an angle θ as in step 2b above. Then simple trigonometry yields:

φ_MK(G_ij(θ)Y) = μ_ij − k λ_ij cos(4(θ − Ω_ij)),   (4.3)

where μ_ij does not depend on θ and λ_ij is nonnegative. The principal determination of the angle Ω_ij is characterized by

Ω_ij = (1/4) arctan( 4Q^Y_{iiij} − 4Q^Y_{ijjj} ,  Q^Y_{iiii} + Q^Y_{jjjj} − 6Q^Y_{iijj} ),   (4.4)

where arctan(y, x) denotes the angle α ∈ (−π, π] such that cos(α) = x/√(x² + y²) and sin(α) = y/√(x² + y²).

If Y is a zero-mean sphered vector, expression 4.4 further simplifies to

Ω_ij = (1/4) arctan( 4 E(Y_i³Y_j − Y_iY_j³) ,  E((Y_i² − Y_j²)² − 4Y_i²Y_j²) ).   (4.5)

The computations are given in the appendix. It is now immediate to minimize φ_MK(Y) for each pair of components and for either choice of the sign of k. If one looks for components with positive kurtosis (often called supergaussian), the minimization of φ_MK(Y) is identical to the maximization of the sum of the kurtosis of the components since we have k > 0 in this case. The Givens angle simply is θ = Ω_ij since this choice makes the cosine in equation 4.3 equal to its maximum value.
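The resulting sweep is short enough to be transcribed here in NumPy (the author's reference implementation is the Matlab listing in the appendix); the version below treats the k > 0 case.

```python
import numpy as np

def maxkurt(z, theta_min=None, max_sweeps=100):
    """MaxKurt Jacobi sweeps on sphered data z (shape (n, T)), positive-kurtosis case."""
    y = z.copy()
    n, T = y.shape
    if theta_min is None:
        theta_min = 1e-2 / np.sqrt(T)                    # scaling discussed in the text
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                yi, yj = y[i].copy(), y[j].copy()
                num = 4.0 * np.mean(yi**3 * yj - yi * yj**3)
                den = np.mean((yi**2 - yj**2)**2 - 4.0 * yi**2 * yj**2)
                theta = 0.25 * np.arctan2(num, den)      # Givens angle, equation 4.5
                if abs(theta) > theta_min:
                    rotated = True
                    c, s = np.cos(theta), np.sin(theta)
                    y[i] = c * yi + s * yj               # plane rotation of the pair, equation 4.1
                    y[j] = -s * yi + c * yj
        if not rotated:
            break
    return y
```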

We refer to the Jacobi algorithm outlined above as MaxKurt. A Matlab implementation is listed in the appendix, whose simplicity is consistent with the data-based approach. Note, however, that it is also possible to use


the same computations in a statistic-based algorithm. Rather than rotating the data themselves at each step by equation 4.1, one instead updates the set of all fourth-order cumulants according to the transformation law, equation 3.13, with the Givens angle for each pair still given by equation 4.4. In this case, the memory requirement is O(n⁴) for storing all the cumulants, as opposed to nT for storing the data set.

The case k < 0, where, looking for light-tailed components, one should minimize the sum of the kurtosis, is similar. This approach could be extended to kurtosis of mixed signs, but the contrast function then has less symmetry. This is not included in this article.

4.1.1 Stability. What is the effect of the approximation of equal kurtosis made to derive the simple contrast φ_MK(Y)? When X = AS with S of independent components, we can at least use the stability result of Cardoso and Laheld (1996), which applies directly to this contrast. Define the normalized kurtosis as κ_i = σ_i^{-4} k(S_i). Then B = A^{-1} is a stable point of the algorithm with k > 0 if κ_i + κ_j > 0 for all pairs 1 ≤ i < j ≤ n. The same condition also holds with all signs reversed for components with negative kurtosis.

4.2 A Statistic-Based Algorithm: JADE. This section outlines the JADE algorithm (Cardoso & Souloumiac, 1993), which is specifically a statistic-based technique. We do not need to go into much detail because the general technique follows directly from the considerations of section 3.3. The JADE algorithm can be summarized as:

1. Initialization. Estimate a whitening matrix W and set Z = WX.

2. Form statistics. Estimate a maximal set {Q̂_i^Z} of cumulant matrices.

3. Optimize an orthogonal contrast. Find the rotation matrix V̂ such that the cumulant matrices are as diagonal as possible, that is, solve V̂ = arg min_V Σ_i Off(V†Q̂_i^Z V).

4. Separate. Estimate A as Â = W^{-1}V̂ and/or estimate the components as Ŝ = Â^{-1}X = V̂†Z.

This is a Jacobi algorithm because the joint diagonalizer at step 3 is found by a Jacobi technique. However, the plane rotations are applied not to the data (which are summarized in the cumulant matrices) but to the cumulant matrices themselves; the algorithm updates not data but matrix-valued statistics of the data. As with MaxKurt, the Givens angle at each step can be computed in closed form, even in the case of possibly complex matrices (Cardoso & Souloumiac, 1993). The explicit expression for the Givens angles is not particularly enlightening and is not reported here. (The interested reader is referred to Cardoso & Souloumiac, 1993, and may request a Matlab implementation from the author.)
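Since the Givens angle is not reproduced in this article, the sketch below uses the closed-form rotation for jointly diagonalizing real symmetric matrices given in Cardoso and Souloumiac (1993); it is a bare-bones illustration of step 3, not the author's reference code.

```python
import numpy as np

def joint_diagonalize(Qs, theta_min=1e-8, max_sweeps=100):
    """Return an orthonormal V making the matrices in Qs as diagonal as possible (Jacobi sweeps)."""
    A = np.stack([np.asarray(Q, dtype=float) for Q in Qs])   # shape (P, n, n)
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                # closed-form angle for the (i, j) subproblem (Cardoso & Souloumiac, 1993)
                g1 = A[:, i, i] - A[:, j, j]
                g2 = A[:, i, j] + A[:, j, i]
                ton = np.sum(g1 * g1 - g2 * g2)
                toff = 2.0 * np.sum(g1 * g2)
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                if abs(theta) > theta_min:
                    rotated = True
                    c, s = np.cos(theta), np.sin(theta)
                    Ri, Rj = A[:, i, :].copy(), A[:, j, :].copy()
                    A[:, i, :], A[:, j, :] = c * Ri + s * Rj, -s * Ri + c * Rj   # rotate rows
                    Ci, Cj = A[:, :, i].copy(), A[:, :, j].copy()
                    A[:, :, i], A[:, :, j] = c * Ci + s * Cj, -s * Ci + c * Cj   # rotate columns
                    Vi, Vj = V[:, i].copy(), V[:, j].copy()
                    V[:, i], V[:, j] = c * Vi + s * Vj, -s * Vi + c * Vj         # accumulate V
        if not rotated:
            break
    return V
```

A separating matrix then follows as B = V†W, in line with step 4.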


A key issue is the selection of the cumulant matrices to be involved in the estimation. As explained in section 3.2, the joint diagonalization criterion Σ_i Off(V†Q_i^Z V) is made identical to the contrast function, equation 3.11, by using a maximal set of cumulant matrices. This is a bit surprising but very fortunate. We do not know of any other way for a priori selecting cumulant matrices that would offer such a property (but see the next section). In any case, it guarantees equivariant estimates because the algorithm, although operating on statistics of the sphered data, also optimizes implicitly a function of Y = V†Z only.

Before proceeding, we note that true cumulant matrices can be exactly jointly diagonalized when the model holds, but this is no longer the case when we process real data. First, only sample statistics are available; second, the model X = AS with independent entries in S cannot be expected to hold accurately in general. This is another reason that it is important to select cumulant matrices such that Σ_i Off(V†Q_i^Z V) is a contrast function. In this case, the impossibility of an exact joint diagonalization corresponds to the impossibility of finding Y = BX with independent entries. Making a maximal set of cumulant matrices as diagonal as possible coincides with making the entries of Y as independent as possible as measured by (the sample version of) criterion 3.11.

There are several options for estimating a maximal set of cumulant matrices. Recall that such a set is defined as {Q_Z(M_i) | i = 1, . . . , n²}, where {M_i | i = 1, . . . , n²} is any basis for the n²-dimensional linear space of n × n matrices. A canonical basis for this space is {e_p e_q† | 1 ≤ p, q ≤ n}, where e_p is a column vector with a 1 in pth position and 0's elsewhere. It is readily checked that

[Q_Z(e_p e_q†)]_{ij} = Cum(Z_i, Z_j, Z_p, Z_q).   (4.6)

In other words, the entries of the cumulant matrices for the canonical basis are just the cumulants of Z. A better choice is to consider a symmetric/skew-symmetric basis. Denote M_pq an n × n matrix defined as follows: M_pq = e_p e_p† if p = q, M_pq = 2^{-1/2}(e_p e_q† + e_q e_p†) if p < q, and M_pq = 2^{-1/2}(e_p e_q† − e_q e_p†) if p > q. This is an orthonormal basis of R^{n×n}. We note that because of the symmetries of the cumulants Q_Z(e_p e_q†) = Q_Z(e_q e_p†), so that Q_Z(M_pq) = 2^{1/2} Q_Z(e_p e_q†) if p < q and Q_Z(M_pq) = 0 if p > q. It follows that the cumulant matrices Q_Z(M_pq) for p > q do not even need to be computed. Being identically zero, they do not enter in the joint diagonalization criterion. It is therefore sufficient to estimate and to diagonalize n + n(n−1)/2 (symmetric) cumulant matrices.

There is another idea to reduce the size of the statistics needed to represent exhaustively the fourth-order information. It is, however, applicable only when the model X = AS holds. In this case, the cumulant matrices do have the structure shown at equation 3.18, and their sample estimates are close to it for large enough T. Then the linear mapping M → Q_Z(M) has rank n


(more precisely, its rank is equal to the number of components with nonzero kurtosis) because there are n linear degrees of freedom for matrices in the form UΛU†, namely, the n diagonal entries of Λ. From this fact and from the symmetries of the cumulants, it follows that there exist n eigenmatrices E_1, . . . , E_n, which are orthonormal and satisfy Q_Z(E_i) = μ_iE_i, where the scalar μ_i is the corresponding eigenvalue. These matrices E_1, . . . , E_n span the range of the mapping M → Q_Z(M), and any matrix M orthogonal to them is in the kernel, that is, Q_Z(M) = 0. This shows that all the information contained in Q_Z can be summarized by the n eigenmatrices associated with the n nonzero eigenvalues. By inserting M = u_iu_i† in expression 3.18 and using the orthonormality of the columns of U (that is, u_i†u_j = δ_ij), it is readily checked that a set of eigenmatrices is {E_i = u_iu_i†}.

The JADE algorithm was originally introduced as performing ICA by a joint approximate diagonalization of eigenmatrices in Cardoso and Souloumiac (1993), where we advocated the joint diagonalization of only the n most significant eigenmatrices of Q_Z as a device to reduce the computational load (even though the eigenmatrices are obtained at the extra cost of the eigendecomposition of an n² × n² array containing all the fourth-order cumulants). The number of statistics is reduced from n⁴ cumulants, or n(n + 1)/2 symmetric cumulant matrices of size n × n, to a set of n eigenmatrices of size n × n. Such a reduction is achieved at no statistical loss (at least for large T) only when the model holds. Therefore, we do not recommend reduction to eigenmatrices when processing data sets for which it is not clear a priori whether the model X = AS actually holds to good accuracy. We still refer to JADE as the process of jointly diagonalizing a maximal set of cumulant matrices, even when it is not further reduced to the n most significant eigenmatrices. It should also be pointed out that the device of truncating the full cumulant set by reduction to the most significant matrices is expected to destroy the equivariance property when the model does not hold. The next section shows how these problems can be overcome in a technique borrowing from both the data-based approach and the statistic-based approach.

4.3 A Mixed Approach: SHIBBS. In the JADE algorithm, a maximal set of cumulant matrices is computed as a way to ensure equivariance from the joint diagonalization of a fixed set of cumulant matrices. As a benefit, cumulants are computed only once in a single pass through the data set, and the Jacobi updates are performed on these statistics rather than on the whole data set. This is a good thing for data sets with a large number T of samples. On the other hand, estimating a maximal set requires O(n⁴T) operations, and its storage requires O(n⁴) memory positions. These figures can become prohibitive when looking for a large number of components. In contrast, gradient-based techniques have to store and update nT samples. This section describes a technique standing between the two extreme positions represented by the all-statistic approach and the all-data approach.


Recall that an algorithm is equivariant as soon as its operation can be expressed only in terms of the extracted components Y (Cardoso, 1995). This suggests the following technique:

1. Initialization. Select a fixed set M = {M_1, . . . , M_P} of n × n matrices. Estimate a whitening matrix W and set Y = WX.

2. Estimate a rotation. Estimate the set {Q_Y(M_p) | 1 ≤ p ≤ P} of P cumulant matrices and find a joint diagonalizer V of it.

3. Update. If V is close enough to the identity transform, stop. Otherwise, rotate the data: Y ← V†Y and go to 2.

Such an algorithm is equivariant thanks to the reestimation of the cumulants of Y after updating. It is in some sense data based since the updating in step 3 is on the data themselves. However, the rotation matrix to be applied to the data is computed in step 2 as in a statistic-based procedure.

What would be a good choice for the set M? The set of n matrices M = {e_1e_1†, . . . , e_ne_n†} seems a natural choice: it is an order of magnitude smaller than the maximal set, which contains O(n²) matrices. The kth cumulant matrix in such a set is Q_Y(e_ke_k†), and its (i, j)th entry is Cum(Y_i, Y_j, Y_k, Y_k), which is just an n × n square block of cumulants of Y. We call the set of n cumulant matrices obtained in this way when k is shifted from 1 to n the set of SHIfted Blocks for Blind Separation (SHIBBS), and we use the same name for the ICA algorithm that determines the rotation V by an iterative joint diagonalization of the SHIBBS set.
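A sketch of the SHIBBS statistics is given below: it forms the n cumulant matrices Q_Y(e_ke_k†) directly from the current components, and these can then be passed to a joint diagonalizer as in step 2 above.

```python
import numpy as np

def shibbs_set(y):
    """The n SHIBBS cumulant matrices Q_Y(e_k e_k^T); entry (i, j) of the kth one is Cum(Y_i, Y_j, Y_k, Y_k)."""
    n, T = y.shape
    yc = y - y.mean(axis=1, keepdims=True)
    R = yc @ yc.T / T
    Qs = []
    for k in range(n):
        M = np.zeros((n, n))
        M[k, k] = 1.0                                 # M = e_k e_k^T
        w = yc[k] ** 2                                # X^T M X reduces to Y_k^2
        E = (yc * w) @ yc.T / T                       # E{ Y_k^2 Y Y^T }
        Qs.append(E - R * R[k, k] - 2.0 * (R @ M @ R))  # equation 3.16 with M = M^T = e_k e_k^T
    return Qs
```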

Strikingly enough, the small SHIBBS set guarantees a performance identical to JADE when the model holds, for the following reason. Consider the final step of the algorithm, where Y is close to S if it holds that X = AS with S of independent components. Then the cumulant matrices Q_Y(e_pe_q†) are zero for p ≠ q because all the cross-cumulants of Y are zero. Therefore, the only nonzero cumulant matrices used in the maximal set of JADE are those corresponding to e_p = e_q, that is, precisely those included in SHIBBS. Thus the SHIBBS set actually tends to the set of "significant eigenmatrices" exhibited in the previous section. In this sense, SHIBBS implements the original program of JADE, the joint diagonalization of the significant eigenmatrices, but it does so without going through the estimation of the whole cumulant set and through the computation of its eigenmatrices.

Does the SHIBBS algorithm correspond to the optimization of a contrast function? We cannot resort to the equivalence of JADE and SHIBBS because it is established only when the model holds, and we are looking for a statement that does not depend on this fact. Examination of the joint diagonality criterion for the SHIBBS set suggests that the SHIBBS technique solves the problem of optimizing the contrast function φ_SH(Y) defined in equation 3.12. As a matter of fact, the condition for a given Y to be a fixed


point of the SHIBBS algorithm is that for any pair 1 ≤ i < j ≤ n:

\[
\sum_{k} \mathrm{Cum}(Y_i, Y_j, Y_k, Y_k)\,\bigl(\mathrm{Cum}(Y_i, Y_i, Y_k, Y_k) - \mathrm{Cum}(Y_j, Y_j, Y_k, Y_k)\bigr) = 0, \qquad (4.7)
\]

and we can prove that this is also the stationarity condition of φ_SH(Y). We do not include the proofs of these statements, which are purely technical.
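Purely as an illustration (this helper is ours, not part of the algorithm), the left-hand side of equation 4.7 can be checked on sample data with empirical cumulants:

function lhs = shibbs_stationarity(Y, i, j)
% Sketch (ours): evaluate the left-hand side of equation 4.7 from sample
% cumulants, for a pair (i,j) of rows of the n x T zero-mean, whitened array Y.
cum4 = @(a,b,c,d) mean(a.*b.*c.*d) - mean(a.*b)*mean(c.*d) ...
                - mean(a.*c)*mean(b.*d) - mean(a.*d)*mean(b.*c);
n   = size(Y,1);
lhs = 0;
for k = 1:n,
  lhs = lhs + cum4(Y(i,:),Y(j,:),Y(k,:),Y(k,:)) ...
            * ( cum4(Y(i,:),Y(i,:),Y(k,:),Y(k,:)) ...
              - cum4(Y(j,:),Y(j,:),Y(k,:),Y(k,:)) );
end
return

At a fixed point of SHIBBS, this quantity should be close to zero (up to estimation error) for every pair (i, j).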

4.4 Comparing Fourth-Order Orthogonal Contrasts. We have considered two approximations, φ_JADE(Y) and φ_SH(Y), to the minimum marginal entropy/mutual information contrast φ_MI(Y), which are based on fourth-order cumulants and can be optimized by Jacobi techniques. The approximation φ^ME_24(Y) proposed by Comon also belongs to this category. One may wonder about the relative statistical merits of these three approximations. The contrast φ^ME_24(Y) stems from an Edgeworth expansion for approximating φ_ME(Y), which in turn has been shown to derive from the ML principle (see section 2). Since ML estimation offers (asymptotic) optimality properties, one may be tempted to conclude that φ^ME_24(Y) is superior. However, this is not the case, as discussed now.

First, when the ICA model holds, it can be shown that even though φ^ME_24(Y) and φ_JADE(Y) are different criteria, they have the same asymptotic performance when applied to sample statistics (Souloumiac & Cardoso, 1991). This is also true of φ_SH(Y) since we have seen that it is equivalent to JADE in this case (a more rigorous proof is possible, based on equation 4.7, but is not included).

Second, when the ICA model does not hold, the notion of identification accuracy does not make sense anymore, but one would certainly favor an orthogonal contrast reaching its minimum at a point as close as possible to the point where the "true" mutual information φ_ME(Y) is minimized. However, it seems difficult to find a simple contrast (such as those considered here) that would be a good approximation to φ_ME(Y) for any wide class of distributions of X. Note that the ML argument in favor of φ^ME_24(Y) is based on an Edgeworth expansion that is valid for "almost gaussian" distributions, which are the distributions that make ICA very difficult and of dubious significance: in practice, ICA should be restricted to data sets where the components show a significant amount of nongaussianity, in which case the Edgeworth expansions cannot be expected to be accurate.

There is another way than the Edgeworth expansion to arrive at φ^ME_24(Y). Consider cumulant matching: the matching of the cumulants of Y to the corresponding cumulants of a hypothetical vector S with independent components. The orthogonal contrast functions φ_JADE(Y), φ_SH(Y), and φ^ME_24(Y) can be seen as matching criteria because they penalize the deviation of the cross-cumulants of Y from zero (which is indeed the value of the cross-cumulants of a vector S with independent components), and they do so under


the constraint that Y is white, that is, by enforcing an exact match of the second-order cumulants of Y.

It is possible to devise an asymptotically optimal matching criterion by taking into account the variability of the sample estimates of the cumulants. Such a computation is reported in Cardoso et al. (1996) for the matching of all second- and fourth-order cumulants of complex-valued signals, but a similar computation is possible for real-valued problems. It shows that the optimal weighting of the cross-cumulants depends on the distributions of the components, so that the "flat weighting" of all the cross-cumulants, as in equation 3.10, is not the best one in general. However, in the limit of "almost gaussian" signals, the optimal weights tend to values corresponding precisely to the contrast K_24(Y|S) defined in equation 3.5. This is not unexpected and confirms that the crude cumulant expansion used in deriving equation 3.5 is sensible, though not optimal for significantly nongaussian components.

It seems from the definitions of φ_JADE(Y), φ_SH(Y), and φ^ME_24(Y) that these different contrasts involve different types of cumulants. This is, however, an illusion because the compact definitions given above do not take into account the symmetries of the cumulants: the same cross-cumulant may be counted several times in each of these contrasts. For instance, the definition of JADE excludes the cross-cumulant Cum(Y_1, Y_1, Y_2, Y_3) but includes the cross-cumulant Cum(Y_1, Y_2, Y_1, Y_3), which is identical. Thus, in order to determine if any bit of fourth-order information is ignored by any particular contrast, a nonredundant description should be given. All the possible cross-cumulants come in four different patterns of indices: (ijkl), (iikl), (iijj), and (ijjj). Nonredundant expressions in terms of these patterns are in the form

\[
\phi[Y] = C_a \sum_{i<j<k<l} e_{ijkl}
        + C_b \sum_{i<k<l} \bigl(e_{iikl} + e_{kkil} + e_{llik}\bigr)
        + C_c \sum_{i<j} e_{iijj}
        + C_d \sum_{i<j} \bigl(e_{ijjj} + e_{jjji}\bigr),
\]

where, by definition, e_{ijkl} = Cum(Y_i, Y_j, Y_k, Y_l)^2 and the C's are numerical constants. It remains to count how many times a unique cumulant appears in the redundant definitions of the three approximations to mutual information considered so far. We give only the result of this uninspiring task in Table 1, which shows that all the cross-cumulants are actually included in the three contrasts, which therefore differ only by the different scalar weights given to each particular type. It means that the three contrasts essentially do the same thing. In particular, when the number n of components is large enough, the number of cross-cumulants of type [ijkl] (all indices distinct) grows as O(n^4), while the number of other types grows as O(n^3) at most. Therefore, the [ijkl] type outnumbers all the other types for large n: one may conjecture the equivalence of the three contrasts in this limit. Unfortunately, it seems difficult to draw more conclusions.


Table 1: Number of Times a Cross-Cumulant of a Given Type Appears in a Given Contrast.

Constants      C_a     C_b     C_c     C_d
Pattern        ijkl    iikl    iijj    ijjj

Comon ICA       24      12      6       4
JADE            24      10      4       2
SHIBBS          24      12      4       4

For instance, we have mentioned the asymptotic equivalence between Comon's contrast and the JADE contrast for any n, but it does not reveal itself directly in the weight table.
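For completeness (this count is ours, not part of the original table), the number of distinct cross-cumulants of each pattern can be written explicitly, which makes the growth rates quoted above concrete:

\[
\#(ijkl) = \binom{n}{4} = O(n^4), \quad
\#(iikl) = n\binom{n-1}{2} = O(n^3), \quad
\#(iijj) = \binom{n}{2} = O(n^2), \quad
\#(ijjj) = n(n-1) = O(n^2),
\]

so the [ijkl] cross-cumulants indeed dominate for large n.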

5 A Comparison on Biomedical Data

The performance of the algorithms presented above is illustrated using the averaged event-related potential (ERP) data recorded and processed by Makeig and coworkers. A detailed account of their analysis is in Makeig, Bell, Jung, and Sejnowski (1997). For our comparison, we use the data set and the "logistic ICA" algorithm provided with version 3.1 of Makeig's ICA toolbox.³ The data set contains 624 data points of averaged ERP sampled from 14 EEG electrodes. The implementation of the logistic ICA provided in the toolbox is somewhat intermediate between equation 1.1 and its off-line counterpart: H(Y) is averaged through subblocks of the data set. The nonlinear function is taken to be ψ(y) = 2/(1 + e^{-y}) − 1 = tanh(y/2). This is minus the log-derivative ψ(y) = −r′(y)/r(y) of the density r(y) = β/cosh(y/2) (β is a normalization constant). Therefore, this method maximizes over A the likelihood of model X = AS under the assumption that S has independent components with densities equal to β/cosh(y/2).
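For reference, one off-line relative-gradient sweep with this nonlinearity can be sketched as follows; this is our generic rendering of the scheme of equation 1.1 with ψ(y) = tanh(y/2) and the estimating field H(y) = ψ(y)y† − I, not the exact code of Makeig's toolbox (which averages over subblocks and adapts the learning rate).

% Sketch (ours) of one off-line update of the form of equation 1.1 with the
% logistic score psi(y) = tanh(y/2). X is the n x T data, B the current
% separating matrix, mu a small positive step size.
Y = B*X;                                          % current components
H = (tanh(Y/2)*Y')/size(X,2) - eye(size(B,1));    % average field E[psi(Y)Y'] - I
B = (eye(size(B,1)) - mu*H)*B;                    % multiplicative update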

Figure 1 shows the components Y_JADE produced by JADE (first column) and the components Y_LICA produced by the logistic ICA included in Makeig's toolbox, which was run with all the default options; the third column shows the difference between the components at the same scale. This direct comparison is made possible with the following postprocessing: the components Y_LICA were normalized to have unit variance and were sorted by increasing values of kurtosis. The components Y_JADE have unit variance by construction; they were sorted and their signs were changed to match Y_LICA. Figure 1 shows that Y_JADE and Y_LICA essentially agree on 9 of 14 components.
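The postprocessing just described is simple enough to sketch directly (function and variable names are ours):

function [Ylica, Yjade] = align_components(Ylica, Yjade)
% Sketch (ours) of the postprocessing described above: normalize Ylica to
% unit variance, sort both n x T sets by increasing kurtosis, and flip the
% signs of Yjade to match Ylica. Rows are assumed zero-mean.
T     = size(Ylica,2);
Ylica = Ylica ./ (sqrt(mean(Ylica.^2,2)) * ones(1,T));   % unit-variance rows
kurt  = @(Z) mean(Z.^4,2) - 3*mean(Z.^2,2).^2;           % sample kurtosis per row
[k1, i1] = sort(kurt(Ylica));  Ylica = Ylica(i1,:);
[k2, i2] = sort(kurt(Yjade));  Yjade = Yjade(i2,:);
s = sign(sum(Yjade.*Ylica, 2));  s(s==0) = 1;            % correlation signs
Yjade = (s*ones(1,T)) .* Yjade;
return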

³ Available from http://www.cnl.salk.edu/∼scott/.


Figure 1: The source signals estimated by JADE and the logistic ICA and their differences (panels, left to right: JADE, Logistic ICA, Difference).


Another illustration of this fact is given by the first row of Figure 2. The left panel shows the magnitude |C_ij| of the entries of the transfer matrix C such that C Y_LICA = Y_JADE. This matrix was computed after the postprocessing of the components described in the previous paragraph: it should be the identity matrix if the two methods agreed, even only up to scales, signs, and permutations. The figure shows a strong diagonal structure in the northeast block, while the disagreement between the two methods is apparent in the gray zone of the southwest block. The right panel shows the kurtosis k(Y_i^JADE) plotted against the kurtosis k(Y_i^LICA). A key observation is that the two methods do agree about the most kurtic components; these also are the components where the time structure is the most visible. In other words, the two methods essentially agree wherever a human eye finds the most visible structures. Figure 2 also shows the results of SHIBBS and MaxKurt. The transfer matrix C for MaxKurt is seen to be more diagonal than the transfer matrix for JADE, while the transfer for SHIBBS is less diagonal. Thus, the logistic ICA and MaxKurt agree more on this data set. Another figure (not included) shows that JADE and SHIBBS are in very close agreement over all components.
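The matrix C can be obtained, for instance, by a least-squares fit of one set of (already aligned) components onto the other; a minimal sketch, assuming both Yjade and Ylica are n × T arrays:

% Sketch (ours): least-squares transfer matrix C such that C*Ylica ~ Yjade,
% then display |Cij| as in the left column of Figure 2.
C = Yjade / Ylica;        % solves C*Ylica = Yjade in the least-squares sense
imagesc(abs(C)); colormap(gray); colorbar;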

These results are very encouraging because they show that various ICA algorithms agree wherever they find structure on this particular data set. This is very much in support of the ICA approach to the processing of signals for which it is not clear that the model holds. It leaves open the question of interpreting the disagreement between the various contrast functions in the swamp of the low kurtosis domain.

It turns out that the disagreement between the methods on this data set is, in our view, an illusion. Consider the eigenvalues λ_1, . . . , λ_n of the covariance matrix R_X of the observations. They are plotted on a dB scale (that is, 10 log10 λ_i) in Figure 3. The two least significant eigenvalues stand rather clearly below the strongest ones, with a gap of 5.5 dB. We take this as an indication that one should look for 12 linear components in this data set rather than 14, as in the previous experiments. The result is rather striking: by running JADE and the logistic ICA on the first 12 principal components, an excellent agreement is found over all the 12 extracted components, as seen in Figure 4. This observation also holds for MaxKurt and SHIBBS, as shown by Figure 5.
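A minimal sketch of this dimension reduction (ours, not the toolbox code): examine the eigenvalue spectrum of the sample covariance on a dB scale and keep the 12 leading principal components before running the ICA algorithms.

% Sketch (ours): reduce the 14-channel data X (n x T, rows zero-mean) to
% its first 12 principal components before ICA.
[n, T]     = size(X);
[U, D]     = eig(X*X'/T);                % sample covariance and its eigenvectors
[lam, idx] = sort(diag(D), 'descend');
plot(10*log10(lam), 'o');                % eigenvalue spectrum in dB, cf. Figure 3
m  = 12;                                 % number of retained components
Xr = U(:, idx(1:m))' * X;                % m x T reduced data, fed to JADE, etc.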

Table 2 lists the number of floating-point operations (as returned by Matlab) and the CPU time required to run the four algorithms on a SPARC 2 workstation. The MaxKurt technique was clearly the fastest here; however, it was applicable only because we were looking for components with positive kurtosis. The same is true for the version of logistic ICA considered in this experiment. It is not true of JADE or SHIBBS, which are consistent as soon as at most one source has a vanishing kurtosis, regardless of the sign of the nonzero kurtosis (Cardoso & Souloumiac, 1993). The logistic ICA required only about 50% more time than JADE.


Figure 2: (Left column) Absolute values of the coefficients |C_ij| of a matrix relating the signals obtained by two different methods. A perfect agreement would be for C = I: deviation from diagonal indicates a disagreement. The signals are sorted by kurtosis, showing a good agreement for high kurtosis. (Right column) Comparing the kurtosis of the sources estimated by two different methods. From top to bottom: JADE versus logistic ICA, SHIBBS versus logistic ICA, and MaxKurt versus logistic ICA.


Figure 3: Eigenvalues of the covariance matrix R_X of the data in dB (i.e., 10 log10(λ_i)).

Table 2: Number of Floating-Point Operations and CPU Time.

                    14 Components              12 Components
Method           Flops       CPU Secs.      Flops       CPU Secs.
Logistic ICA     5.05e+07    3.98           3.51e+07    3.54
JADE             4.00e+07    2.55           2.19e+07    1.69
SHIBBS           5.61e+07    4.92           2.47e+07    2.35
MaxKurt          1.19e+07    1.09           5.91e+06    0.54

The SHIBBS algorithm is slower than JADE here because the data set is not large enough to give it an edge. These remarks are even more marked when comparing the figures obtained in the extraction of 12 components. It should be clear that these figures do not prove much because they are representative of only a particular data set and of particular implementations of the algorithms, as well as of the various parameters used for tuning the algorithms. However, they do disprove the claim that algebraic-cumulant methods are of no practical value.

6 Summary and Conclusions

The definitions of classic entropic contrasts for ICA can all be understood from an ML perspective. An approximation of the Kullback-Leibler divergence yields cumulant-based approximations of these contrasts.


Figure 4: The 12 source signals estimated by JADE and a logistic ICA out of the first 12 principal components of the original data (panels, left to right: JADE, Logistic ICA, Difference).


Figure 5: Same setting as for Figure 2, but the processing is restricted to the first 12 principal components, showing a better agreement among all the methods.


In the orthogonal approach to ICA, where decorrelation is enforced, the cumulant-based contrasts can be optimized with Jacobi techniques, operating on either the data or statistics of the data, namely, cumulant matrices. The structure of the cumulants in the ICA model can be easily exploited by algebraic identification techniques, but the simple versions of these techniques are not equivariant. One possibility for overcoming this problem is to exploit the joint algebraic structure of several cumulant matrices. In particular, the JADE algorithm bridges the gap between contrast-based approaches and algebraic techniques because the JADE objective is both a contrast function and the expression of the eigenstructure of the cumulants. More generally, the algebraic nature of the cumulants can be exploited to ease the optimization of cumulant-based contrast functions by Jacobi techniques. This can be done in a data-based or a statistic-based mode. The latter has an increasing relative advantage as the number of available samples increases, but it becomes impractical for large numbers n of components since the number of fourth-order cumulants grows as O(n^4). This can be overcome to a certain extent by resorting to SHIBBS, which iteratively recomputes a number O(n^3) of cumulants.

An important objective of this article was to combat the prejudice that cumulant-based algebraic methods are impractical. We have shown that they compare very well to state-of-the-art implementations of adaptive techniques on a real data set.

More extensive comparisons remain to be done, involving variants of the ideas presented here. A technique like JADE is likely to choke on a very large number of components, but the SHIBBS version is not as memory demanding. Similarly, the MaxKurt method can be extended to deal with components with mixed kurtosis signs. In this respect, it is worth underlining the analogy between the MaxKurt update and the relative gradient update, equation 1.1, when function H(·) is in the form of equation 1.5.

A comment on tuning the algorithms: in order to code an all-purpose ICA algorithm based on gradient descent, it is necessary to devise a smart learning schedule. This is usually based on heuristics and requires the tuning of some parameters. In contrast, Jacobi algorithms do not need to be tuned in their basic versions. However, one may think of improving on the regular Jacobi sweep through all the pairs in prespecified order by devising more sophisticated updating schedules. Heuristics would be needed then, as in the case of gradient descent methods.

We conclude with a negative point about the fourth-order techniques described in this article. By nature, they optimize contrasts corresponding somehow to using linear-cubic nonlinear functions in gradient-based algorithms. Therefore, they lack the flexibility of adapting the activation functions to the distributions of the underlying components as one would ideally do and as is possible in algorithms like equation 1.1. Even worse, this very type of nonlinear function (linear-cubic) has one major drawback: potential sensitivity to outliers. This effect did not manifest itself in the examples presented in this article, but it could indeed show up in other data sets.

Appendix: Derivation and Implementation of MaxKurt

A.1 Givens Angles for MaxKurt. An explicit form of the MaxKurt contrast as a function of the Givens angles is derived. For conciseness, we denote [ijkl] = Q^Y_{ijkl} and we define

\[
a_{ij} = \frac{[iiii] + [jjjj] - 6[iijj]}{4}, \qquad
b_{ij} = [iiij] - [jjji], \qquad
\lambda_{ij} = \sqrt{a_{ij}^2 + b_{ij}^2}. \qquad (A.1)
\]

The sum of the kurtoses of the pair of variables Y_i and Y_j after they have been rotated by an angle θ depends on θ as follows, where we set c = cos(θ) and s = sin(θ), and where \overset{c}{=} denotes equality up to an additive constant that does not depend on θ:

\[
\begin{aligned}
k(\cos(\theta)Y_i + \sin(\theta)Y_j) + k(-\sin(\theta)Y_i + \cos(\theta)Y_j) \qquad & (A.2) \\
= c^4[iiii] + 4c^3 s[iiij] + 6c^2 s^2[iijj] + 4cs^3[ijjj] + s^4[jjjj] \qquad & (A.3) \\
\; + s^4[iiii] - 4s^3 c[iiij] + 6s^2 c^2[iijj] - 4sc^3[ijjj] + c^4[jjjj] \qquad & (A.4) \\
= (c^4 + s^4)\bigl([iiii] + [jjjj]\bigr) + 12c^2 s^2[iijj] + 4cs(c^2 - s^2)\bigl([iiij] - [jjji]\bigr) \qquad & (A.5) \\
\overset{c}{=} -8c^2 s^2\,\frac{[iiii] + [jjjj] - 6[iijj]}{4} + 4cs(c^2 - s^2)\bigl([iiij] - [jjji]\bigr) \qquad & (A.6) \\
= -2\sin^2(2\theta)\,a_{ij} + 2\sin(2\theta)\cos(2\theta)\,b_{ij}
  \;\overset{c}{=}\; \cos(4\theta)\,a_{ij} + \sin(4\theta)\,b_{ij} \qquad & (A.7) \\
= \lambda_{ij}\bigl(\cos(4\theta)\cos(4\Omega_{ij}) + \sin(4\theta)\sin(4\Omega_{ij})\bigr)
  = \lambda_{ij}\cos\bigl(4(\theta - \Omega_{ij})\bigr), \qquad & (A.8)
\end{aligned}
\]

where the angle 4Ω_ij is defined by

\[
\cos(4\Omega_{ij}) = \frac{a_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}, \qquad
\sin(4\Omega_{ij}) = \frac{b_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}. \qquad (A.9)
\]

This is obtained by using the multilinearity and the symmetries of the cumulants at lines A.3 and A.4, followed by elementary trigonometry. In particular, on the pair (i, j), the sum of kurtoses is maximized by taking θ = Ω_ij, which is the Givens angle used in the implementation below.

If Y_i and Y_j are zero-mean and sphered, EY_iY_j = δ_ij, we have [iiii] = Q^Y_{iiii} = EY_i^4 − 3(EY_i^2)^2 = EY_i^4 − 3 and, for i ≠ j, [iiij] = Q^Y_{iiij} = EY_i^3 Y_j as well as [iijj] = Q^Y_{iijj} = EY_i^2 Y_j^2 − 1. Hence an alternate expression for a_ij and b_ij is:

\[
a_{ij} = \frac{1}{4}\,E\bigl(Y_i^4 + Y_j^4 - 6Y_i^2 Y_j^2\bigr), \qquad
b_{ij} = E\bigl(Y_i^3 Y_j - Y_i Y_j^3\bigr). \qquad (A.10)
\]

It may be interesting to note that all the moments required to determine the Givens angle for a given pair (i, j) can be expressed in terms of the two variables ξ_ij = Y_iY_j and η_ij = Y_i^2 − Y_j^2. Indeed, it is easily checked that for a zero-mean sphered pair (Y_i, Y_j), one has

\[
a_{ij} = \frac{1}{4}\,E\bigl(\eta_{ij}^2 - 4\xi_{ij}^2\bigr), \qquad
b_{ij} = E\bigl(\eta_{ij}\,\xi_{ij}\bigr). \qquad (A.11)
\]

A.2 A Simple Matlab Implementation of MaxKurt. A Matlab implementation could be as follows, where we have tried to maximize readability but not numerical efficiency:

function Y = maxkurt(X)
%
[n T] = size(X) ;
Y = X - mean(X,2)*ones(1,T);            % Remove the mean
Y = inv(sqrtm(X*X'/T))*Y ;              % Sphere the data
encore = 1 ;                            % Go for first sweep
while encore, encore=0;
  for p=1:n-1,                          % These two loops go
    for q=p+1:n,                        % through all pairs
      xi  = Y(p,:).*Y(q,:);
      eta = Y(p,:).*Y(p,:) - Y(q,:).*Y(q,:);
      Omega = atan2( 4*(eta*xi'), eta*eta' - 4*(xi*xi') );
      if abs(Omega) > 0.1/sqrt(T)       % A 'statistically small' angle
        encore = 1 ;                    % This will not be the last sweep
        c = cos(Omega/4);
        s = sin(Omega/4);
        Y([p q],:) = [ c s ; -s c ] * Y([p q],:) ;  % Plane rotation
      end
    end
  end
end
return
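As a usage note (ours): for an n × T data array X, the call Y = maxkurt(X) returns sphered components rotated so as to maximize the sum of their kurtoses; as pointed out in section 5, this is appropriate only when the components being sought have positive kurtosis. The test against 0.1/sqrt(T) simply stops the sweeps once every remaining rotation angle is statistically small.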

Acknowledgments

We express our thanks to Scott Makeig and to his coworkers for releasing to the public domain some of their codes and the ERP data set. We also thank the anonymous reviewers whose constructive comments greatly helped in revising the first version of this article.


References

Amari, S.-I. (1996). Neural learning in structured parameter spaces: Natural Riemannian gradient. In Proc. NIPS.

Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1004–1034.

Cardoso, J.-F. (1992). Fourth-order cumulant structure forcing. Application to blind array processing. In Proc. 6th SSAP workshop on statistical signal and array processing (pp. 136–139).

Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc. EUSIPCO (pp. 776–779). Edinburgh.

Cardoso, J.-F. (1995). The equivariant approach to source separation. In Proc. NOLTA (pp. 55–60).

Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proc. of the IEEE. Special issue on blind identification and estimation.

Cardoso, J.-F., Bose, S., & Friedlander, B. (1996). On optimal source separation based on second and fourth order cumulants. In Proc. IEEE Workshop on SSAP. Corfu, Greece.

Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Sig. Proc., 44, 3017–3030.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 140, 362–370.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.

Comon, P. (1997). Cumulant tensors. In Proc. IEEE SP Workshop on HOS. Banff, Canada.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

De Lathauwer, L., De Moor, B., & Vandewalle, J. (1996). Independent component analysis based on higher-order statistics only. In Proc. IEEE SSAP Workshop (pp. 356–359).

Gaeta, M., & Lacoume, J. L. (1990). Source separation without a priori knowledge: The maximum likelihood solution. In Proc. EUSIPCO (pp. 621–624).

Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins University Press.

MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript.

Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Nat. Acad. Sci. USA, 94, 10979–10984.

McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.

Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 19–46.

Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP'97 (pp. 3617–3620).

Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. NETWORK, 5, 565–581.

Obradovic, D., & Deco, G. (1997). Unsupervised learning for blind source separation: An information-theoretic approach. In Proc. ICASSP (pp. 127–130).

Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (Hong Kong).

Pham, D.-T. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. on Sig. Proc., 44, 2768–2779.

Pham, D.-T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Tr. SP, 45, 1712–1725.

Souloumiac, A., & Cardoso, J.-F. (1991). Comparaison de méthodes de séparation de sources. In Proc. GRETSI, Juan les Pins, France (pp. 661–664).

Received February 7, 1998; accepted June 25, 1998.