


LETTER   Communicated by Anthony Bell

High-Order Contrasts for Independent Component Analysis

Jean-François Cardoso
École Nationale Supérieure des Télécommunications, 75634 Paris Cedex 13, France

Neural Computation 11, 157–192 (1999)

This article considers high-order measures of independence for the independent component analysis problem and discusses the class of Jacobi algorithms for their optimization. Several implementations are discussed. We compare the proposed approaches with gradient-based techniques from the algorithmic point of view and also on a set of biomedical data.

1 Introduction

Given an n × 1 random vector X, independent component analysis (ICA) consists of finding a basis of R^n on which the coefficients of X are as independent as possible (in some appropriate sense). The change of basis can be represented by an n × n matrix B and the new coefficients given by the entries of vector Y = BX. When the observation vector X is modeled as a linear superposition of source signals, matrix B is understood as a separating matrix, and vector Y = BX is a vector of source signals. Two key issues of ICA are the definition of a measure of independence and the design of algorithms to find the change of basis (or separating matrix) B optimizing this measure.

Many recent contributions to the ICA problem in the neural network literature describe stochastic gradient algorithms involving, as an essential device in their learning rule, a nonlinear activation function. Other ideas for ICA, most of them found in the signal processing literature, exploit the algebraic structure of high-order moments of the observations. They are often regarded as being unreliable, inaccurate, slowly convergent, and utterly sensitive to outliers. As a matter of fact, it is fairly easy to devise an ICA method displaying all these flaws and working only on carefully generated synthetic data sets. This may be the reason that cumulant-based algebraic methods are largely ignored by the researchers of the neural network community involved in ICA. This article tries to correct this view by showing how high-order correlations can be efficiently exploited to reveal independent components.

This article describes several ICA algorithms that may be called Jacobi algorithms because they seek to maximize measures of independence by a technique akin to the Jacobi method of diagonalization. These measures of independence are based on fourth-order correlations between the entries of Y. As a benefit, these algorithms evade the curse of gradient descent:


they can move in macroscopic steps through the parameter space. They also have other benefits and drawbacks, which are discussed in the article and summarized in a final section. Before outlining the content of this article, we briefly review some gradient-based ICA methods and the notion of contrast function.

1.1 Gradient Techniques for ICA. Many online solutions for ICA that have been proposed recently have the merit of a simple implementation. Among these adaptive procedures, a specific class can be singled out: algorithms based on a multiplicative update of an estimate B(t) of B. These algorithms update a separating matrix B(t) on reception of a new sample x(t) according to the learning rule

y(t) = B(t) x(t),   B(t+1) = (I − μ_t H(y(t))) B(t),   (1.1)

where I denotes the n × n identity matrix, {μ_t} is a scalar sequence of positive learning steps, and H: R^n → R^{n×n} is a vector-to-matrix function. The stationary points of such algorithms are characterized by the condition that the update has zero mean, that is, by the condition,

EH(Y) = 0. (1.2)

The online scheme, in equation 1.1, can be (and often is) implemented in an off-line manner. Using T samples X(1), . . . , X(T), one goes through the following iterations, where the field H is averaged over all the data points:

1. Initialization. Set y(t) = x(t) for t = 1, . . . , T.

2. Estimate the average field: H̄ = (1/T) Σ_{t=1}^T H(y(t)).

3. Update. If H̄ is small enough, stop; else update each data point y(t) by y(t) ← (I − μH̄) y(t) and go to 2.

The algorithm stops for an (arbitrarily) small value of the average field: it solves the estimating equation,

(1/T) Σ_{t=1}^T H(y(t)) = 0,   (1.3)

which is the sample counterpart of the stationarity condition in equation 1.2.

Both the online and off-line schemes are gradient algorithms: the mapping H(·) can be obtained as the gradient (the relative gradient [Cardoso & Laheld, 1996] or Amari's natural gradient [1996]) of some contrast function, that is, a real-valued measure of how far the distribution of Y is from some ideal distribution, typically a distribution of independent components. In


particular, the gradient of the infomax/maximum likelihood (ML) contrast yields a function H(·) of the form

H(y) = ψ(y)y† − I, (1.4)

where ψ(y) is an n × 1 vector of component-wise nonlinear functions with ψ_i(·) taken to be minus the log derivative of the density of the ith component (see Amari, Cichocki, & Yang, 1996, for the online version and Pham & Garat, 1997, for a batch technique).
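As an illustration, the off-line iteration of equations 1.1 through 1.4 can be sketched in a few lines of NumPy. The nonlinearity ψ(y) = tanh(y) used below is an arbitrary stand-in for minus the log derivative of a hypothesized (super-gaussian) component density; it is not prescribed by the text.

```python
import numpy as np

def relative_gradient_ica(x, mu=0.1, tol=1e-6, max_iter=500):
    """Off-line relative-gradient ICA iteration (equations 1.1-1.3).

    x : (n, T) array, one column per sample x(t).
    psi(y) = tanh(y) is an illustrative choice of nonlinearity.
    """
    y = x.copy()
    n, T = y.shape
    for _ in range(max_iter):
        psi = np.tanh(y)
        # average field: H_bar = (1/T) sum_t [ psi(y(t)) y(t)^T - I ]   (equation 1.4)
        H_bar = psi @ y.T / T - np.eye(n)
        if np.abs(H_bar).max() < tol:        # estimating equation 1.3 approximately solved
            break
        y = (np.eye(n) - mu * H_bar) @ y     # multiplicative update of the data, as in equation 1.1
    return y
```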

1.2 The Orthogonal Approach to ICA. In the search for independent components, one may decide, as in principal component analysis (PCA), to request exact decorrelation (second-order independence) of the components: matrix B should be such that Y = BX is "spatially white," that is, its covariance matrix is the identity matrix. The algorithms described in this article take this design option, which we call the orthogonal approach.

It must be stressed that components that are as independent as possible according to some measure of independence are not necessarily uncorrelated because exact independence cannot be achieved in most practical applications. Thus, if decorrelation is desired, it must be enforced explicitly; the algorithms described below optimize, under the whiteness constraint, approximations of the mutual information and of other contrast functions (possibly designed to take advantage of the whiteness constraint).

One practical reason for considering the orthogonal approach is that off-line contrast optimization may be simplified by a two-step procedure as follows. First, a whitening (or "sphering") matrix W is computed and applied to the data. Since the new data are spatially white and one is also looking for a white vector Y, the latter can be obtained only by an orthonormal transformation V of the whitened data because only orthonormal transforms can preserve the whiteness. Thus, in such a scheme, the separating matrix B is found as a product B = VW. This approach leads to interesting implementations because the whitening matrix can be obtained straightforwardly as any matrix square root of the inverse covariance matrix of X and the optimization of a contrast function with respect to an orthonormal matrix can also be implemented efficiently by the Jacobi technique described in section 4.
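A minimal sketch of the first (sphering) step is given below; the inverse symmetric square root of the sample covariance is used as W, which is one of the matrix square roots mentioned above.

```python
import numpy as np

def whiten(x):
    """Sphering step of the orthogonal approach: returns (z, W) with Z = WX spatially white."""
    xc = x - x.mean(axis=1, keepdims=True)
    R = xc @ xc.T / xc.shape[1]             # sample covariance of X
    d, E = np.linalg.eigh(R)                # R = E diag(d) E^T
    W = E @ np.diag(d ** -0.5) @ E.T        # W = R^(-1/2), a valid whitening matrix
    return W @ xc, W
```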

The orthonormal approach to ICA need not be implemented as a two-stage Jacobi-based procedure; it also exists as a one-stage gradient algorithm (see also Cardoso & Laheld, 1996). Assume that the relative/natural gradient of some contrast function leads to a particular function H(·) for the update rule, equation 1.1, with stationary points given by equation 1.2. Then the stationary points for the optimization of the same contrast function with respect to orthonormal transformations are characterized by E{H(Y) − H(Y)†} = 0, where the superscript † denotes transposition. On the other hand, for zero-mean variables, the whiteness constraint is EYY† = I, which we can also


write as EYY† − I = 0. Because EYY† − I is a symmetric matrix while E{H(Y) − H(Y)†} is a skew-symmetric matrix, the whiteness condition and the stationarity condition can be combined into a single one by just adding them. The resulting condition is E{YY† − I + H(Y) − H(Y)†} = 0. When it holds true, both the symmetric part and the skew-symmetric part cancel; the former expresses that Y is white, the latter that the contrast function is stationary with respect to all orthonormal transformations.

Thus, if the algorithm in equation 1.1 optimizes a given contrast function with H given by equation 1.4, then the same algorithm optimizes the same contrast function under the whiteness constraint with H given by

H(y) = yy† − I + ψ(y)y† − yψ(y)†. (1.5)

It is thus simple to implement orthogonal versions of gradient algorithms once a regular version is available.

1.3 Data-Based Versus Statistic-Based Techniques. Comon (1994) compares the data-based option and the statistic-based option for computing off-line an ICA of a batch x(1), . . . , x(T) of T samples; this article will also introduce a mixed strategy (see section 4.3). In the data-based option, successive linear transformations are applied to the data set until some criterion of independence is maximized. This is the iterative technique outlined above. Note that it is not necessary to update explicitly a separating matrix B in this scheme (although one may decide to do so in a particular implementation); the data themselves are updated until the average field (1/T) Σ_{t=1}^T H(y(t)) is small enough; the transform B is implicitly contained in the set of transformed data.

Another option is to summarize the data set into a smaller set of statistics computed once and for all from the data set; the algorithm then estimates a separating matrix as a function of these statistics without accessing the data. This option may be followed in cumulant-based algebraic techniques where the statistics are cumulants of X.

1.4 Outline of the Article. In section 2, the ICA problem is recast in the framework of (blind) identification, showing how entropic contrasts readily stem from the maximum likelihood (ML) principle. In section 3, high-order approximations to the entropic contrasts are given, and their algebraic structure is emphasized. Section 4 describes different flavors of Jacobi algorithms optimizing fourth-order contrast functions. A comparison between Jacobi techniques and a gradient-based algorithm is given in section 5 based on a real data set of electroencephalogram (EEG) recordings.


2 Contrast Functions and Maximum Likelihood Identification

Implicitly or explicitly, ICA tries to fit a model for the distribution of X that is a model of independent components: X = AS, where A is an invertible n × n matrix and S is an n × 1 vector with independent entries. Estimating the parameter A from samples of X yields a separating matrix B = A^{-1}. Even if the model X = AS is not expected to hold exactly for many real data sets, one can still use it to derive contrast functions. This section exhibits the contrast functions associated with the estimation of A by the ML principle (a more detailed exposition can be found in Cardoso, 1998). Blind separation based on ML was first considered by Gaeta and Lacoume (1990) (but the authors used cumulant approximations such as those described in section 3), Pham and Garat (1997), and Amari et al. (1996).

2.1 Likelihood. Assume that the probability distribution of each entry S_i of S has a density r_i(·).¹ Then the distribution P_S of the random vector S has a density r(·) of the form r(s) = ∏_{i=1}^n r_i(s_i), and the density of X for a given mixture A and a given probability density r(·) is:

p(x; A, r) = |det A|^{-1} r(A^{-1}x),   (2.1)

so that the (normalized) log-likelihood L_T(A, r) of T independent samples x(1), . . . , x(T) of X is

L_T(A, r) ≝ (1/T) Σ_{t=1}^T log p(x(t); A, r) = (1/T) Σ_{t=1}^T log r(A^{-1}x(t)) − log |det A|.   (2.2)
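For concreteness, the normalized log-likelihood of equation 2.2 can be evaluated as follows; the component densities r_i(s) = 1/(π cosh s) used here are an illustrative choice, not one made in the text.

```python
import numpy as np

def normalized_log_likelihood(A, x):
    """L_T(A, r) of equation 2.2 for an (n, T) sample array x and a candidate mixture A.

    The model densities are fixed here to r_i(s) = 1/(pi*cosh(s)) purely for illustration.
    """
    s = np.linalg.solve(A, x)                          # s(t) = A^{-1} x(t)
    log_r = -np.log(np.cosh(s)) - np.log(np.pi)        # log r_i(s_i(t))
    return log_r.sum(axis=0).mean() - np.log(abs(np.linalg.det(A)))
```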

Depending on the assumptions made about the densities r_1, . . . , r_n, several contrast functions can be derived from this log-likelihood.

2.2 Likelihood Contrast. Under mild assumptions, the normalized log-likelihood L_T(A, r), which is a sample average, converges for large T to its ensemble average by the law of large numbers:

L_T(A, r) = (1/T) Σ_{t=1}^T log r(A^{-1}x(t)) − log |det A|  →_{T→∞}  E log r(A^{-1}x) − log |det A|,   (2.3)

¹ All densities considered in this article are with respect to the Lebesgue measure on R or R^n.


which simple manipulations (Cardoso, 1997) show to be equal to −H(P_X) − K(P_Y|P_S). Here and in the following, H(·) and K(·|·), respectively, denote the differential entropy and the Kullback-Leibler divergence. Since H(P_X) does not depend on the model parameters, the limit for large T of −L_T(A, r) is, up to a constant, equal to

φ_ML(Y) ≝ K(P_Y|P_S).   (2.4)

Therefore, the principle of ML coincides with the minimization of a specific contrast function, which is nothing but the (Kullback) divergence K(P_Y|P_S) between the distribution P_Y of the output and a model distribution P_S. The classic entropic contrasts follow from this observation, depending on two options: (1) trying or not to estimate P_S from the data and (2) forcing or not the components to be uncorrelated.

2.3 Infomax. The technically simplest statistical assumption about P_S is to select fixed densities r_1, . . . , r_n for each component, possibly on the basis of prior knowledge. Then P_S is a fixed distributional assumption, and the minimization of φ_ML(Y) is performed only over P_Y via Y = BX. This can be rephrased: choose B such that Y = BX is as close as possible in distribution to the hypothesized model distribution P_S, the closeness in distribution being measured in the Kullback divergence. This is also the contrast function derived from the infomax principle by Bell and Sejnowski (1995). The connection between infomax and ML was noted in Cardoso (1997), MacKay (1996), and Pearlmutter and Parra (1996).

2.4 Mutual Information. The theoretically simplest statistical assumption about P_S is to assume no model at all. In this case, the Kullback mismatch K(P_Y|P_S) should be minimized not only by optimizing over B to change the distribution of Y = BX but also with respect to P_S. For each fixed B, that is, for each fixed distribution P_Y, the result of this minimization is theoretically very simple: the minimum is reached when P_S = P_Ȳ, where P_Ȳ denotes the distribution of independent components with each marginal distribution equal to the corresponding marginal distribution of Y. This stems from the property that

K(P_Y|P_S) = K(P_Y|P_Ȳ) + K(P_Ȳ|P_S)   (2.5)

for any distribution P_S with independent components (Cover & Thomas, 1991). Therefore, the minimum in P_S of K(P_Y|P_S) is reached by taking P_S = P_Ȳ since this choice ensures K(P_Ȳ|P_S) = 0. The value of φ_ML at this point then is

φ_MI(Y) ≝ min_{P_S} K(P_Y|P_S) = K(P_Y|P_Ȳ).   (2.6)


We use the index MI since this quantity is well known as the mutual information between the entries of Y. It was first proposed by Comon (1994), and it can be seen from the above as deriving from the ML principle when optimization is with respect to both the unknown system A and the distribution of S. This connection was also noted in Obradovic and Deco (1997), and the relation between infomax and mutual information is also discussed in Nadal and Parga (1994).

2.5 Minimum Marginal Entropy. An orthogonal contrast φ(Y) is, by definition, to be optimized under the constraint that Y is spatially white: orthogonal contrasts enforce decorrelation, that is, an exact "second-order" independence. Any regular contrast can be used under the whiteness constraint, but by taking the whiteness constraint into account, the contrast may be given a simpler expression. This is the case for some cumulant-based contrasts described in section 3. It is also the case for φ_MI(Y) because the mutual information can also be expressed as φ_MI(Y) = Σ_{i=1}^n H(P_{Y_i}) − H(P_Y); since the entropy H(P_Y) is constant under orthonormal transforms, it is equivalent to consider

φ_ME(Y) = Σ_{i=1}^n H(P_{Y_i})   (2.7)

to be optimized under the whiteness constraint EYY† = I. This contrast could be called orthogonal mutual information, or the marginal entropy contrast. The minimum entropy idea holds more generally under any volume-preserving transform (Obradovic & Deco, 1997).

2.6 Empirical Contrast Functions. Among all the above contrasts, only φ_ML or its orthogonal version are easily optimized by a gradient technique because the relative gradient of φ_ML simply is the matrix EH(Y) with H(·) defined in equation 1.4. Therefore, the relative gradient algorithm, equation 1.1, can be employed using either this function H(·) or its symmetrized form, equation 1.5, if one chooses to enforce decorrelation. However, this contrast is based on a prior guess P_S about the distribution of the components. If the guess is too far off, the algorithm will fail to discover independent components that might be present in the data. Unfortunately, evaluating the gradient of contrasts based on mutual information or minimum marginal entropy is more difficult because it does not reduce to the expectation of a simple function of Y; for instance, Pham (1996) minimizes explicitly the mutual information, but the algorithm involves a kernel estimation of the marginal distributions of Y. An intermediate approach is to consider a parametric estimation of these distributions, as in Moulines, Cardoso, and Gassiat (1997) or Pearlmutter and Parra (1996), for instance. Therefore, all these contrasts require that the distributions of components be known, or


approximated or estimated. As we shall see next, this is also what the cumulant approximations to contrast functions are implicitly doing.

3 Cumulants

This section presents higher-order approximations to entropic contrasts, some known and some novel. To keep the exposition simple, it is restricted to symmetric distributions (for which odd-order cumulants are identically zero) and to cumulants of orders 2 and 4. Recall that for random variables X_1, . . . , X_4, second-order cumulants are Cum(X_1, X_2) ≝ E X̃_1 X̃_2, where X̃_i ≝ X_i − EX_i, and the fourth-order cumulants are

Cum(X_1, X_2, X_3, X_4) = E X̃_1X̃_2X̃_3X̃_4 − E X̃_1X̃_2 E X̃_3X̃_4 − E X̃_1X̃_3 E X̃_2X̃_4 − E X̃_1X̃_4 E X̃_2X̃_3.   (3.1)

The variance and the kurtosis of a real random variable X are defined as

σ²(X) ≝ Cum(X, X) = E X̃²,
k(X) ≝ Cum(X, X, X, X) = E X̃⁴ − 3 (E X̃²)²,   (3.2)

that is, they are the second- and fourth-order autocumulants. A cumulant involving at least two different variables is called a cross-cumulant.
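Sample versions of these quantities are straightforward; the following sketch estimates the kurtosis of equation 3.2 and a general fourth-order cumulant of equation 3.1 from T observations.

```python
import numpy as np

def kurtosis(x):
    """Sample autocumulant k(X) = E X~^4 - 3 (E X~^2)^2, equation 3.2 (x: 1-d array of samples)."""
    xc = x - x.mean()
    return np.mean(xc**4) - 3.0 * np.mean(xc**2) ** 2

def cum4(x1, x2, x3, x4):
    """Sample fourth-order cumulant Cum(X1, X2, X3, X4), equation 3.1."""
    c = [xi - xi.mean() for xi in (x1, x2, x3, x4)]
    m = lambda *v: np.mean(np.prod(v, axis=0))        # moment of a product of centered variables
    return (m(*c)
            - m(c[0], c[1]) * m(c[2], c[3])
            - m(c[0], c[2]) * m(c[1], c[3])
            - m(c[0], c[3]) * m(c[1], c[2]))
```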

3.1 Cumulant-Based Approximations to Entropic Contrasts. Cumulants are useful in many ways. In this section, they show up because the probability density of a scalar random variable U close to the standard normal n(u) = (2π)^{-1/2} exp(−u²/2) can be approximated as

p(u) ≈ n(u) ( 1 + (σ²(U) − 1)/2 · h₂(u) + k(U)/4! · h₄(u) ),   (3.3)

where h₂(u) = u² − 1 and h₄(u) = u⁴ − 6u² + 3, respectively, are the second- and fourth-order Hermite polynomials. This expression is obtained by retaining the leading terms in an Edgeworth expansion (McCullagh, 1987). If U and V are two real random variables with distributions close to the standard normal, one can, at least formally, use expansion 3.3 to derive an approximation to K(P_U|P_V). This is

K(P_U|P_V) ≈ (1/4)(σ²(U) − σ²(V))² + (1/48)(k(U) − k(V))²,   (3.4)

which shows how the pair (σ², k) of cumulants of order 2 and 4 plays in some sense the role of a local coordinate system around n(u), with the quadratic


form 3.4 playing the role of a local metric. This result generalizes to multivariates, in which case we denote for conciseness R^U_{ij} = Cum(U_i, U_j) and Q^U_{ijkl} = Cum(U_i, U_j, U_k, U_l), and similarly for another random n-vector V with entries V_1, . . . , V_n. We give without proof the following approximation:

K(P_U|P_V) ≈ K_{24}(P_U|P_V) ≝ (1/4) Σ_{ij} (R^U_{ij} − R^V_{ij})² + (1/48) Σ_{ijkl} (Q^U_{ijkl} − Q^V_{ijkl})².   (3.5)

Expression 3.5 turns out to be the simplest possible multivariate generalization of equation 3.4 (the two terms in equation 3.5 are a double sum over all the n² pairs of indices and a quadruple sum over all the n⁴ quadruples of indices). Since the entropic contrasts listed above have all been derived from the Kullback divergence, cumulant approximations to all these contrasts can be obtained by replacing the Kullback mismatch K(P_U|P_V) by a cruder measure: its approximation as a cumulant mismatch, equation 3.5.

3.1.1 Approximation to the Likelihood Contrast. The infomax-ML contrast φ_ML(Y) = K(P_Y|P_S) for ICA (see equation 2.4) is readily approximated by using expression 3.5. The assumption P_S on the distribution of S is now replaced by an assumption about the cumulants of S. This amounts to very little: all the cross-cumulants of S being 0 thanks to the assumption of independent sources, it is needed only to specify the autocumulants σ²(S_i) and k(S_i). The cumulant approximation (see equation 3.5) to the infomax-ML contrast becomes:

φ_ML(Y) ≈ K_{24}(P_Y|P_S) = (1/4) Σ_{ij} (R^Y_{ij} − σ²(S_i)δ_{ij})² + (1/48) Σ_{ijkl} (Q^Y_{ijkl} − k(S_i)δ_{ijkl})²,   (3.6)

where the Kronecker symbol δ equals 1 with identical indices and 0 otherwise.

3.1.2 Approximation to the Mutual Information Contrast. The mutual information contrast φ_MI(Y) was obtained by minimizing K(P_Y|P_S) over all the distributions P_S with independent components. In the cumulant approximation, this is trivially done: the free parameters for P_S are σ²(S_i) and k(S_i). Each of these scalars enters in only one term of the sums in equation 3.6, so that the minimization is achieved for σ²(S_i) = R^Y_{ii} and k(S_i) = Q^Y_{iiii}. In other words, the construction of the best approximating distribution with independent marginals P_Ȳ, which appears in equation 2.5, boils down, in the cumulant approximation, to the estimation of the variance and kurtosis of each entry of Y. Fitting both σ²(S_i) and k(S_i) to R^Y_{ii} and Q^Y_{iiii}, respectively, has the effect of exactly cancelling the diagonal terms in equation 3.6, leaving only

φ_MI(Y) ≈ φ^{MI}_{24}(Y) ≝ (1/4) Σ_{ij≠ii} (R^Y_{ij})² + (1/48) Σ_{ijkl≠iiii} (Q^Y_{ijkl})²,   (3.7)

which is our cumulant approximation to the mutual information contrast in equation 2.6. The first term is understood as the sum over all the pairs of distinct indices; the second term is a sum over all quadruples of indices that are not all identical. It contains only off-diagonal terms, that is, cross-cumulants. Since cross-cumulants of independent variables identically vanish, it is not surprising to see the mutual information approximated by a sum of squared cross-cumulants.
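The approximation 3.7 is easy to evaluate numerically when n is small; the sketch below forms the full fourth-order cumulant tensor of Y (an O(n⁴T) computation, so purely illustrative) and sums the squared off-diagonal terms.

```python
import numpy as np

def phi_mi24(y):
    """Cumulant approximation to mutual information, equation 3.7 (y: (n, T) array)."""
    n, T = y.shape
    yc = y - y.mean(axis=1, keepdims=True)
    R = yc @ yc.T / T                                        # R^Y_ij
    M4 = np.einsum('it,jt,kt,lt->ijkl', yc, yc, yc, yc) / T
    Q = (M4 - np.einsum('ij,kl->ijkl', R, R)
            - np.einsum('ik,jl->ijkl', R, R)
            - np.einsum('il,jk->ijkl', R, R))                # Q^Y_ijkl, equation 3.1
    off2 = (R**2).sum() - (np.diag(R)**2).sum()              # sum over ij != ii
    off4 = (Q**2).sum() - sum(Q[i, i, i, i]**2 for i in range(n))  # sum over ijkl != iiii
    return off2 / 4.0 + off4 / 48.0
```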

3.1.3 Approximation to the Orthogonal Likelihood Contrast. The cumulant approximation to the orthogonal likelihood is fairly simple. The orthogonal approach consists of first enforcing the whiteness of Y, that is, R^Y_{ij} = δ_{ij} or R^Y = I. In other words, it consists of normalizing the components by assuming that σ²(S_i) = 1 and making sure the second-order mismatch is zero. This is equivalent to replacing the weight 1/4 in equation 3.6 by an infinite weight, hence reducing the problem to the minimization (under the whiteness constraint) of the fourth-order mismatch, or the second (quadruple) sum in equation 3.6. Thus, the orthogonal likelihood contrast is approximated by

φ^{OML}_{24}(Y) ≝ (1/48) Σ_{ijkl} (Q^Y_{ijkl} − k(S_i)δ_{ijkl})².   (3.8)

This contrast has an interesting alternate expression. Developing the squares gives

φ^{OML}_{24}(Y) = (1/48) Σ_{ijkl} (Q^Y_{ijkl})² + (1/48) Σ_{ijkl} k²(S_i)δ²_{ijkl} − (2/48) Σ_{ijkl} k(S_i)δ_{ijkl}Q^Y_{ijkl}.

The first sum above is constant under the whiteness constraint (this is readily checked using equation 3.13 for an orthonormal transform), and the second sum does not depend on Y; finally, the last sum contains only diagonal nonzero terms. It follows that:

φ^{OML}_{24}(Y) =c −(1/24) Σ_i k(S_i) Q^Y_{iiii} = −(1/24) Σ_i k(S_i) k(Y_i) =c −(1/24) Σ_i k(S_i) E Y_i⁴,   (3.9)


where =c denotes an equality up to a constant. An interpretation of the second equality is that the contrast is minimized by maximizing the scalar product between the vector [k(Y_1), . . . , k(Y_n)] of the kurtosis of the components and the corresponding vector of hypothesized kurtosis [k(S_1), . . . , k(S_n)]. The last equality stems from the definition in equation 3.2 of the kurtosis and the constancy of E Y_i² under the whiteness constraint. This last form is remarkable because it shows that for zero-mean observations, φ^{OML}_{24}(Y) =c E l(Y), where l(Y) = −(1/24) Σ_i k(S_i) Y_i⁴, so the contrast is just the expectation of a simple function of Y. We can expect simple techniques for its maximization.

3.1.4 Approximation to the Minimum Marginal Entropy Contrast. Under the whiteness constraint, the first sum in the approximation, equation 3.7, is zero (this is the whiteness constraint) so that the approximation to the mutual information φ_MI(Y) reduces to the last term:

φ_ME(Y) ≈ φ^{ME}_{24}(Y) ≝ (1/48) Σ_{ijkl≠iiii} (Q^Y_{ijkl})² =c −(1/48) Σ_i (Q^Y_{iiii})².   (3.10)

Again, the last equality up to a constant follows from the constancy of Σ_{ijkl} (Q^Y_{ijkl})² under the whiteness constraint. These approximations had already been obtained by Comon (1994) from an Edgeworth expansion. They say something simple: Edgeworth expansions suggest testing the independence between the entries of Y by summing up all the squared cross-cumulants.

In the course of this article, we will find two similar contrast functions. The JADE contrast,

φ_JADE(Y) ≝ Σ_{ijkl≠iikl} (Q^Y_{ijkl})²,   (3.11)

also is a sum of squared cross-cumulants (the notation indicates that the sum is over all the quadruples (ijkl) of indices with i ≠ j). Its interest is to be also a criterion of joint diagonality of cumulant matrices. The SHIBBS criterion,

φ_SH(Y) ≝ Σ_{ijkl≠iikk} (Q^Y_{ijkl})²,   (3.12)

is also introduced in section 4.3 as governing a similar but less memory-demanding algorithm. It also involves only cross-cumulants: those with indices (ijkl) such that i ≠ j or k ≠ l.

3.2 Cumulants and Algebraic Structures. Previous sections reviewed the use of cumulants in designing contrast functions. Another thread of ideas using cumulants stems from the method of moments. Such an approach is called for by the multilinearity of the cumulants. Under a linear


transform Y = BX, which also reads Y_i = Σ_p b_{ip} X_p, the cumulants of order 4 (for instance) transform as:

Cum(Y_i, Y_j, Y_k, Y_l) = Σ_{pqrs} b_{ip} b_{jq} b_{kr} b_{ls} Cum(X_p, X_q, X_r, X_s),   (3.13)

which can easily be exploited for our purposes since the ICA model is linear. Using this fact and the assumption of independence by which Cum(S_p, S_q, S_r, S_s) = k(S_p) δ(p, q, r, s), we readily obtain the simple algebraic structure of the cumulants of X = AS when S has independent entries,

Cum(X_i, X_j, X_k, X_l) = Σ_{u=1}^n k(S_u) a_{iu} a_{ju} a_{ku} a_{lu},   (3.14)

where a_{ij} denotes the (ij)th entry of matrix A. When estimates Cum(X_i, X_j, X_k, X_l) are available, one may try to solve equation 3.14 for the coefficients a_{ij} of A. This is tantamount to cumulant matching on the empirical cumulants of X. Because of the strong algebraic structure of equation 3.14, one may try to devise fourth-order factorizations akin to the familiar second-order singular value decomposition (SVD) or eigenvalue decomposition (EVD) (see Cardoso, 1992; Comon, 1997; De Lathauwer, De Moor, & Vandewalle, 1996). However, these approaches are generally not equivalent to the optimization of a contrast function, resulting in estimates that are generally not equivariant (Cardoso, 1995). This point is illustrated below; we introduce cumulant matrices whose simple structure offers straightforward identification techniques, but we stress, as one of their important drawbacks, their lack of equivariance. However, we conclude by showing how the algebraic point of view and the statistical (equivariant) point of view can be reconciled.

3.2.1 Cumulant Matrices. The algebraic nature of cumulants is tensorial (McCullagh, 1987), but since we will concern ourselves mainly with second- and fourth-order statistics, a matrix-based notation suffices for the purpose of our exposition; we only introduce the notion of cumulant matrix, defined as follows. Given a random n × 1 vector X and any n × n matrix M, we define the associated cumulant matrix Q_X(M) as the n × n matrix defined component-wise by

[Q_X(M)]_{ij} ≝ Σ_{k,l=1}^n Cum(X_i, X_j, X_k, X_l) M_{kl}.   (3.15)

If X is centered, the definition in equation 3.1 shows that

Q_X(M) = E{(X†MX) XX†} − R_X tr(M R_X) − R_X M R_X − R_X M† R_X,   (3.16)


where tr(·) denotes the trace and R_X denotes the covariance matrix of X, that is, [R_X]_{ij} = Cum(X_i, X_j). Equation 3.16 could have been chosen as an index-free definition of cumulant matrices. It shows that a given cumulant matrix can be computed or estimated at a cost similar to the estimation cost of a covariance matrix; there is no need to compute the whole set of fourth-order cumulants to obtain the value of Q_X(M) for a particular value of M. Actually, estimating a particular cumulant matrix is one way of collecting part of the fourth-order information in X; collecting the whole fourth-order information requires the estimation of O(n⁴) fourth-order cumulants.
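The index-free form 3.16 translates directly into a one-pass sample estimator, sketched below; no n⁴-element cumulant array is formed.

```python
import numpy as np

def cumulant_matrix(x, M):
    """Sample estimate of Q_X(M), equation 3.16 (x: (n, T) data array, M: (n, n) matrix)."""
    n, T = x.shape
    xc = x - x.mean(axis=1, keepdims=True)
    R = xc @ xc.T / T                                    # covariance R_X
    w = np.einsum('it,ij,jt->t', xc, M, xc)              # x(t)^T M x(t) for each sample
    E = (xc * w) @ xc.T / T                              # E{ (X^T M X) X X^T }
    return E - R * np.trace(M @ R) - R @ M @ R - R @ M.T @ R
```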

The structure of a cumulant matrix Q_X(M) in the ICA model is easily deduced from equation 3.14:

Q_X(M) = A Λ(M) A†,   Λ(M) = Diag(k(S_1) a_1†Ma_1, . . . , k(S_n) a_n†Ma_n),   (3.17)

where a_i denotes the ith column of A, that is, A = [a_1, . . . , a_n]. In this factorization, the (generally unknown) kurtosis values enter only in the diagonal matrix Λ(M), a fact implicitly exploited by the algebraic techniques described below.

3.3 Blind Identification Using Algebraic Structures. In section 3.1, contrast functions were derived from the ML principle assuming the model X = AS. In this section, we proceed similarly: we consider cumulant-based blind identification of A assuming X = AS, from which the structures 3.14 and 3.17 result.

Recall that the orthogonal approach can be implemented by first sphering the vector X explicitly. Let W be a whitening matrix, and denote Z ≝ WX the sphered vector. Without loss of generality, the model can be normalized by assuming that the entries of S have unit variance so that S is spatially white. Since Z = WX = WAS is also white by construction, the matrix U ≝ WA must be orthonormal: UU† = I. Therefore sphering yields the model Z = US with U orthonormal. Of course, this is still a model of independent components so that, similar to equation 3.17, we have for any matrix M the structure of the corresponding cumulant matrix of Z,

Q_Z(M) = U Λ(M) U†,   Λ(M) = Diag(k(S_1) u_1†Mu_1, . . . , k(S_n) u_n†Mu_n),   (3.18)

where u_i denotes the ith column of U. In a practical orthogonal statistic-based technique, one would first estimate a whitening matrix W, estimate some cumulants of Z = WX, compute an orthonormal estimate Û of U using these cumulants, and finally obtain an estimate Â of A as Â = W^{-1}Û or obtain a separating matrix as B = Û^{-1}W = Û†W.


3.3.1 Nonequivariant Blind Identification Procedures. We first present two blind identification procedures that exploit in a straightforward manner the structure 3.17; we explain why, in spite of attractive computational simplicity, they are not well behaved (not equivariant) and how they can be fixed for equivariance.

The first idea is not based on an orthogonal approach. Let M_1 and M_2 be two arbitrary n × n matrices, and define Q_1 ≝ Q_X(M_1) and Q_2 ≝ Q_X(M_2). According to equation 3.17, if X = AS we have Q_1 = AΛ_1A† and Q_2 = AΛ_2A† with Λ_1 and Λ_2 two diagonal matrices. Thus, G ≝ Q_1Q_2^{-1} = (AΛ_1A†)(AΛ_2A†)^{-1} = AΛA^{-1}, where Λ is the diagonal matrix Λ_1Λ_2^{-1}. It follows that GA = AΛ, meaning that the columns of A are the eigenvectors of G (possibly up to scale factors).

An extremely simple algorithm for blind identification of A follows: select two arbitrary matrices M_1 and M_2; compute sample estimates Q̂_1 and Q̂_2 using equation 3.16; find the columns of A as the eigenvectors of Q̂_1Q̂_2^{-1}. There is at least one problem with this idea: we have assumed invertible matrices throughout the derivation, and this may lead to instability. However, this specific problem may be fixed by sphering, as examined next.
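A literal transcription of this simple (and, as stressed below, nonequivariant) procedure follows; the two matrices M_1 and M_2 are drawn at random here, and small imaginary parts may appear in eigenvectors computed from sample statistics.

```python
import numpy as np

def two_matrix_identification(x, seed=0):
    """Estimate the mixing matrix A (up to scale and permutation) as eigenvectors of Q1 Q2^{-1}."""
    rng = np.random.default_rng(seed)
    n, T = x.shape
    xc = x - x.mean(axis=1, keepdims=True)
    R = xc @ xc.T / T

    def Q(M):                                            # sample version of equation 3.16
        w = np.einsum('it,ij,jt->t', xc, M, xc)
        return (xc * w) @ xc.T / T - R * np.trace(M @ R) - R @ M @ R - R @ M.T @ R

    Q1 = Q(rng.standard_normal((n, n)))
    Q2 = Q(rng.standard_normal((n, n)))
    _, A_hat = np.linalg.eig(Q1 @ np.linalg.inv(Q2))
    return A_hat.real           # columns estimate those of A; no equivariance guarantee (see text)
```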

Consider now the orthogonal approach as outlined above. Let M be some arbitrary matrix, and note that equation 3.18 is an eigendecomposition: the columns of U are the eigenvectors of Q_Z(M), which are orthonormal indeed because Q_Z(M) is symmetric. Thus, in the orthogonal approach, another immediate algorithm for blind identification is to estimate U as an (orthonormal) diagonalizer of an estimate of Q_Z(M). Thanks to sphering, problems associated with matrix inversion disappear, but a deeper problem associated with these simple algebraic ideas remains and must be addressed. Recall that the eigenvectors are uniquely determined² if and only if the eigenvalues are all distinct. Therefore, we need to make sure that the eigenvalues of Q_Z(M) are all distinct in order to preserve blind identifiability based on Q_Z(M). According to equation 3.18, these eigenvalues depend on the (sphered) system, which is unknown. Thus, it is not possible to determine a priori if a given matrix M corresponds to distinct eigenvalues of Q_Z(M). Of course, if M is randomly chosen, then the eigenvalues are distinct with probability 1, but we need more than this in practice because the algorithms use only sample estimates of the cumulant matrices. A small error in the sample estimate of Q_Z(M) can induce a large deviation of the eigenvectors if the eigenvalues are not well enough separated. Again, this is impossible to guarantee a priori because an appropriate selection of M requires prior knowledge about the unknown mixture.

² In fact, determined only up to permutations and signs that do not matter in an ICA context.

In summary, the diagonalization of a single cumulant matrix is computationally attractive and can be proved to be almost surely consistent, but it is not satisfactory because the nondegeneracy of the spectrum cannot be controlled. As a result, the estimation accuracy from a finite number of samples depends on the unknown system and is therefore unpredictable in practice; this lack of equivariance is hardly acceptable. One may also criticize these approaches on the ground that they rely on only a small part of the fourth-order information (summarized in an n × n cumulant matrix) rather than trying to exploit more cumulants (there are O(n⁴) fourth-order independent cumulant statistics). We examine next how these two problems can be alleviated by jointly processing several cumulant matrices.

3.3.2 Recovering Equivariance. Let M = {M_1, . . . , M_P} be a set of P matrices of size n × n and denote Q_i ≝ Q_Z(M_i) for 1 ≤ i ≤ P the associated cumulant matrices for the sphered data Z = US. Again, as above, for all i we have Q_i = UΛ_iU† with Λ_i a diagonal matrix given by equation 3.18. As a measure of nondiagonality of a matrix F, define Off(F) as the sum of the squares of the nondiagonal elements:

Off(F) ≝ Σ_{i≠j} (f_{ij})².   (3.19)

We have in particular Off(U†Q_iU) = Off(Λ_i) = 0 since Q_i = UΛ_iU† and U†U = I. For any matrix set M and any orthonormal matrix V, we define the following nonnegative joint diagonality criterion,

D_M(V) ≝ Σ_{M_i∈M} Off(V†Q_Z(M_i)V),   (3.20)

which measures how close to diagonality an orthonormal matrix V can simultaneously bring the cumulant matrices generated by M.

To each matrix set M is associated a blind identification algorithm as follows: (1) find a sphering matrix W to whiten the data X into Z = WX; (2) estimate the cumulant matrices Q_Z(M) for all M ∈ M by a sample version of equation 3.16; (3) minimize the joint diagonality criterion, equation 3.20, that is, make the cumulant matrices as diagonal as possible by an orthonormal transform V; (4) estimate A as Â = W^{-1}V, or its inverse as B = V†W, or the component vector as Y = V†Z = V†WX.

Such an approach seems to be able to alleviate the drawbacks mentioned above. Finding the orthonormal transform as the minimizer of the joint diagonality criterion of a set of cumulant matrices goes in the right direction because it involves a larger number of fourth-order statistics and because it decreases the likelihood of degenerate spectra. This argument can be made rigorous by considering a maximal set of cumulant matrices. By definition, this is a set obtained whenever M is an orthonormal basis for the linear space of n × n matrices. Such a basis contains n² matrices so that the corresponding cumulant matrices total


n² × n² = n⁴ entries, that is, as many as the whole fourth-order cumulant set. For any such maximal set (Cardoso & Souloumiac, 1993):

D_M(V) = φ_JADE(Y) with Y = V†Z,   (3.21)

where φ_JADE(Y) is the contrast function defined at equation 3.11. The joint diagonalization of a maximal set guarantees blind identifiability of A if k(S_i) = 0 for at most one entry S_i of S (Cardoso & Souloumiac, 1993). This is a necessary condition for any algorithm using only second- and fourth-order statistics (Comon, 1994).

A key point is made by relationship 3.21. We managed to turn an algebraic property (diagonality) of the cumulants of the (sphered) observations into a contrast function: a functional of the distribution of the output Y = V†Z. This fact guarantees that the resulting estimates are equivariant (Cardoso, 1995).

The price to pay with this technique for reconciling the algebraic approach with the naturally equivariant contrast-based approach is twofold: it entails the computation of a large (actually, maximal) set of cumulant matrices and the joint diagonalization of P = n² matrices, which is at least as costly as P times the diagonalization of a single matrix. However, the overall computational burden may be similar (see examples in section 5) to the cost of adaptive algorithms. This is because the cumulant matrices need to be estimated only once for a given data set and because there exists a reasonably efficient joint diagonalization algorithm (see section 4) that is not based on gradient-style optimization; it thus preserves the possibility of exploiting the underlying algebraic nature of the contrast function, equation 3.11. Several tricks for increasing efficiency are also discussed in section 4.

4 Jacobi Algorithms

This section describes algorithms for ICA sharing a common feature: a Jacobi optimization of an orthogonal contrast function, as opposed to optimization by gradient-like algorithms. The principle of Jacobi optimization is applied to a data-based algorithm, a statistic-based algorithm, and a mixed approach.

The Jacobi method is an iterative technique of optimization over the set of orthonormal matrices. The orthonormal transform is obtained as a sequence of plane rotations. Each plane rotation is a rotation applied to a pair of coordinates (hence the name: the rotation operates in a two-dimensional plane). If Y is an n × 1 vector, the (i, j)th plane rotation by an angle θ_ij changes the coordinates i and j of Y according to

[ Y_i ]      [  cos(θ_ij)   sin(θ_ij) ] [ Y_i ]
[ Y_j ]  ←   [ −sin(θ_ij)   cos(θ_ij) ] [ Y_j ] ,   (4.1)

while leaving the other coordinates unchanged. A sweep is one pass through


all the n(n − 1)/2 possible pairs of distinct indices. This idea is classic in numerical analysis (Golub & Van Loan, 1989); it can be considered in a wider context for the optimization of any function of an orthonormal matrix. Comon introduced the Jacobi technique for ICA (see Comon, 1994, for a data-based algorithm and an earlier reference in it for the Jacobi update of high-order cumulant tensors). Such a data-based Jacobi algorithm for ICA works through a sequence of Jacobi sweeps on the sphered data until a given orthogonal contrast φ(Y) is optimized. This can be summarized as:

1. Initialization. Compute a whitening matrix W and set Y = WX.

2. One sweep. For all n(n − 1)/2 pairs, that is, for 1 ≤ i < j ≤ n, do:

a. Compute the Givens angle θ_ij, optimizing φ(Y) when the pair (Y_i, Y_j) is rotated.

b. If |θ_ij| > θ_min, rotate the pair (Y_i, Y_j) according to equation 4.1.

3. If no pair has been rotated in the previous sweep, end. Otherwise go to 2 for another sweep.

Thus, the Jacobi approach considers a sequence of two-dimensional ICA problems. Of course, the updating step 2b on a pair (i, j) partially undoes the effect of previous optimizations on pairs containing either i or j. For this reason, it is necessary to go through several sweeps before optimization is completed. However, Jacobi algorithms are often very efficient and converge in a small number of sweeps (see the examples in section 5), and a key point is that each plane rotation depends on a single parameter, the Givens angle θ_ij, reducing the optimization subproblem at each step to a one-dimensional optimization problem.

An important benefit of basing ICA on fourth-order contrasts becomes apparent: because fourth-order contrasts are polynomial in the parameters, the Givens angles can often be found in closed form.

In the above scheme, θ_min is a small angle, which controls the accuracy of the optimization. In numerical analysis, it is determined according to machine precision. For a statistical problem such as ICA, θ_min should be selected in such a way that rotations by a smaller angle are not statistically significant.

In our experiments, we take θ_min to scale as 1/√T, typically: θ_min = 10⁻²/√T.

This scaling can be related to the existence of a performance bound in the orthogonal approach to ICA (Cardoso, 1994). This value does not seem to be critical, however, because we have found Jacobi algorithms to be very fast at finishing.

In the remainder of this section, we describe three possible implementations of these ideas. Each one corresponds to a different type of contrast function and to different options about updating. Section 4.1 describes a data-based algorithm optimizing φ^{OML}_{24}(Y); section 4.2 describes a statistic-based algorithm optimizing φ_JADE(Y); section 4.3 presents a mixed approach


optimizing φ_SH(Y); finally, section 4.4 discusses the relationships between these contrast functions.

4.1 A Data-Based Jacobi Algorithm: MaxKurt. We start by a Jacobi technique for optimizing the approximation, equation 3.9, to the orthogonal likelihood. For the sake of exposition, we consider a simplified version of φ^{OML}_{24}(Y) obtained by setting k(S_1) = k(S_2) = · · · = k(S_n) = k, in which case the minimization of the contrast function, equation 3.9, is equivalent to the minimization of

φ_MK(Y) ≝ −k Σ_i Q^Y_{iiii}.   (4.2)

This criterion is also studied by Moreau and Macchi (1996), who propose a two-stage adaptive procedure for its optimization; it also serves as a starting point for introducing the one-stage adaptive algorithm of Cardoso and Laheld (1996).

Denote G_ij(θ) the plane rotation matrix that rotates the pair (i, j) by an angle θ as in step 2b above. Then simple trigonometry yields:

φ_MK(G_ij(θ)Y) = μ_ij − k λ_ij cos(4(θ − Ω_ij)),   (4.3)

where μ_ij does not depend on θ and λ_ij is nonnegative. The principal determination of the angle Ω_ij is characterized by

Ω_ij = (1/4) arctan( 4Q^Y_{iiij} − 4Q^Y_{ijjj} ,  Q^Y_{iiii} + Q^Y_{jjjj} − 6Q^Y_{iijj} ),   (4.4)

where arctan(y, x) denotes the angle α ∈ (−π, π] such that cos(α) = x/√(x² + y²) and sin(α) = y/√(x² + y²).

If Y is a zero-mean sphered vector, expression 4.4 further simplifies to

Ω_ij = (1/4) arctan( 4 E(Y_i³Y_j − Y_iY_j³) ,  E((Y_i² − Y_j²)² − 4Y_i²Y_j²) ).   (4.5)

The computations are given in the appendix. It is now immediate to minimize φ_MK(Y) for each pair of components and for either choice of the sign of k. If one looks for components with positive kurtosis (often called supergaussian), the minimization of φ_MK(Y) is identical to the maximization of the sum of the kurtosis of the components since we have k > 0 in this case. The Givens angle simply is θ = Ω_ij since this choice makes the cosine in equation 4.3 equal to its maximum value.
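The resulting sweep is short enough to be transcribed here in NumPy (the author's reference implementation is the Matlab listing in the appendix); the version below treats the k > 0 case.

```python
import numpy as np

def maxkurt(z, theta_min=None, max_sweeps=100):
    """MaxKurt Jacobi sweeps on sphered data z (shape (n, T)), positive-kurtosis case."""
    y = z.copy()
    n, T = y.shape
    if theta_min is None:
        theta_min = 1e-2 / np.sqrt(T)                    # scaling discussed in the text
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                yi, yj = y[i].copy(), y[j].copy()
                num = 4.0 * np.mean(yi**3 * yj - yi * yj**3)
                den = np.mean((yi**2 - yj**2)**2 - 4.0 * yi**2 * yj**2)
                theta = 0.25 * np.arctan2(num, den)      # Givens angle, equation 4.5
                if abs(theta) > theta_min:
                    rotated = True
                    c, s = np.cos(theta), np.sin(theta)
                    y[i] = c * yi + s * yj               # plane rotation of the pair, equation 4.1
                    y[j] = -s * yi + c * yj
        if not rotated:
            break
    return y
```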

We refer to the Jacobi algorithm outlined above as MaxKurt. A Matlab implementation is listed in the appendix, whose simplicity is consistent with the data-based approach. Note, however, that it is also possible to use


the same computations in a statistic-based algorithm. Rather than rotating the data themselves at each step by equation 4.1, one instead updates the set of all fourth-order cumulants according to the transformation law, equation 3.13, with the Givens angle for each pair still given by equation 4.4. In this case, the memory requirement is O(n⁴) for storing all the cumulants, as opposed to nT for storing the data set.

The case k < 0, where, looking for light-tailed components, one should minimize the sum of the kurtosis, is similar. This approach could be extended to kurtosis of mixed signs, but the contrast function then has less symmetry. This is not included in this article.

4.1.1 Stability. What is the effect of the approximation of equal kurtosis made to derive the simple contrast φ_MK(Y)? When X = AS with S of independent components, we can at least use the stability result of Cardoso and Laheld (1996), which applies directly to this contrast. Define the normalized kurtosis as κ_i = σ_i^{-4} k(S_i). Then B = A^{-1} is a stable point of the algorithm with k > 0 if κ_i + κ_j > 0 for all pairs 1 ≤ i < j ≤ n. The same condition also holds with all signs reversed for components with negative kurtosis.

4.2 A Statistic-Based Algorithm: JADE. This section outlines the JADE algorithm (Cardoso & Souloumiac, 1993), which is specifically a statistic-based technique. We do not need to go into much detail because the general technique follows directly from the considerations of section 3.3. The JADE algorithm can be summarized as:

1. Initialization. Estimate a whitening matrix W and set Z = WX.

2. Form statistics. Estimate a maximal set {Q̂_i^Z} of cumulant matrices.

3. Optimize an orthogonal contrast. Find the rotation matrix V̂ such that the cumulant matrices are as diagonal as possible, that is, solve V̂ = arg min_V Σ_i Off(V†Q̂_i^Z V).

4. Separate. Estimate A as Â = W^{-1}V̂ and/or estimate the components as Ŝ = Â^{-1}X = V̂†Z.

This is a Jacobi algorithm because the joint diagonalizer at step 3 is found by a Jacobi technique. However, the plane rotations are applied not to the data (which are summarized in the cumulant matrices) but to the cumulant matrices themselves; the algorithm updates not data but matrix-valued statistics of the data. As with MaxKurt, the Givens angle at each step can be computed in closed form, even in the case of possibly complex matrices (Cardoso & Souloumiac, 1993). The explicit expression for the Givens angles is not particularly enlightening and is not reported here. (The interested reader is referred to Cardoso & Souloumiac, 1993, and may request a Matlab implementation from the author.)
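Since the Givens angle is not reproduced in this article, the sketch below uses the closed-form rotation for jointly diagonalizing real symmetric matrices given in Cardoso and Souloumiac (1993); it is a bare-bones illustration of step 3, not the author's reference code.

```python
import numpy as np

def joint_diagonalize(Qs, theta_min=1e-8, max_sweeps=100):
    """Return an orthonormal V making the matrices in Qs as diagonal as possible (Jacobi sweeps)."""
    A = np.stack([np.asarray(Q, dtype=float) for Q in Qs])   # shape (P, n, n)
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                # closed-form angle for the (i, j) subproblem (Cardoso & Souloumiac, 1993)
                g1 = A[:, i, i] - A[:, j, j]
                g2 = A[:, i, j] + A[:, j, i]
                ton = np.sum(g1 * g1 - g2 * g2)
                toff = 2.0 * np.sum(g1 * g2)
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                if abs(theta) > theta_min:
                    rotated = True
                    c, s = np.cos(theta), np.sin(theta)
                    Ri, Rj = A[:, i, :].copy(), A[:, j, :].copy()
                    A[:, i, :], A[:, j, :] = c * Ri + s * Rj, -s * Ri + c * Rj   # rotate rows
                    Ci, Cj = A[:, :, i].copy(), A[:, :, j].copy()
                    A[:, :, i], A[:, :, j] = c * Ci + s * Cj, -s * Ci + c * Cj   # rotate columns
                    Vi, Vj = V[:, i].copy(), V[:, j].copy()
                    V[:, i], V[:, j] = c * Vi + s * Vj, -s * Vi + c * Vj         # accumulate V
        if not rotated:
            break
    return V
```

A separating matrix then follows as B = V†W, in line with step 4.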


A key issue is the selection of the cumulant matrices to be involved in the estimation. As explained in section 3.2, the joint diagonalization criterion Σ_i Off(V†Q_i^Z V) is made identical to the contrast function, equation 3.11, by using a maximal set of cumulant matrices. This is a bit surprising but very fortunate. We do not know of any other way for a priori selecting cumulant matrices that would offer such a property (but see the next section). In any case, it guarantees equivariant estimates because the algorithm, although operating on statistics of the sphered data, also optimizes implicitly a function of Y = V†Z only.

Before proceeding, we note that true cumulant matrices can be exactly jointly diagonalized when the model holds, but this is no longer the case when we process real data. First, only sample statistics are available; second, the model X = AS with independent entries in S cannot be expected to hold accurately in general. This is another reason that it is important to select cumulant matrices such that Σ_i Off(V†Q_i^Z V) is a contrast function. In this case, the impossibility of an exact joint diagonalization corresponds to the impossibility of finding Y = BX with independent entries. Making a maximal set of cumulant matrices as diagonal as possible coincides with making the entries of Y as independent as possible as measured by (the sample version of) criterion 3.11.

There are several options for estimating a maximal set of cumulant matrices. Recall that such a set is defined as {Q_Z(M_i) | i = 1, . . . , n²}, where {M_i | i = 1, . . . , n²} is any basis for the n²-dimensional linear space of n × n matrices. A canonical basis for this space is {e_p e_q† | 1 ≤ p, q ≤ n}, where e_p is a column vector with a 1 in pth position and 0's elsewhere. It is readily checked that

[Q_Z(e_p e_q†)]_{ij} = Cum(Z_i, Z_j, Z_p, Z_q).   (4.6)

In other words, the entries of the cumulant matrices for the canonical basis are just the cumulants of Z. A better choice is to consider a symmetric/skew-symmetric basis. Denote M_pq an n × n matrix defined as follows: M_pq = e_p e_p† if p = q, M_pq = 2^{-1/2}(e_p e_q† + e_q e_p†) if p < q, and M_pq = 2^{-1/2}(e_p e_q† − e_q e_p†) if p > q. This is an orthonormal basis of R^{n×n}. We note that because of the symmetries of the cumulants Q_Z(e_p e_q†) = Q_Z(e_q e_p†), so that Q_Z(M_pq) = 2^{1/2} Q_Z(e_p e_q†) if p < q and Q_Z(M_pq) = 0 if p > q. It follows that the cumulant matrices Q_Z(M_pq) for p > q do not even need to be computed. Being identically zero, they do not enter in the joint diagonalization criterion. It is therefore sufficient to estimate and to diagonalize n + n(n−1)/2 (symmetric) cumulant matrices.

There is another idea to reduce the size of the statistics needed to represent exhaustively the fourth-order information. It is, however, applicable only when the model X = AS holds. In this case, the cumulant matrices do have the structure shown at equation 3.18, and their sample estimates are close to it for large enough T. Then the linear mapping M → Q_Z(M) has rank n


(more precisely, its rank is equal to the number of components with nonzero kurtosis) because there are n linear degrees of freedom for matrices in the form UΛU†, namely, the n diagonal entries of Λ. From this fact and from the symmetries of the cumulants, it follows that there exist n eigenmatrices E_1, . . . , E_n, which are orthonormal and satisfy Q_Z(E_i) = μ_iE_i, where the scalar μ_i is the corresponding eigenvalue. These matrices E_1, . . . , E_n span the range of the mapping M → Q_Z(M), and any matrix M orthogonal to them is in the kernel, that is, Q_Z(M) = 0. This shows that all the information contained in Q_Z can be summarized by the n eigenmatrices associated with the n nonzero eigenvalues. By inserting M = u_iu_i† in expression 3.18 and using the orthonormality of the columns of U (that is, u_i†u_j = δ_ij), it is readily checked that a set of eigenmatrices is {E_i = u_iu_i†}.

The JADE algorithm was originally introduced as performing ICA by a joint approximate diagonalization of eigenmatrices in Cardoso and Souloumiac (1993), where we advocated the joint diagonalization of only the n most significant eigenmatrices of Q_Z as a device to reduce the computational load (even though the eigenmatrices are obtained at the extra cost of the eigendecomposition of an n² × n² array containing all the fourth-order cumulants). The number of statistics is reduced from n⁴ cumulants, or n(n + 1)/2 symmetric cumulant matrices of size n × n, to a set of n eigenmatrices of size n × n. Such a reduction is achieved at no statistical loss (at least for large T) only when the model holds. Therefore, we do not recommend reduction to eigenmatrices when processing data sets for which it is not clear a priori whether the model X = AS actually holds to good accuracy. We still refer to JADE as the process of jointly diagonalizing a maximal set of cumulant matrices, even when it is not further reduced to the n most significant eigenmatrices. It should also be pointed out that the device of truncating the full cumulant set by reduction to the most significant matrices is expected to destroy the equivariance property when the model does not hold. The next section shows how these problems can be overcome in a technique borrowing from both the data-based approach and the statistic-based approach.

4.3 A Mixed Approach: SHIBBS. In the JADE algorithm, a maximal set of cumulant matrices is computed as a way to ensure equivariance from the joint diagonalization of a fixed set of cumulant matrices. As a benefit, cumulants are computed only once in a single pass through the data set, and the Jacobi updates are performed on these statistics rather than on the whole data set. This is a good thing for data sets with a large number T of samples. On the other hand, estimating a maximal set requires O(n⁴T) operations, and its storage requires O(n⁴) memory positions. These figures can become prohibitive when looking for a large number of components. In contrast, gradient-based techniques have to store and update nT samples. This section describes a technique standing between the two extreme positions represented by the all-statistic approach and the all-data approach.


Recall that an algorithm is equivariant as soon as its operation can be expressed only in terms of the extracted components Y (Cardoso, 1995). This suggests the following technique:

1. Initialization. Select a fixed set M = {M_1, . . . , M_P} of n × n matrices. Estimate a whitening matrix W and set Y = WX.

2. Estimate a rotation. Estimate the set {Q_Y(M_p) | 1 ≤ p ≤ P} of P cumulant matrices and find a joint diagonalizer V of it.

3. Update. If V is close enough to the identity transform, stop. Otherwise, rotate the data: Y ← V†Y and go to 2.

Such an algorithm is equivariant thanks to the reestimation of the cumulants of Y after updating. It is in some sense data based since the updating in step 3 is on the data themselves. However, the rotation matrix to be applied to the data is computed in step 2 as in a statistic-based procedure.

What would be a good choice for the set M? The set of n matrices M = {e_1e_1†, . . . , e_ne_n†} seems a natural choice: it is an order of magnitude smaller than the maximal set, which contains O(n²) matrices. The kth cumulant matrix in such a set is Q_Y(e_ke_k†), and its (i, j)th entry is Cum(Y_i, Y_j, Y_k, Y_k), which is just an n × n square block of cumulants of Y. We call the set of n cumulant matrices obtained in this way when k is shifted from 1 to n the set of SHIfted Blocks for Blind Separation (SHIBBS), and we use the same name for the ICA algorithm that determines the rotation V by an iterative joint diagonalization of the SHIBBS set.
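A sketch of the SHIBBS statistics is given below: it forms the n cumulant matrices Q_Y(e_ke_k†) directly from the current components, and these can then be passed to a joint diagonalizer as in step 2 above.

```python
import numpy as np

def shibbs_set(y):
    """The n SHIBBS cumulant matrices Q_Y(e_k e_k^T); entry (i, j) of the kth one is Cum(Y_i, Y_j, Y_k, Y_k)."""
    n, T = y.shape
    yc = y - y.mean(axis=1, keepdims=True)
    R = yc @ yc.T / T
    Qs = []
    for k in range(n):
        M = np.zeros((n, n))
        M[k, k] = 1.0                                 # M = e_k e_k^T
        w = yc[k] ** 2                                # X^T M X reduces to Y_k^2
        E = (yc * w) @ yc.T / T                       # E{ Y_k^2 Y Y^T }
        Qs.append(E - R * R[k, k] - 2.0 * (R @ M @ R))  # equation 3.16 with M = M^T = e_k e_k^T
    return Qs
```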

Strikingly enough, the small SHIBBS set guarantees a performance identical to JADE when the model holds, for the following reason. Consider the final step of the algorithm, where Y is close to S if it holds that X = AS with S of independent components. Then the cumulant matrices Q_Y(e_pe_q†) are zero for p ≠ q because all the cross-cumulants of Y are zero. Therefore, the only nonzero cumulant matrices used in the maximal set of JADE are those corresponding to e_p = e_q, that is, precisely those included in SHIBBS. Thus the SHIBBS set actually tends to the set of "significant eigenmatrices" exhibited in the previous section. In this sense, SHIBBS implements the original program of JADE, the joint diagonalization of the significant eigenmatrices, but it does so without going through the estimation of the whole cumulant set and through the computation of its eigenmatrices.

Does the SHIBBS algorithm correspond to the optimization of a contrast function? We cannot resort to the equivalence of JADE and SHIBBS because it is established only when the model holds, and we are looking for a statement that does not depend on this fact. Examination of the joint diagonality criterion for the SHIBBS set suggests that the SHIBBS technique solves the problem of optimizing the contrast function φ_SH(Y) defined in equation 3.12. As a matter of fact, the condition for a given Y to be a fixed


point of the SHIBBS algorithm is that for any pair 1 ≤ i < j ≤ n:

\[
\sum_{k} \mathrm{Cum}(Y_i, Y_j, Y_k, Y_k)\,\bigl(\mathrm{Cum}(Y_i, Y_i, Y_k, Y_k) - \mathrm{Cum}(Y_j, Y_j, Y_k, Y_k)\bigr) = 0, \qquad (4.7)
\]

and we can prove that this is also the stationarity condition of φ_SH(Y). We do not include the proofs of these statements, which are purely technical.
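Purely as an illustration (this helper is ours, not part of the algorithm), the left-hand side of equation 4.7 can be checked on sample data with empirical cumulants:

function lhs = shibbs_stationarity(Y, i, j)
% Sketch (ours): evaluate the left-hand side of equation 4.7 from sample
% cumulants, for a pair (i,j) of rows of the n x T zero-mean, whitened array Y.
cum4 = @(a,b,c,d) mean(a.*b.*c.*d) - mean(a.*b)*mean(c.*d) ...
                - mean(a.*c)*mean(b.*d) - mean(a.*d)*mean(b.*c);
n   = size(Y,1);
lhs = 0;
for k = 1:n,
  lhs = lhs + cum4(Y(i,:),Y(j,:),Y(k,:),Y(k,:)) ...
            * ( cum4(Y(i,:),Y(i,:),Y(k,:),Y(k,:)) ...
              - cum4(Y(j,:),Y(j,:),Y(k,:),Y(k,:)) );
end
return

At a fixed point of SHIBBS, this quantity should be close to zero (up to estimation error) for every pair (i, j).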

4.4 Comparing Fourth-Order Orthogonal Contrasts. We have considered two approximations, φ_JADE(Y) and φ_SH(Y), to the minimum marginal entropy/mutual information contrast φ_MI(Y), which are based on fourth-order cumulants and can be optimized by Jacobi techniques. The approximation φ^ME_24(Y) proposed by Comon also belongs to this category. One may wonder about the relative statistical merits of these three approximations. The contrast φ^ME_24(Y) stems from an Edgeworth expansion for approximating φ_ME(Y), which in turn has been shown to derive from the ML principle (see section 2). Since ML estimation offers (asymptotic) optimality properties, one may be tempted to conclude that φ^ME_24(Y) is superior. However, this is not the case, as discussed now.

First, when the ICA model holds, it can be shown that even though φ^ME_24(Y) and φ_JADE(Y) are different criteria, they have the same asymptotic performance when applied to sample statistics (Souloumiac & Cardoso, 1991). This is also true of φ_SH(Y) since we have seen that it is equivalent to JADE in this case (a more rigorous proof is possible, based on equation 4.7, but is not included).

Second, when the ICA model does not hold, the notion of identification accuracy does not make sense anymore, but one would certainly favor an orthogonal contrast reaching its minimum at a point as close as possible to the point where the "true" mutual information φ_ME(Y) is minimized. However, it seems difficult to find a simple contrast (such as those considered here) that would be a good approximation to φ_ME(Y) for any wide class of distributions of X. Note that the ML argument in favor of φ^ME_24(Y) is based on an Edgeworth expansion that is valid for "almost gaussian" distributions, which are the distributions that make ICA very difficult and of dubious significance: in practice, ICA should be restricted to data sets where the components show a significant amount of nongaussianity, in which case the Edgeworth expansions cannot be expected to be accurate.

There is another way than the Edgeworth expansion to arrive at φ^ME_24(Y). Consider cumulant matching: the matching of the cumulants of Y to the corresponding cumulants of a hypothetical vector S with independent components. The orthogonal contrast functions φ_JADE(Y), φ_SH(Y), and φ^ME_24(Y) can be seen as matching criteria because they penalize the deviation of the cross-cumulants of Y from zero (which is indeed the value of the cross-cumulants of a vector S with independent components), and they do so under


the constraint that Y is white, that is, by enforcing an exact match of the second-order cumulants of Y.

It is possible to devise an asymptotically optimal matching criterion by taking into account the variability of the sample estimates of the cumulants. Such a computation is reported in Cardoso et al. (1996) for the matching of all second- and fourth-order cumulants of complex-valued signals, but a similar computation is possible for real-valued problems. It shows that the optimal weighting of the cross-cumulants depends on the distributions of the components, so that the "flat weighting" of all the cross-cumulants, as in equation 3.10, is not the best one in general. However, in the limit of "almost gaussian" signals, the optimal weights tend to values corresponding precisely to the contrast K_24(Y|S) defined in equation 3.5. This is not unexpected and confirms that the crude cumulant expansion used in deriving equation 3.5 is sensible, though not optimal for significantly nongaussian components.

It seems from the definitions of φ_JADE(Y), φ_SH(Y), and φ^ME_24(Y) that these different contrasts involve different types of cumulants. This is, however, an illusion because the compact definitions given above do not take into account the symmetries of the cumulants: the same cross-cumulant may be counted several times in each of these contrasts. For instance, the definition of JADE excludes the cross-cumulant Cum(Y_1, Y_1, Y_2, Y_3) but includes the cross-cumulant Cum(Y_1, Y_2, Y_1, Y_3), which is identical. Thus, in order to determine if any bit of fourth-order information is ignored by any particular contrast, a nonredundant description should be given. All the possible cross-cumulants come in four different patterns of indices: (ijkl), (iikl), (iijj), and (ijjj). Nonredundant expressions in terms of these patterns are in the form

\[
\phi[Y] = C_a \sum_{i<j<k<l} e_{ijkl}
        + C_b \sum_{i<k<l} \bigl(e_{iikl} + e_{kkil} + e_{llik}\bigr)
        + C_c \sum_{i<j} e_{iijj}
        + C_d \sum_{i<j} \bigl(e_{ijjj} + e_{jjji}\bigr),
\]

where, by definition, e_{ijkl} = Cum(Y_i, Y_j, Y_k, Y_l)^2 and the C's are numerical constants. It remains to count how many times a unique cumulant appears in the redundant definitions of the three approximations to mutual information considered so far. We give only the result of this uninspiring task in Table 1, which shows that all the cross-cumulants are actually included in the three contrasts, which therefore differ only by the different scalar weights given to each particular type. It means that the three contrasts essentially do the same thing. In particular, when the number n of components is large enough, the number of cross-cumulants of type [ijkl] (all indices distinct) grows as O(n^4), while the number of other types grows as O(n^3) at most. Therefore, the [ijkl] type outnumbers all the other types for large n: one may conjecture the equivalence of the three contrasts in this limit. Unfortunately, it seems difficult to draw more conclusions.


Table 1: Number of Times a Cross-Cumulant of a Given Type Appears in a Given Contrast.

Constants      C_a     C_b     C_c     C_d
Pattern        ijkl    iikl    iijj    ijjj

Comon ICA       24      12      6       4
JADE            24      10      4       2
SHIBBS          24      12      4       4

For instance, we have mentioned the asymptotic equivalence between Comon's contrast and the JADE contrast for any n, but it does not reveal itself directly in the weight table.
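For completeness (this count is ours, not part of the original table), the number of distinct cross-cumulants of each pattern can be written explicitly, which makes the growth rates quoted above concrete:

\[
\#(ijkl) = \binom{n}{4} = O(n^4), \quad
\#(iikl) = n\binom{n-1}{2} = O(n^3), \quad
\#(iijj) = \binom{n}{2} = O(n^2), \quad
\#(ijjj) = n(n-1) = O(n^2),
\]

so the [ijkl] cross-cumulants indeed dominate for large n.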

5 A Comparison on Biomedical Data

The performance of the algorithms presented above is illustrated using the averaged event-related potential (ERP) data recorded and processed by Makeig and coworkers. A detailed account of their analysis is in Makeig, Bell, Jung, and Sejnowski (1997). For our comparison, we use the data set and the "logistic ICA" algorithm provided with version 3.1 of Makeig's ICA toolbox.³ The data set contains 624 data points of averaged ERP sampled from 14 EEG electrodes. The implementation of the logistic ICA provided in the toolbox is somewhat intermediate between equation 1.1 and its off-line counterpart: H(Y) is averaged through subblocks of the data set. The nonlinear function is taken to be ψ(y) = 2/(1 + e^{-y}) − 1 = tanh(y/2). This is minus the log-derivative ψ(y) = −r′(y)/r(y) of the density r(y) = β/cosh(y/2) (β is a normalization constant). Therefore, this method maximizes over A the likelihood of model X = AS under the assumption that S has independent components with densities equal to β/cosh(y/2).
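For reference, one off-line relative-gradient sweep with this nonlinearity can be sketched as follows; this is our generic rendering of the scheme of equation 1.1 with ψ(y) = tanh(y/2) and the estimating field H(y) = ψ(y)y† − I, not the exact code of Makeig's toolbox (which averages over subblocks and adapts the learning rate).

% Sketch (ours) of one off-line update of the form of equation 1.1 with the
% logistic score psi(y) = tanh(y/2). X is the n x T data, B the current
% separating matrix, mu a small positive step size.
Y = B*X;                                          % current components
H = (tanh(Y/2)*Y')/size(X,2) - eye(size(B,1));    % average field E[psi(Y)Y'] - I
B = (eye(size(B,1)) - mu*H)*B;                    % multiplicative update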

Figure 1 shows the components Y_JADE produced by JADE (first column) and the components Y_LICA produced by the logistic ICA included in Makeig's toolbox, which was run with all the default options; the third column shows the difference between the components at the same scale. This direct comparison is made possible with the following postprocessing: the components Y_LICA were normalized to have unit variance and were sorted by increasing values of kurtosis. The components Y_JADE have unit variance by construction; they were sorted and their signs were changed to match Y_LICA. Figure 1 shows that Y_JADE and Y_LICA essentially agree on 9 of 14 components.
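The postprocessing just described is simple enough to sketch directly (function and variable names are ours):

function [Ylica, Yjade] = align_components(Ylica, Yjade)
% Sketch (ours) of the postprocessing described above: normalize Ylica to
% unit variance, sort both n x T sets by increasing kurtosis, and flip the
% signs of Yjade to match Ylica. Rows are assumed zero-mean.
T     = size(Ylica,2);
Ylica = Ylica ./ (sqrt(mean(Ylica.^2,2)) * ones(1,T));   % unit-variance rows
kurt  = @(Z) mean(Z.^4,2) - 3*mean(Z.^2,2).^2;           % sample kurtosis per row
[k1, i1] = sort(kurt(Ylica));  Ylica = Ylica(i1,:);
[k2, i2] = sort(kurt(Yjade));  Yjade = Yjade(i2,:);
s = sign(sum(Yjade.*Ylica, 2));  s(s==0) = 1;            % correlation signs
Yjade = (s*ones(1,T)) .* Yjade;
return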

³ Available from http://www.cnl.salk.edu/∼scott/.


Figure 1: The source signals estimated by JADE and the logistic ICA and their differences (panels, left to right: JADE, Logistic ICA, Difference).


Another illustration of this fact is given by the first row of Figure 2. The left panel shows the magnitude |C_ij| of the entries of the transfer matrix C such that C Y_LICA = Y_JADE. This matrix was computed after the postprocessing of the components described in the previous paragraph: it should be the identity matrix if the two methods agreed, even only up to scales, signs, and permutations. The figure shows a strong diagonal structure in the northeast block, while the disagreement between the two methods is apparent in the gray zone of the southwest block. The right panel shows the kurtosis k(Y_i^JADE) plotted against the kurtosis k(Y_i^LICA). A key observation is that the two methods do agree about the most kurtic components; these also are the components where the time structure is the most visible. In other words, the two methods essentially agree wherever a human eye finds the most visible structures. Figure 2 also shows the results of SHIBBS and MaxKurt. The transfer matrix C for MaxKurt is seen to be more diagonal than the transfer matrix for JADE, while the transfer for SHIBBS is less diagonal. Thus, the logistic ICA and MaxKurt agree more on this data set. Another figure (not included) shows that JADE and SHIBBS are in very close agreement over all components.
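The matrix C can be obtained, for instance, by a least-squares fit of one set of (already aligned) components onto the other; a minimal sketch, assuming both Yjade and Ylica are n × T arrays:

% Sketch (ours): least-squares transfer matrix C such that C*Ylica ~ Yjade,
% then display |Cij| as in the left column of Figure 2.
C = Yjade / Ylica;        % solves C*Ylica = Yjade in the least-squares sense
imagesc(abs(C)); colormap(gray); colorbar;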

These results are very encouraging because they show that various ICA algorithms agree wherever they find structure on this particular data set. This is very much in support of the ICA approach to the processing of signals for which it is not clear that the model holds. It leaves open the question of interpreting the disagreement between the various contrast functions in the swamp of the low kurtosis domain.

It turns out that the disagreement between the methods on this data set is, in our view, an illusion. Consider the eigenvalues λ_1, . . . , λ_n of the covariance matrix R_X of the observations. They are plotted on a dB scale (that is, 10 log10 λ_i) in Figure 3. The two least significant eigenvalues stand rather clearly below the strongest ones, with a gap of 5.5 dB. We take this as an indication that one should look for 12 linear components in this data set rather than 14, as in the previous experiments. The result is rather striking: by running JADE and the logistic ICA on the first 12 principal components, an excellent agreement is found over all the 12 extracted components, as seen in Figure 4. This observation also holds for MaxKurt and SHIBBS, as shown by Figure 5.
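A minimal sketch of this dimension reduction (ours, not the toolbox code): examine the eigenvalue spectrum of the sample covariance on a dB scale and keep the 12 leading principal components before running the ICA algorithms.

% Sketch (ours): reduce the 14-channel data X (n x T, rows zero-mean) to
% its first 12 principal components before ICA.
[n, T]     = size(X);
[U, D]     = eig(X*X'/T);                % sample covariance and its eigenvectors
[lam, idx] = sort(diag(D), 'descend');
plot(10*log10(lam), 'o');                % eigenvalue spectrum in dB, cf. Figure 3
m  = 12;                                 % number of retained components
Xr = U(:, idx(1:m))' * X;                % m x T reduced data, fed to JADE, etc.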

Table 2 lists the number of floating-point operations (as returned by Matlab) and the CPU time required to run the four algorithms on a SPARC 2 workstation. The MaxKurt technique was clearly the fastest here; however, it was applicable only because we were looking for components with positive kurtosis. The same is true for the version of logistic ICA considered in this experiment. It is not true of JADE or SHIBBS, which are consistent as soon as at most one source has a vanishing kurtosis, regardless of the sign of the nonzero kurtosis (Cardoso & Souloumiac, 1993). The logistic ICA required only about 50% more time than JADE.


Figure 2: (Left column) Absolute values of the coefficients |C_ij| of a matrix relating the signals obtained by two different methods. A perfect agreement would be for C = I: deviation from diagonal indicates a disagreement. The signals are sorted by kurtosis, showing a good agreement for high kurtosis. (Right column) Comparing the kurtosis of the sources estimated by two different methods. From top to bottom: JADE versus logistic ICA, SHIBBS versus logistic ICA, and MaxKurt versus logistic ICA.


Figure 3: Eigenvalues of the covariance matrix R_X of the data in dB (i.e., 10 log10(λ_i)).

Table 2: Number of Floating-Point Operations and CPU Time.

                    14 Components              12 Components
Method           Flops       CPU Secs.      Flops       CPU Secs.
Logistic ICA     5.05e+07    3.98           3.51e+07    3.54
JADE             4.00e+07    2.55           2.19e+07    1.69
SHIBBS           5.61e+07    4.92           2.47e+07    2.35
MaxKurt          1.19e+07    1.09           5.91e+06    0.54

The SHIBBS algorithm is slower than JADE here because the data set is not large enough to give it an edge. These remarks are even more marked when comparing the figures obtained in the extraction of 12 components. It should be clear that these figures do not prove much because they are representative of only a particular data set and of particular implementations of the algorithms, as well as of the various parameters used for tuning the algorithms. However, they do disprove the claim that algebraic-cumulant methods are of no practical value.

6 Summary and Conclusions

The definitions of classic entropic contrasts for ICA can all be understood from an ML perspective. An approximation of the Kullback-Leibler divergence yields cumulant-based approximations of these contrasts.


Figure 4: The 12 source signals estimated by JADE and a logistic ICA out of the first 12 principal components of the original data (panels, left to right: JADE, Logistic ICA, Difference).


Figure 5: Same setting as for Figure 2, but the processing is restricted to the first 12 principal components, showing a better agreement among all the methods.


In the orthogonal approach to ICA, where decorrelation is enforced, the cumulant-based contrasts can be optimized with Jacobi techniques, operating on either the data or statistics of the data, namely, cumulant matrices. The structure of the cumulants in the ICA model can be easily exploited by algebraic identification techniques, but the simple versions of these techniques are not equivariant. One possibility for overcoming this problem is to exploit the joint algebraic structure of several cumulant matrices. In particular, the JADE algorithm bridges the gap between contrast-based approaches and algebraic techniques because the JADE objective is both a contrast function and the expression of the eigenstructure of the cumulants. More generally, the algebraic nature of the cumulants can be exploited to ease the optimization of cumulant-based contrast functions by Jacobi techniques. This can be done in a data-based or a statistic-based mode. The latter has an increasing relative advantage as the number of available samples increases, but it becomes impractical for large numbers n of components since the number of fourth-order cumulants grows as O(n^4). This can be overcome to a certain extent by resorting to SHIBBS, which iteratively recomputes a number O(n^3) of cumulants.

An important objective of this article was to combat the prejudice that cumulant-based algebraic methods are impractical. We have shown that they compare very well to state-of-the-art implementations of adaptive techniques on a real data set.

More extensive comparisons remain to be done, involving variants of the ideas presented here. A technique like JADE is likely to choke on a very large number of components, but the SHIBBS version is not as memory demanding. Similarly, the MaxKurt method can be extended to deal with components with mixed kurtosis signs. In this respect, it is worth underlining the analogy between the MaxKurt update and the relative gradient update, equation 1.1, when function H(·) is in the form of equation 1.5.

A comment on tuning the algorithms: in order to code an all-purpose ICA algorithm based on gradient descent, it is necessary to devise a smart learning schedule. This is usually based on heuristics and requires the tuning of some parameters. In contrast, Jacobi algorithms do not need to be tuned in their basic versions. However, one may think of improving on the regular Jacobi sweep through all the pairs in prespecified order by devising more sophisticated updating schedules. Heuristics would be needed then, as in the case of gradient descent methods.

We conclude with a negative point about the fourth-order techniques described in this article. By nature, they optimize contrasts corresponding somehow to using linear-cubic nonlinear functions in gradient-based algorithms. Therefore, they lack the flexibility of adapting the activation functions to the distributions of the underlying components as one would ideally do and as is possible in algorithms like equation 1.1. Even worse, this very type of nonlinear function (linear-cubic) has one major drawback: potential sensitivity to outliers. This effect did not manifest itself in the examples presented in this article, but it could indeed show up in other data sets.

Appendix: Derivation and Implementation of MaxKurt

A.1 Givens Angles for MaxKurt. An explicit form of the MaxKurt contrast as a function of the Givens angles is derived. For conciseness, we denote [ijkl] = Q^Y_{ijkl} and we define

\[
a_{ij} = \frac{[iiii] + [jjjj] - 6[iijj]}{4}, \qquad
b_{ij} = [iiij] - [jjji], \qquad
\lambda_{ij} = \sqrt{a_{ij}^2 + b_{ij}^2}. \qquad (A.1)
\]

The sum of the kurtoses of the pair of variables Y_i and Y_j after they have been rotated by an angle θ depends on θ as follows, where we set c = cos(θ) and s = sin(θ), and where \overset{c}{=} denotes equality up to an additive constant that does not depend on θ:

\[
\begin{aligned}
k(\cos(\theta)Y_i + \sin(\theta)Y_j) + k(-\sin(\theta)Y_i + \cos(\theta)Y_j) \qquad & (A.2) \\
= c^4[iiii] + 4c^3 s[iiij] + 6c^2 s^2[iijj] + 4cs^3[ijjj] + s^4[jjjj] \qquad & (A.3) \\
\; + s^4[iiii] - 4s^3 c[iiij] + 6s^2 c^2[iijj] - 4sc^3[ijjj] + c^4[jjjj] \qquad & (A.4) \\
= (c^4 + s^4)\bigl([iiii] + [jjjj]\bigr) + 12c^2 s^2[iijj] + 4cs(c^2 - s^2)\bigl([iiij] - [jjji]\bigr) \qquad & (A.5) \\
\overset{c}{=} -8c^2 s^2\,\frac{[iiii] + [jjjj] - 6[iijj]}{4} + 4cs(c^2 - s^2)\bigl([iiij] - [jjji]\bigr) \qquad & (A.6) \\
= -2\sin^2(2\theta)\,a_{ij} + 2\sin(2\theta)\cos(2\theta)\,b_{ij}
  \;\overset{c}{=}\; \cos(4\theta)\,a_{ij} + \sin(4\theta)\,b_{ij} \qquad & (A.7) \\
= \lambda_{ij}\bigl(\cos(4\theta)\cos(4\Omega_{ij}) + \sin(4\theta)\sin(4\Omega_{ij})\bigr)
  = \lambda_{ij}\cos\bigl(4(\theta - \Omega_{ij})\bigr), \qquad & (A.8)
\end{aligned}
\]

where the angle 4Ω_ij is defined by

\[
\cos(4\Omega_{ij}) = \frac{a_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}, \qquad
\sin(4\Omega_{ij}) = \frac{b_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}. \qquad (A.9)
\]

This is obtained by using the multilinearity and the symmetries of the cumulants at lines A.3 and A.4, followed by elementary trigonometry. In particular, on the pair (i, j), the sum of kurtoses is maximized by taking θ = Ω_ij, which is the Givens angle used in the implementation below.

If Y_i and Y_j are zero-mean and sphered, EY_iY_j = δ_ij, we have [iiii] = Q^Y_{iiii} = EY_i^4 − 3(EY_i^2)^2 = EY_i^4 − 3 and, for i ≠ j, [iiij] = Q^Y_{iiij} = EY_i^3 Y_j as well as [iijj] = Q^Y_{iijj} = EY_i^2 Y_j^2 − 1. Hence an alternate expression for a_ij and b_ij is:

\[
a_{ij} = \frac{1}{4}\,E\bigl(Y_i^4 + Y_j^4 - 6Y_i^2 Y_j^2\bigr), \qquad
b_{ij} = E\bigl(Y_i^3 Y_j - Y_i Y_j^3\bigr). \qquad (A.10)
\]

It may be interesting to note that all the moments required to determine the Givens angle for a given pair (i, j) can be expressed in terms of the two variables ξ_ij = Y_iY_j and η_ij = Y_i^2 − Y_j^2. Indeed, it is easily checked that for a zero-mean sphered pair (Y_i, Y_j), one has

\[
a_{ij} = \frac{1}{4}\,E\bigl(\eta_{ij}^2 - 4\xi_{ij}^2\bigr), \qquad
b_{ij} = E\bigl(\eta_{ij}\,\xi_{ij}\bigr). \qquad (A.11)
\]

A.2 A Simple Matlab Implementation of MaxKurt. A Matlab implementation could be as follows, where we have tried to maximize readability but not numerical efficiency:

function Y = maxkurt(X)
%
[n T] = size(X) ;
Y = X - mean(X,2)*ones(1,T);            % Remove the mean
Y = inv(sqrtm(X*X'/T))*Y ;              % Sphere the data
encore = 1 ;                            % Go for first sweep
while encore, encore=0;
  for p=1:n-1,                          % These two loops go
    for q=p+1:n,                        % through all pairs
      xi  = Y(p,:).*Y(q,:);
      eta = Y(p,:).*Y(p,:) - Y(q,:).*Y(q,:);
      Omega = atan2( 4*(eta*xi'), eta*eta' - 4*(xi*xi') );
      if abs(Omega) > 0.1/sqrt(T)       % A 'statistically small' angle
        encore = 1 ;                    % This will not be the last sweep
        c = cos(Omega/4);
        s = sin(Omega/4);
        Y([p q],:) = [ c s ; -s c ] * Y([p q],:) ;  % Plane rotation
      end
    end
  end
end
return
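As a usage note (ours): for an n × T data array X, the call Y = maxkurt(X) returns sphered components rotated so as to maximize the sum of their kurtoses; as pointed out in section 5, this is appropriate only when the components being sought have positive kurtosis. The test against 0.1/sqrt(T) simply stops the sweeps once every remaining rotation angle is statistically small.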

Acknowledgments

We express our thanks to Scott Makeig and to his coworkers for releasing to the public domain some of their codes and the ERP data set. We also thank the anonymous reviewers whose constructive comments greatly helped in revising the first version of this article.


References

Amari, S.-I. (1996). Neural learning in structured parameter spaces: Natural Riemannian gradient. In Proc. NIPS.

Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1004–1034.

Cardoso, J.-F. (1992). Fourth-order cumulant structure forcing. Application to blind array processing. In Proc. 6th SSAP workshop on statistical signal and array processing (pp. 136–139).

Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc. EUSIPCO (pp. 776–779). Edinburgh.

Cardoso, J.-F. (1995). The equivariant approach to source separation. In Proc. NOLTA (pp. 55–60).

Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proc. of the IEEE. Special issue on blind identification and estimation.

Cardoso, J.-F., Bose, S., & Friedlander, B. (1996). On optimal source separation based on second and fourth order cumulants. In Proc. IEEE Workshop on SSAP. Corfu, Greece.

Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Sig. Proc., 44, 3017–3030.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 140, 362–370.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.

Comon, P. (1997). Cumulant tensors. In Proc. IEEE SP Workshop on HOS. Banff, Canada.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

De Lathauwer, L., De Moor, B., & Vandewalle, J. (1996). Independent component analysis based on higher-order statistics only. In Proc. IEEE SSAP Workshop (pp. 356–359).

Gaeta, M., & Lacoume, J. L. (1990). Source separation without a priori knowledge: The maximum likelihood solution. In Proc. EUSIPCO (pp. 621–624).

Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins University Press.

MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript.

Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Nat. Acad. Sci. USA, 94, 10979–10984.

McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.

Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 19–46.

Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP'97 (pp. 3617–3620).

Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. NETWORK, 5, 565–581.

Obradovic, D., & Deco, G. (1997). Unsupervised learning for blind source separation: An information-theoretic approach. In Proc. ICASSP (pp. 127–130).

Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (Hong Kong).

Pham, D.-T. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. on Sig. Proc., 44, 2768–2779.

Pham, D.-T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Tr. SP, 45, 1712–1725.

Souloumiac, A., & Cardoso, J.-F. (1991). Comparaison de méthodes de séparation de sources. In Proc. GRETSI, Juan les Pins, France (pp. 661–664).

Received February 7, 1998; accepted June 25, 1998.