LETTER  Communicated by Anthony Bell

High-Order Contrasts for Independent Component Analysis

Jean-François Cardoso
École Nationale Supérieure des Télécommunications, 75634 Paris Cedex 13, France

Neural Computation 11, 157–192 (1999)  © 1999 Massachusetts Institute of Technology
This article considers high-order measures of independence for the independent component analysis problem and discusses the class of Jacobi algorithms for their optimization. Several implementations are discussed. We compare the proposed approaches with gradient-based techniques from the algorithmic point of view and also on a set of biomedical data.
1 Introduction
Given an n × 1 random vector X, independent component analysis (ICA) consists of finding a basis of R^n on which the coefficients of X are as independent as possible (in some appropriate sense). The change of basis can be represented by an n × n matrix B and the new coefficients given by the entries of vector Y = BX. When the observation vector X is modeled as a linear superposition of source signals, matrix B is understood as a separating matrix, and vector Y = BX is a vector of source signals. Two key issues of ICA are the definition of a measure of independence and the design of algorithms to find the change of basis (or separating matrix) B optimizing this measure.
Many recent contributions to the ICA problem in the neural network literature describe stochastic gradient algorithms involving, as an essential device in their learning rule, a nonlinear activation function. Other ideas for ICA, most of them found in the signal processing literature, exploit the algebraic structure of high-order moments of the observations. They are often regarded as being unreliable, inaccurate, slowly convergent, and utterly sensitive to outliers. As a matter of fact, it is fairly easy to devise an ICA method displaying all these flaws and working only on carefully generated synthetic data sets. This may be the reason that cumulant-based algebraic methods are largely ignored by the researchers of the neural network community involved in ICA. This article tries to correct this view by showing how high-order correlations can be efficiently exploited to reveal independent components.
This article describes several ICA algorithms that may be called Jacobi algorithms because they seek to maximize measures of independence by a technique akin to the Jacobi method of diagonalization. These measures of independence are based on fourth-order correlations between the entries of Y. As a benefit, these algorithms evade the curse of gradient descent:
they can move in macroscopic steps through the parameter space. They also have other benefits and drawbacks, which are discussed in the article and summarized in a final section. Before outlining the content of this article, we briefly review some gradient-based ICA methods and the notion of contrast function.
1.1 Gradient Techniques for ICA. Many online solutions for ICA that have been proposed recently have the merit of a simple implementation. Among these adaptive procedures, a specific class can be singled out: algorithms based on a multiplicative update of an estimate B(t) of B. These algorithms update a separating matrix B(t) on reception of a new sample x(t) according to the learning rule
y(t) = B(t)x(t),    B(t + 1) = (I − µ_t H(y(t))) B(t), (1.1)
where I denotes the n × n identity matrix, {µ_t} is a scalar sequence of positive learning steps, and H: R^n → R^{n×n} is a vector-to-matrix function. The stationary points of such algorithms are characterized by the condition that the update has zero mean, that is, by the condition,
EH(Y) = 0. (1.2)
The online scheme, in equation 1.1, can be (and often is) implemented in an off-line manner. Using T samples X(1), ..., X(T), one goes through the following iterations, where the field H is averaged over all the data points:
1. Initialization. Set y(t) = x(t) for t = 1, ..., T.
2. Estimate the average field. H = (1/T) Σ_{t=1}^{T} H(y(t)).

3. Update. If H is small enough, stop; else update each data point y(t) by y(t) ← (I − µH)y(t) and go to 2.
The algorithm stops for an (arbitrarily) small value of the average field: it solves the estimating equation,

(1/T) Σ_{t=1}^{T} H(y(t)) = 0, (1.3)

which is the sample counterpart of the stationarity condition in equation 1.2.
Both the online and off-line schemes are gradient algorithms: the mapping H(·) can be obtained as the gradient (the relative gradient [Cardoso & Laheld, 1996] or Amari's natural gradient [1996]) of some contrast function, that is, a real-valued measure of how far the distribution of Y is from some ideal distribution, typically a distribution of independent components.
In particular, the gradient of the infomax-maximum likelihood (ML) contrast yields a function H(·) in the form

H(y) = ψ(y)y† − I, (1.4)

where ψ(y) is an n × 1 vector of component-wise nonlinear functions with ψ_i(·) taken to be minus the log derivative of the density of the ith component (see Amari, Cichocki, & Yang, 1996, for the online version and Pham & Garat, 1997, for a batch technique).
1.2 The Orthogonal Approach to ICA. In the search for independent components, one may decide, as in principal component analysis (PCA), to request exact decorrelation (second-order independence) of the components: matrix B should be such that Y = BX is "spatially white," that is, its covariance matrix is the identity matrix. The algorithms described in this article take this design option, which we call the orthogonal approach.
It must be stressed that components that are as independent as possible according to some measure of independence are not necessarily uncorrelated, because exact independence cannot be achieved in most practical applications. Thus, if decorrelation is desired, it must be enforced explicitly; the algorithms described below optimize, under the whiteness constraint, approximations of the mutual information and of other contrast functions (possibly designed to take advantage of the whiteness constraint).
One practical reason for considering the orthogonal approach is that off-line contrast optimization may be simplified by a two-step procedure as follows. First, a whitening (or "sphering") matrix W is computed and applied to the data. Since the new data are spatially white and one is also looking for a white vector Y, the latter can be obtained only by an orthonormal transformation V of the whitened data, because only orthonormal transforms can preserve the whiteness. Thus, in such a scheme, the separating matrix B is found as a product B = VW. This approach leads to interesting implementations because the whitening matrix can be obtained straightforwardly as any matrix square root of the inverse covariance matrix of X, and the optimization of a contrast function with respect to an orthonormal matrix can also be implemented efficiently by the Jacobi technique described in section 4.
The orthonormal approach to ICA need not be implemented as a two-stage Jacobi-based procedure; it also exists as a one-stage gradient algorithm (see also Cardoso & Laheld, 1996). Assume that the relative/natural gradient of some contrast function leads to a particular function H(·) for the update rule, equation 1.1, with stationary points given by equation 1.2. Then the stationary points for the optimization of the same contrast function with respect to orthonormal transformations are characterized by E{H(Y) − H(Y)†} = 0, where the superscript † denotes transposition. On the other hand, for zero-mean variables, the whiteness constraint is EYY† = I, which we can also
write as EYY† − I = 0. Because EYY† − I is a symmetric matrix while E{H(Y) − H(Y)†} is a skew-symmetric matrix, the whiteness condition and the stationarity condition can be combined into a single one by just adding them. The resulting condition is E{YY† − I + H(Y) − H(Y)†} = 0. When it holds true, both the symmetric part and the skew-symmetric part cancel; the former expresses that Y is white, the latter that the contrast function is stationary with respect to all orthonormal transformations.
Thus, if the algorithm in equation 1.1 optimizes a given contrast function with H given by equation 1.4, then the same algorithm optimizes the same contrast function under the whiteness constraint with H given by
H(y) = yy† − I + ψ(y)y† − yψ(y)†. (1.5)
It is thus simple to implement orthogonal versions of gradient algorithms once a regular version is available.
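As an illustration, here is a minimal NumPy sketch of the off-line relative-gradient iteration of equation 1.1 using the symmetrized field of equation 1.5; the function name, the step size, and the choice ψ(y) = tanh(y) are assumptions made for the example, not the paper's code:

```python
import numpy as np

def orthogonal_relative_gradient_ica(X, mu=0.01, n_iter=500, tol=1e-6):
    """Sketch: off-line relative-gradient ICA under the whiteness
    constraint. X is an (n, T) data matrix; psi(y) = tanh(y) is an
    assumed component-wise nonlinearity."""
    Y = X.copy()
    n, T = Y.shape
    I = np.eye(n)
    for _ in range(n_iter):
        psiY = np.tanh(Y)
        C = Y @ Y.T / T - I           # sample average of yy† − I
        G = psiY @ Y.T / T            # sample average of psi(y) y†
        H = C + G - G.T               # equation 1.5, averaged over samples
        if np.linalg.norm(H) < tol:   # stop when the average field is small
            break
        Y = (I - mu * H) @ Y          # multiplicative update, equation 1.1
    return Y
```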
1.3 Data-Based Versus Statistic-Based Techniques. Comon (1994) compares the data-based option and the statistic-based option for computing off-line an ICA of a batch x(1), ..., x(T) of T samples; this article will also introduce a mixed strategy (see section 4.3). In the data-based option, successive linear transformations are applied to the data set until some criterion of independence is maximized. This is the iterative technique outlined above. Note that it is not necessary to update explicitly a separating matrix B in this scheme (although one may decide to do so in a particular implementation); the data themselves are updated until the average field (1/T) Σ_{t=1}^{T} H(y(t)) is small enough; the transform B is implicitly contained in the set of transformed data.
Another option is to summarize the data set into a smaller set of statistics computed once and for all from the data set; the algorithm then estimates a separating matrix as a function of these statistics without accessing the data. This option may be followed in cumulant-based algebraic techniques where the statistics are cumulants of X.
1.4 Outline of the Article. In section 2, the ICA problem is recast in the framework of (blind) identification, showing how entropic contrasts readily stem from the maximum likelihood (ML) principle. In section 3, high-order approximations to the entropic contrasts are given, and their algebraic structure is emphasized. Section 4 describes different flavors of Jacobi algorithms optimizing fourth-order contrast functions. A comparison between Jacobi techniques and a gradient-based algorithm is given in section 5, based on a real data set of electroencephalogram (EEG) recordings.
2 Contrast Functions and Maximum Likelihood Identification
Implicitly or explicitly, ICA tries to fit a model for the distribution of X that is a model of independent components: X = AS, where A is an invertible n × n matrix and S is an n × 1 vector with independent entries. Estimating the parameter A from samples of X yields a separating matrix B = A^{−1}. Even if the model X = AS is not expected to hold exactly for many real data sets, one can still use it to derive contrast functions. This section exhibits the contrast functions associated with the estimation of A by the ML principle (a more detailed exposition can be found in Cardoso, 1998). Blind separation based on ML was first considered by Gaeta and Lacoume (1990) (but the authors used cumulant approximations such as those described in section 3), Pham and Garat (1997), and Amari et al. (1996).
2.1 Likelihood. Assume that the probability distribution of each entry S_i of S has a density r_i(·).¹ Then, the distribution P_S of the random vector S has a density r(·) in the form r(s) = ∏_{i=1}^{n} r_i(s_i), and the density of X for a given mixture A and a given probability density r(·) is:

p(x; A, r) = |det A|^{−1} r(A^{−1}x), (2.1)
so that the (normalized) log-likelihood L_T(A, r) of T independent samples x(1), ..., x(T) of X is

L_T(A, r) := (1/T) Σ_{t=1}^{T} log p(x(t); A, r) = (1/T) Σ_{t=1}^{T} log r(A^{−1}x(t)) − log |det A|. (2.2)
Depending on the assumptions made about the densities r_1, ..., r_n, several contrast functions can be derived from this log-likelihood.
2.2 Likelihood Contrast. Under mild assumptions, the normalized log-likelihood L_T(A, r), which is a sample average, converges for large T to its ensemble average by the law of large numbers:

L_T(A, r) = (1/T) Σ_{t=1}^{T} log r(A^{−1}x(t)) − log |det A|  →  E log r(A^{−1}x) − log |det A|  as T → ∞, (2.3)
¹ All densities considered in this article are with respect to the Lebesgue measure on R or R^n.
which simple manipulations (Cardoso, 1997) show to be equal to −H(P_X) − K(P_Y|P_S). Here and in the following, H(·) and K(·|·), respectively, denote the differential entropy and the Kullback-Leibler divergence. Since H(P_X) does not depend on the model parameters, the limit for large T of −L_T(A, r) is, up to a constant, equal to

φ_ML(Y) := K(P_Y|P_S). (2.4)

Therefore, the principle of ML coincides with the minimization of a specific contrast function, which is nothing but the (Kullback) divergence K(P_Y|P_S) between the distribution P_Y of the output and a model distribution P_S.

The classic entropic contrasts follow from this observation, depending on two options: (1) trying or not to estimate P_S from the data and (2) forcing or not the components to be uncorrelated.
2.3 Infomax. The technically simplest statistical assumption about P_S is to select fixed densities r_1, ..., r_n for each component, possibly on the basis of prior knowledge. Then P_S is a fixed distributional assumption, and the minimization of φ_ML(Y) is performed only over P_Y via Y = BX. This can be rephrased: Choose B such that Y = BX is as close as possible in distribution to the hypothesized model distribution P_S, the closeness in distribution being measured in the Kullback divergence. This is also the contrast function derived from the infomax principle by Bell and Sejnowski (1995). The connection between infomax and ML was noted in Cardoso (1997), MacKay (1996), and Pearlmutter and Parra (1996).
2.4 Mutual Information. The theoretically simplest statistical assumption about P_S is to assume no model at all. In this case, the Kullback mismatch K(P_Y|P_S) should be minimized not only by optimizing over B to change the distribution of Y = BX but also with respect to P_S. For each fixed B, that is, for each fixed distribution P_Y, the result of this minimization is theoretically very simple: the minimum is reached when P_S = P_Ȳ, which denotes the distribution of independent components with each marginal distribution equal to the corresponding marginal distribution of Y. This stems from the property that

K(P_Y|P_S) = K(P_Y|P_Ȳ) + K(P_Ȳ|P_S) (2.5)

for any distribution P_S with independent components (Cover & Thomas, 1991). Therefore, the minimum in P_S of K(P_Y|P_S) is reached by taking P_S = P_Ȳ, since this choice ensures K(P_Ȳ|P_S) = 0. The value of φ_ML at this point then is

φ_MI(Y) := min_{P_S} K(P_Y|P_S) = K(P_Y|P_Ȳ). (2.6)
We use the index MI since this quantity is well known as the mutual information between the entries of Y. It was first proposed by Comon (1994), and it can be seen from the above as deriving from the ML principle when optimization is with respect to both the unknown system A and the distribution of S. This connection was also noted in Obradovic and Deco (1997), and the relation between infomax and mutual information is also discussed in Nadal and Parga (1994).
2.5 Minimum Marginal Entropy. An orthogonal contrast φ(Y) is, by definition, to be optimized under the constraint that Y is spatially white: orthogonal contrasts enforce decorrelation, that is, an exact "second-order" independence. Any regular contrast can be used under the whiteness constraint, but by taking the whiteness constraint into account, the contrast may be given a simpler expression. This is the case of some cumulant-based contrasts described in section 3. It is also the case of φ_MI(Y) because the mutual information can also be expressed as φ_MI(Y) = Σ_{i=1}^{n} H(P_{Y_i}) − H(P_Y); since the entropy H(P_Y) is constant under orthonormal transforms, it is equivalent to consider

φ_ME(Y) = Σ_{i=1}^{n} H(P_{Y_i}) (2.7)

to be optimized under the whiteness constraint EYY† = I. This contrast could be called orthogonal mutual information, or the marginal entropy contrast. The minimum entropy idea holds more generally under any volume-preserving transform (Obradovic & Deco, 1997).
2.6 Empirical Contrast Functions. Among all the above contrasts, only φ_ML or its orthogonal version are easily optimized by a gradient technique, because the relative gradient of φ_ML simply is the matrix EH(Y) with H(·) defined in equation 1.4. Therefore, the relative gradient algorithm, equation 1.1, can be employed using either this function H(·) or its symmetrized form, equation 1.5, if one chooses to enforce decorrelation. However, this contrast is based on a prior guess P_S about the distribution of the components. If the guess is too far off, the algorithm will fail to discover independent components that might be present in the data. Unfortunately, evaluating the gradient of contrasts based on mutual information or minimum marginal entropy is more difficult because it does not reduce to the expectation of a simple function of Y; for instance, Pham (1996) minimizes explicitly the mutual information, but the algorithm involves a kernel estimation of the marginal distributions of Y. An intermediate approach is to consider a parametric estimation of these distributions, as in Moulines, Cardoso, and Gassiat (1997) or Pearlmutter and Parra (1996), for instance. Therefore, all these contrasts require that the distributions of components be known, or approximated, or estimated. As we shall see next, this is also what the cumulant approximations to contrast functions are implicitly doing.
3 Cumulants
This section presents higher-order approximations to entropic contrasts, some known and some novel. To keep the exposition simple, it is restricted to symmetric distributions (for which odd-order cumulants are identically zero) and to cumulants of orders 2 and 4. Recall that for random variables X_1, ..., X_4, the second-order cumulants are Cum(X_1, X_2) := E X̄_1X̄_2, where X̄_i := X_i − EX_i, and the fourth-order cumulants are

Cum(X_1, X_2, X_3, X_4) = E X̄_1X̄_2X̄_3X̄_4 − E X̄_1X̄_2 E X̄_3X̄_4 − E X̄_1X̄_3 E X̄_2X̄_4 − E X̄_1X̄_4 E X̄_2X̄_3. (3.1)
The variance and the kurtosis of a real random variable X are defined as

σ²(X) := Cum(X, X) = E X̄²,   k(X) := Cum(X, X, X, X) = E X̄⁴ − 3 (E X̄²)², (3.2)

that is, they are the second- and fourth-order autocumulants. A cumulant involving at least two different variables is called a cross-cumulant.
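For concreteness, here is a small NumPy sketch (the helper names are ours, not the paper's) of the sample versions of the autocumulants in equation 3.2:

```python
import numpy as np

def variance(x):
    """sigma^2(X) = Cum(X, X) = E Xbar^2 (equation 3.2)."""
    xc = x - x.mean()                 # Xbar = X − EX
    return np.mean(xc**2)

def kurtosis(x):
    """k(X) = Cum(X, X, X, X) = E Xbar^4 − 3 (E Xbar^2)^2 (equation 3.2)."""
    xc = x - x.mean()
    return np.mean(xc**4) - 3.0 * np.mean(xc**2)**2
```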
3.1 Cumulant-Based Approximations to Entropic Contrasts. Cumulants are useful in many ways. In this section, they show up because the probability density of a scalar random variable U close to the standard normal n(u) = (2π)^{−1/2} exp(−u²/2) can be approximated as

p(u) ≈ n(u) (1 + ((σ²(U) − 1)/2) h_2(u) + (k(U)/4!) h_4(u)), (3.3)
where h_2(u) = u² − 1 and h_4(u) = u⁴ − 6u² + 3, respectively, are the second- and fourth-order Hermite polynomials. This expression is obtained by retaining the leading terms in an Edgeworth expansion (McCullagh, 1987). If U and V are two real random variables with distributions close to the standard normal, one can, at least formally, use expansion 3.3 to derive an approximation to K(P_U|P_V). This is

K(P_U|P_V) ≈ (1/4)(σ²(U) − σ²(V))² + (1/48)(k(U) − k(V))², (3.4)
which shows how the pair (σ², k) of cumulants of orders 2 and 4 plays in some sense the role of a local coordinate system around n(u), with the quadratic form 3.4 playing the role of a local metric. This result generalizes to multivariates, in which case we denote for conciseness R^U_{ij} = Cum(U_i, U_j) and Q^U_{ijkl} = Cum(U_i, U_j, U_k, U_l), and similarly for another random n-vector V with entries V_1, ..., V_n. We give without proof the following approximation:

K(P_U|P_V) ≈ K_24(P_U|P_V) := (1/4) Σ_{ij} (R^U_{ij} − R^V_{ij})² + (1/48) Σ_{ijkl} (Q^U_{ijkl} − Q^V_{ijkl})². (3.5)
Expression 3.5 turns out to be the simplest possible multivariate generalization of equation 3.4 (the two terms in equation 3.5 are a double sum over all the n² pairs of indices and a quadruple sum over all the n⁴ quadruples of indices). Since the entropic contrasts listed above have all been derived from the Kullback divergence, cumulant approximations to all these contrasts can be obtained by replacing the Kullback mismatch K(P_U|P_V) by a cruder measure: its approximation by the cumulant mismatch of equation 3.5.
3.1.1 Approximation to the Likelihood Contrast. The infomax-ML contrast φ_ML(Y) = K(P_Y|P_S) for ICA (see equation 2.4) is readily approximated by using expression 3.5. The assumption P_S on the distribution of S is now replaced by an assumption about the cumulants of S. This amounts to very little: all the cross-cumulants of S being 0 thanks to the assumption of independent sources, it is needed only to specify the autocumulants σ²(S_i) and k(S_i). The cumulant approximation (see equation 3.5) to the infomax-ML contrast becomes:

φ_ML(Y) ≈ K_24(P_Y|P_S) = (1/4) Σ_{ij} (R^Y_{ij} − σ²(S_i)δ_{ij})² + (1/48) Σ_{ijkl} (Q^Y_{ijkl} − k(S_i)δ_{ijkl})², (3.6)

where the Kronecker symbol δ equals 1 with identical indices and 0 otherwise.
3.1.2 Approximation to the Mutual Information Contrast. The mutual information contrast φ_MI(Y) was obtained by minimizing K(P_Y|P_S) over all the distributions P_S with independent components. In the cumulant approximation, this is trivially done: the free parameters for P_S are σ²(S_i) and k(S_i). Each of these scalars enters in only one term of the sums in equation 3.6, so that the minimization is achieved for σ²(S_i) = R^Y_{ii} and k(S_i) = Q^Y_{iiii}. In other words, the construction of the best approximating distribution with independent marginals P_Ȳ, which appears in equation 2.5, boils down, in the cumulant approximation, to the estimation of the variance and kurtosis of each entry of Y. Fitting both σ²(S_i) and k(S_i) to R^Y_{ii} and Q^Y_{iiii}, respectively, has the effect of exactly cancelling the diagonal terms in equation 3.6, leaving only

φ_MI(Y) ≈ φ^{MI}_{24}(Y) := (1/4) Σ_{ij≠ii} (R^Y_{ij})² + (1/48) Σ_{ijkl≠iiii} (Q^Y_{ijkl})², (3.7)

which is our cumulant approximation to the mutual information contrast in equation 2.6. The first term is understood as the sum over all the pairs of distinct indices; the second term is a sum over all quadruples of indices that are not all identical. It contains only off-diagonal terms, that is, cross-cumulants. Since cross-cumulants of independent variables identically vanish, it is not surprising to see the mutual information approximated by a sum of squared cross-cumulants.
3.1.3 Approximation to the Orthogonal Likelihood Contrast. The cumulant approximation to the orthogonal likelihood is fairly simple. The orthogonal approach consists of first enforcing the whiteness of Y, that is, R^Y_{ij} = δ_{ij} or R^Y = I. In other words, it consists of normalizing the components by assuming that σ²(S_i) = 1 and making sure the second-order mismatch is zero. This is equivalent to replacing the weight 1/4 in equation 3.6 by an infinite weight, hence reducing the problem to the minimization (under the whiteness constraint) of the fourth-order mismatch, or the second (quadruple) sum in equation 3.6. Thus, the orthogonal likelihood contrast is approximated by

φ^{OML}_{24}(Y) := (1/48) Σ_{ijkl} (Q^Y_{ijkl} − k(S_i)δ_{ijkl})². (3.8)
This contrast has an interesting alternate expression. Developing the squares gives

φ^{OML}_{24}(Y) = (1/48) Σ_{ijkl} (Q^Y_{ijkl})² + (1/48) Σ_{ijkl} k²(S_i)δ²_{ijkl} − (2/48) Σ_{ijkl} k(S_i)δ_{ijkl}Q^Y_{ijkl}.

The first sum above is constant under the whiteness constraint (this is readily checked using equation 3.13 for an orthonormal transform), and the second sum does not depend on Y; finally, the last sum contains only diagonal nonzero terms. It follows that:

φ^{OML}_{24}(Y) c= −(1/24) Σ_i k(S_i) Q^Y_{iiii} = −(1/24) Σ_i k(S_i) k(Y_i) c= −(1/24) Σ_i k(S_i) E Y_i⁴, (3.9)
where · c= · denotes an equality up to a constant. An interpretation of the second equality is that the contrast is minimized by maximizing the scalar product between the vector [k(Y_1), ..., k(Y_n)] of the kurtosis of the components and the corresponding vector of hypothesized kurtosis [k(S_1), ..., k(S_n)]. The last equality stems from the definition in equation 3.2 of the kurtosis and the constancy of E Y_i² under the whiteness constraint. This last form is remarkable because it shows that for zero-mean observations, φ^{OML}_{24}(Y) c= E l(Y), where l(Y) = −(1/24) Σ_i k(S_i) Y_i⁴, so the contrast is just the expectation of a simple function of Y. We can expect simple techniques for its maximization.
3.1.4 Approximation to the Minimum Marginal Entropy Contrast. Under the whiteness constraint, the first sum in the approximation, equation 3.7, is zero (this is the whiteness constraint), so that the approximation to mutual information φ_MI(Y) reduces to the last term:

φ_ME(Y) ≈ φ^{ME}_{24}(Y) := (1/48) Σ_{ijkl≠iiii} (Q^Y_{ijkl})² c= −(1/48) Σ_i (Q^Y_{iiii})². (3.10)
Again, the last equality up to a constant follows from the constancy of Σ_{ijkl} (Q^Y_{ijkl})² under the whiteness constraint. These approximations had already been obtained by Comon (1994) from an Edgeworth expansion. They say something simple: Edgeworth expansions suggest testing the independence between the entries of Y by summing up all the squared cross-cumulants.
In the course of this article, we will find two similar contrast functions. The JADE contrast,

φ_JADE(Y) := Σ_{ijkl≠iikl} (Q^Y_{ijkl})², (3.11)

also is a sum of squared cross-cumulants (the notation indicates that the sum is over all the quadruples (ijkl) of indices with i ≠ j). Its interest is to be also a criterion of joint diagonality of cumulant matrices. The SHIBBS criterion,

φ_SH(Y) := Σ_{ijkl≠iikk} (Q^Y_{ijkl})², (3.12)

is also introduced in section 4.3 as governing a similar but less memory-demanding algorithm. It also involves only cross-cumulants: those with indices (ijkl) such that i ≠ j or k ≠ l.
3.2 Cumulants and Algebraic Structures. Previous sections reviewed the use of cumulants in designing contrast functions. Another thread of ideas using cumulants stems from the method of moments. Such an approach is called for by the multilinearity of the cumulants. Under a linear transform Y = BX, which also reads Y_i = Σ_p b_{ip}X_p, the cumulants of order 4 (for instance) transform as:

Cum(Y_i, Y_j, Y_k, Y_l) = Σ_{pqrs} b_{ip}b_{jq}b_{kr}b_{ls} Cum(X_p, X_q, X_r, X_s), (3.13)
which can easily be exploited for our purposes since the ICA model is linear. Using this fact and the assumption of independence, by which Cum(S_p, S_q, S_r, S_s) = k(S_p)δ(p, q, r, s), we readily obtain the simple algebraic structure of the cumulants of X = AS when S has independent entries,

Cum(X_i, X_j, X_k, X_l) = Σ_{u=1}^{n} k(S_u) a_{iu}a_{ju}a_{ku}a_{lu}, (3.14)

where a_{ij} denotes the (ij)th entry of matrix A. When estimates Cum(X_i, X_j, X_k, X_l) are available, one may try to solve equation 3.14 in the coefficients a_{ij} of A. This is tantamount to cumulant matching on the empirical cumulants of X. Because of the strong algebraic structure of equation 3.14, one may try to devise fourth-order factorizations akin to the familiar second-order singular value decomposition (SVD) or eigenvalue decomposition (EVD) (see Cardoso, 1992; Comon, 1997; De Lathauwer, De Moor, & Vandewalle, 1996). However, these approaches are generally not equivalent to the optimization of a contrast function, resulting in estimates that are generally not equivariant (Cardoso, 1995). This point is illustrated below; we introduce cumulant matrices whose simple structure offers straightforward identification techniques, but we stress, as one of their important drawbacks, their lack of equivariance. However, we conclude by showing how the algebraic point of view and the statistical (equivariant) point of view can be reconciled.
3.2.1 Cumulant Matrices. The algebraic nature of cumulants is tensorial (McCullagh, 1987), but since we will concern ourselves mainly with second- and fourth-order statistics, a matrix-based notation suffices for the purpose of our exposition; we only introduce the notion of a cumulant matrix, defined as follows. Given a random n × 1 vector X and any n × n matrix M, we define the associated cumulant matrix Q_X(M) as the n × n matrix defined component-wise by

[Q_X(M)]_{ij} := Σ_{k,l=1}^{n} Cum(X_i, X_j, X_k, X_l) M_{kl}. (3.15)

If X is centered, the definition in equation 3.1 shows that

Q_X(M) = E{(X†MX) XX†} − R_X tr(MR_X) − R_X M R_X − R_X M† R_X, (3.16)
where tr(·) denotes the trace and R_X denotes the covariance matrix of X, that is, [R_X]_{ij} = Cum(X_i, X_j). Equation 3.16 could have been chosen as an index-free definition of cumulant matrices. It shows that a given cumulant matrix can be computed or estimated at a cost similar to the estimation cost of a covariance matrix; there is no need to compute the whole set of fourth-order cumulants to obtain the value of Q_X(M) for a particular value of M. Actually, estimating a particular cumulant matrix is one way of collecting part of the fourth-order information in X; collecting the whole fourth-order information requires the estimation of O(n⁴) fourth-order cumulants.
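Equation 3.16 translates directly into a sample estimator; the following NumPy sketch (function name assumed, samples stored as rows) makes the cost claim concrete:

```python
import numpy as np

def cumulant_matrix(X, M):
    """Sample estimate of Q_X(M) via equation 3.16.
    X: (T, n) array of T samples; M: (n, n) matrix."""
    T, n = X.shape
    Xc = X - X.mean(axis=0)                  # center the data
    R = Xc.T @ Xc / T                        # sample covariance R_X
    w = np.einsum('ti,ij,tj->t', Xc, M, Xc)  # quadratic forms x† M x
    Q = (Xc * w[:, None]).T @ Xc / T         # sample E{(X†MX) XX†}
    return Q - R * np.trace(M @ R) - R @ M @ R - R @ M.T @ R
```

The dominant cost is the O(n²T) accumulation, the same order as estimating a covariance matrix, as claimed above.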
The structure of a cumulant matrix Q_X(M) in the ICA model is easily deduced from equation 3.14:

Q_X(M) = A Δ(M) A†,   Δ(M) = Diag(k(S_1) a_1†Ma_1, ..., k(S_n) a_n†Ma_n), (3.17)

where a_i denotes the ith column of A, that is, A = [a_1, ..., a_n]. In this factorization, the (generally unknown) kurtosis enter only in the diagonal matrix Δ(M), a fact implicitly exploited by the algebraic techniques described below.
3.3 Blind Identification Using Algebraic Structures. In section 3.1, contrast functions were derived from the ML principle assuming the model X = AS. In this section, we proceed similarly: we consider cumulant-based blind identification of A assuming X = AS, from which the structures 3.14 and 3.17 result.

Recall that the orthogonal approach can be implemented by first sphering vector X explicitly. Let W be a whitening matrix, and denote Z := WX the sphered vector. Without loss of generality, the model can be normalized by assuming that the entries of S have unit variance, so that S is spatially white. Since Z = WX = WAS is also white by construction, the matrix U := WA must be orthonormal: UU† = I. Therefore sphering yields the model Z = US with U orthonormal. Of course, this is still a model of independent components, so that, similar to equation 3.17, we have for any matrix M the structure of the corresponding cumulant matrix of Z,

Q_Z(M) = U Δ(M) U†,   Δ(M) = Diag(k(S_1) u_1†Mu_1, ..., k(S_n) u_n†Mu_n), (3.18)

where u_i denotes the ith column of U. In a practical orthogonal statistic-based technique, one would first estimate a whitening matrix W, estimate some cumulants of Z = WX, compute an orthonormal estimate Û of U using these cumulants, and finally obtain an estimate Â of A as Â = W^{−1}Û or obtain a separating matrix as B̂ = Û^{−1}W = Û†W.
3.3.1 Nonequivariant Blind Identification Procedures. We first present two blind identification procedures that exploit in a straightforward manner the structure 3.17; we explain why, in spite of attractive computational simplicity, they are not well behaved (not equivariant) and how they can be fixed for equivariance.
The first idea is not based on an orthogonal approach. Let M_1 and M_2 be two arbitrary n × n matrices, and define Q_1 := Q_X(M_1) and Q_2 := Q_X(M_2). According to equation 3.17, if X = AS we have Q_1 = A Δ_1 A† and Q_2 = A Δ_2 A† with Δ_1 and Δ_2 two diagonal matrices. Thus, G := Q_1Q_2^{−1} = (A Δ_1 A†)(A Δ_2 A†)^{−1} = A Δ A^{−1}, where Δ is the diagonal matrix Δ_1Δ_2^{−1}. It follows that GA = AΔ, meaning that the columns of A are the eigenvectors of G (possibly up to scale factors).

An extremely simple algorithm for blind identification of A follows: Select two arbitrary matrices M_1 and M_2; compute sample estimates Q̂_1 and Q̂_2 using equation 3.16; find the columns of A as the eigenvectors of Q̂_1Q̂_2^{−1}.
There is at least one problem with this idea: we have assumed invertible matrices throughout the derivation, and this may lead to instability. However, this specific problem may be fixed by sphering, as examined next.
Consider now the orthogonal approach as outlined above. Let M be some arbitrary matrix, and note that equation 3.18 is an eigendecomposition: the columns of U are the eigenvectors of Q_Z(M), which are orthonormal indeed because Q_Z(M) is symmetric. Thus, in the orthogonal approach, another immediate algorithm for blind identification is to estimate U as an (orthonormal) diagonalizer of an estimate of Q_Z(M). Thanks to sphering, problems associated with matrix inversion disappear, but a deeper problem associated with these simple algebraic ideas remains and must be addressed. Recall that the eigenvectors are uniquely determined² if and only if the eigenvalues are all distinct. Therefore, we need to make sure that the eigenvalues of Q_Z(M) are all distinct in order to preserve blind identifiability based on Q_Z(M). According to equation 3.18, these eigenvalues depend on the (sphered) system, which is unknown. Thus, it is not possible to determine a priori if a given matrix M corresponds to distinct eigenvalues of Q_Z(M). Of course, if M is randomly chosen, then the eigenvalues are distinct with probability 1, but we need more than this in practice because the algorithms use only sample estimates of the cumulant matrices. A small error in the sample estimate of Q_Z(M) can induce a large deviation of the eigenvectors if the eigenvalues are not well enough separated. Again, this is impossible to guarantee a priori because an appropriate selection of M requires prior knowledge about the unknown mixture.
In summary, the diagonalization of a single cumulant matrix is computationally attractive and can be proved to be almost surely consistent, but it is not satisfactory because the nondegeneracy of the spectrum cannot be controlled. As a result, the estimation accuracy from a finite number of samples depends on the unknown system and is therefore unpredictable in practice; this lack of equivariance is hardly acceptable. One may also criticize these approaches on the ground that they rely on only a small part of the fourth-order information (summarized in an n × n cumulant matrix) rather than trying to exploit more cumulants (there are O(n⁴) fourth-order independent cumulant statistics). We examine next how these two problems can be alleviated by jointly processing several cumulant matrices.

² In fact, determined only up to permutations and signs that do not matter in an ICA context.
3.3.2 Recovering Equivariance. Let M = {M_1, ..., M_P} be a set of P matrices of size n × n, and denote Q_i := Q_Z(M_i) for 1 ≤ i ≤ P the associated cumulant matrices for the sphered data Z = US. Again, as above, for all i we have Q_i = U Δ_i U† with Δ_i a diagonal matrix given by equation 3.18. As a measure of nondiagonality of a matrix F, define Off(F) as the sum of the squares of the nondiagonal elements:

Off(F) := Σ_{i≠j} (f_{ij})². (3.19)

We have in particular Off(U†Q_iU) = Off(Δ_i) = 0 since Q_i = U Δ_i U† and U†U = I. For any matrix set M and any orthonormal matrix V, we define the following nonnegative joint diagonality criterion,

D_M(V) := Σ_{M_i∈M} Off(V†Q_Z(M_i)V), (3.20)

which measures how close to diagonality an orthonormal matrix V can simultaneously bring the cumulant matrices generated by M.
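In code, both quantities take a few lines; a sketch (names assumed), with Qs a list of cumulant matrices [Q_Z(M_1), ..., Q_Z(M_P)]:

```python
import numpy as np

def off(F):
    """Off(F): sum of the squared off-diagonal entries (equation 3.19)."""
    return np.sum(F**2) - np.sum(np.diag(F)**2)

def joint_diagonality(V, Qs):
    """Joint diagonality criterion D_M(V) of equation 3.20."""
    return sum(off(V.T @ Q @ V) for Q in Qs)
```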
To each matrix set M is associated a blind identification algorithm as follows: (1) find a sphering matrix W to whiten the data X into Z = WX; (2) estimate the cumulant matrices Q_Z(M) for all M ∈ M by a sample version of equation 3.16; (3) minimize the joint diagonality criterion, equation 3.20, that is, make the cumulant matrices as diagonal as possible by an orthonormal transform V; (4) estimate A as Â = W^{−1}V, or its inverse as B̂ = V†W, or the component vector as Y = V†Z = V†WX.
Such an approach seems to be able to alleviate the drawbacks mentioned above. Finding the orthonormal transform as the minimizer of a set of cumulant matrices goes in the right direction because it involves a larger number of fourth-order statistics and because it decreases the likelihood of degenerate spectra. This argument can be made rigorous by considering a maximal set of cumulant matrices. By definition, this is a set obtained whenever M is an orthonormal basis for the linear space of n × n matrices. Such a basis contains n² matrices, so that the corresponding cumulant matrices total n² × n² = n⁴ entries, that is, as many as the whole fourth-order cumulant set. For any such maximal set (Cardoso & Souloumiac, 1993):

D_M(V) = φ_JADE(Y) with Y = V†Z, (3.21)

where φ_JADE(Y) is the contrast function defined at equation 3.11. The joint diagonalization of a maximal set guarantees blind identifiability of A if k(S_i) = 0 for at most one entry S_i of S (Cardoso & Souloumiac, 1993). This is a necessary condition for any algorithm using only second- and fourth-order statistics (Comon, 1994).

A key point is made by relationship 3.21. We managed to turn an algebraic property (diagonality) of the cumulants of the (sphered) observations into a contrast function, a functional of the distribution of the output Y = V†Z. This fact guarantees that the resulting estimates are equivariant (Cardoso, 1995).
The price to pay with this technique for reconciling the algebraic approach with the naturally equivariant contrast-based approach is twofold: it entails the computation of a large (actually, maximal) set of cumulant matrices and the joint diagonalization of P = n² matrices, which is at least as costly as P times the diagonalization of a single matrix. However, the overall computational burden may be similar (see examples in section 5) to the cost of adaptive algorithms. This is because the cumulant matrices need to be estimated only once for a given data set and because there exists a reasonably efficient joint diagonalization algorithm (see section 4) that is not based on gradient-style optimization; it thus preserves the possibility of exploiting the underlying algebraic nature of the contrast function, equation 3.11. Several tricks for increasing efficiency are also discussed in section 4.
4 Jacobi Algorithms
This section describes algorithms for ICA sharing a common feature: a Jacobi optimization of an orthogonal contrast function, as opposed to optimization by gradient-like algorithms. The principle of Jacobi optimization is applied to a data-based algorithm, a statistic-based algorithm, and a mixed approach.
The Jacobi method is an iterative technique of optimization over the set of orthonormal matrices. The orthonormal transform is obtained as a sequence of plane rotations. Each plane rotation is a rotation applied to a pair of coordinates (hence the name: the rotation operates in a two-dimensional plane). If Y is an n × 1 vector, the (i, j)th plane rotation by an angle θ_ij changes the coordinates i and j of Y according to

[Y_i]     [ cos(θ_ij)   sin(θ_ij)] [Y_i]
[Y_j]  ←  [−sin(θ_ij)   cos(θ_ij)] [Y_j],    (4.1)

while leaving the other coordinates unchanged. A sweep is one pass through
all the n(n − 1)/2 possible pairs of distinct indices. This idea is classic in numerical analysis (Golub & Van Loan, 1989); it can be considered in a wider context for the optimization of any function of an orthonormal matrix. Comon introduced the Jacobi technique for ICA (see Comon, 1994, for a data-based algorithm and an earlier reference in it for the Jacobi update of high-order cumulant tensors). Such a data-based Jacobi algorithm for ICA works through a sequence of Jacobi sweeps on the sphered data until a given orthogonal contrast φ(Y) is optimized. This can be summarized as:
1. Initialization. Compute a whitening matrix W and set Y = WX.

2. One sweep. For all n(n − 1)/2 pairs, that is, for 1 ≤ i < j ≤ n, do:

   a. Compute the Givens angle θ_ij, optimizing φ(Y) when the pair (Y_i, Y_j) is rotated.

   b. If |θ_ij| > θ_min, rotate the pair (Y_i, Y_j) according to equation 4.1.

3. If no pair has been rotated in the previous sweep, end. Otherwise go to 2 for another sweep.
Thus, the Jacobi approach considers a sequence of two-dimensional ICA problems. Of course, the updating step 2b on a pair (i, j) partially undoes the effect of previous optimizations on pairs containing either i or j. For this reason, it is necessary to go through several sweeps before optimization is completed. However, Jacobi algorithms are often very efficient and converge in a small number of sweeps (see the examples in section 5), and a key point is that each plane rotation depends on a single parameter, the Givens angle θ_ij, reducing the optimization subproblem at each step to a one-dimensional optimization problem.
An important benefit of basing ICA on fourth-order contrasts becomes apparent: because fourth-order contrasts are polynomial in the parameters, the Givens angles can often be found in closed form.
In the above scheme, θ_min is a small angle, which controls the accuracy of the optimization. In numerical analysis, it is determined according to machine precision. For a statistical problem such as ICA, θ_min should be selected in such a way that rotations by a smaller angle are not statistically significant. In our experiments, we take θ_min to scale as 1/√T, typically: θ_min = 10^{−2}/√T. This scaling can be related to the existence of a performance bound in the orthogonal approach to ICA (Cardoso, 1994). This value does not seem to be critical, however, because we have found Jacobi algorithms to be very fast at finishing.
In the remainder of this section, we describe three possible implementations of these ideas. Each one corresponds to a different type of contrast function and to different options about updating. Section 4.1 describes a data-based algorithm optimizing φ^{OML}_{24}(Y); section 4.2 describes a statistic-based algorithm optimizing φ_JADE(Y); section 4.3 presents a mixed approach optimizing φ_SH(Y); finally, section 4.4 discusses the relationships between these contrast functions.
4.1 A Data-Based Jacobi Algorithm: MaxKurt. We start by a Jacobi technique for optimizing the approximation, equation 3.9, to the orthogonal likelihood. For the sake of exposition, we consider a simplified version of φ^{OML}_{24}(Y) obtained by setting k(S_1) = k(S_2) = ... = k(S_n) = k, in which case the minimization of the contrast function, equation 3.9, is equivalent to the minimization of

φ_MK(Y) := −k Σ_i Q^Y_{iiii}. (4.2)
This criterion is also studied by Moreau and Macchi (1996), who propose a two-stage adaptive procedure for its optimization; it also serves as a starting point for introducing the one-stage adaptive algorithm of Cardoso and Laheld (1996).
Denote G_ij(θ) the plane rotation matrix that rotates the pair (i, j) by an angle θ as in step 2b above. Then simple trigonometry yields:

φ_MK(G_ij(θ)Y) = µ_ij − k λ_ij cos(4(θ − Ω_ij)), (4.3)

where µ_ij does not depend on θ and λ_ij is nonnegative. The principal determination of the angle Ω_ij is characterized by

Ω_ij = (1/4) arctan(4Q^Y_{iiij} − 4Q^Y_{ijjj}, Q^Y_{iiii} + Q^Y_{jjjj} − 6Q^Y_{iijj}), (4.4)

where arctan(y, x) denotes the angle α ∈ (−π, π] such that cos(α) = x/√(x² + y²) and sin(α) = y/√(x² + y²). If Y is a zero-mean sphered vector, expression 4.4 further simplifies to

Ω_ij = (1/4) arctan(4 E(Y_i³Y_j − Y_iY_j³), E((Y_i² − Y_j²)² − 4Y_i²Y_j²)). (4.5)
The computations are given in the appendix. It is now immediate to minimize φ_MK(Y) for each pair of components and for either choice of the sign of k. If one looks for components with positive kurtosis (often called supergaussian), the minimization of φ_MK(Y) is identical to the maximization of the sum of the kurtosis of the components, since we have k > 0 in this case. The Givens angle simply is θ = Ω_ij, since this choice makes the cosine in equation 4.3 equal to its maximum value.
We refer to the Jacobi algorithm outlined above as MaxKurt. A Matlab implementation is listed in the appendix, whose simplicity is consistent with the data-based approach. Note, however, that it is also possible to use the same computations in a statistic-based algorithm. Rather than rotating the data themselves at each step by equation 4.1, one instead updates the set of all fourth-order cumulants according to the transformation law, equation 3.13, with the Givens angle for each pair still given by equation 4.3. In this case, the memory requirement is O(n⁴) for storing all the cumulants as opposed to nT for storing the data set.
The case k < 0 where, looking for light-tailed components, one should minimize the sum of the kurtosis is similar. This approach could be extended to kurtosis of mixed signs, but the contrast function then has less symmetry. This is not included in this article.
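The appendix's Matlab listing is the reference implementation; the following NumPy transcription is a hedged sketch of the k > 0 case (function name and sweep limit assumed), rotating each pair by the angle of equation 4.5:

```python
import numpy as np

def maxkurt(Y, n_sweeps=20, theta_min=None):
    """MaxKurt sketch for positive-kurtosis components.
    Y: (n, T) zero-mean sphered data, rotated in place."""
    n, T = Y.shape
    if theta_min is None:
        theta_min = 1e-2 / np.sqrt(T)   # the scaling suggested in the text
    for _ in range(n_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                yi, yj = Y[i], Y[j]
                num = 4.0 * np.mean(yi**3 * yj - yi * yj**3)
                den = np.mean((yi**2 - yj**2)**2 - 4.0 * yi**2 * yj**2)
                theta = 0.25 * np.arctan2(num, den)      # equation 4.5
                if abs(theta) > theta_min:
                    c, s = np.cos(theta), np.sin(theta)
                    Y[i], Y[j] = c * yi + s * yj, -s * yi + c * yj  # eq. 4.1
                    rotated = True
        if not rotated:
            break
    return Y
```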
4.1.1 Stability. What is the effect of the approximation of equal kurtosis made to derive the simple contrast φ_MK(Y)? When X = AS with S of independent components, we can at least use the stability result of Cardoso and Laheld (1996), which applies directly to this contrast. Define the normalized kurtosis as κ_i = σ_i^{−4} k(S_i). Then B = A^{−1} is a stable point of the algorithm with k > 0 if κ_i + κ_j > 0 for all pairs 1 ≤ i < j ≤ n. The same condition also holds with all signs reversed for components with negative kurtosis.
4.2 A Statistic-Based Algorithm: JADE. This section outlines the JADE algorithm (Cardoso & Souloumiac, 1993), which is specifically a statistic-based technique. We do not need to go into much detail because the general technique follows directly from the considerations of section 3.3. The JADE algorithm can be summarized as:
1. Initialization. Estimate a whitening matrix W and set Z = WX.

2. Form statistics. Estimate a maximal set {Q^Z_i} of cumulant matrices.

3. Optimize an orthogonal contrast. Find the rotation matrix V such that the cumulant matrices are as diagonal as possible, that is, solve V = argmin Σ_i Off(V†Q^Z_i V).

4. Separate. Estimate A as Â = W^{−1}V and/or estimate the components as Ŝ = Â^{−1}X = V†Z.
This is a Jacobi algorithm because the joint diagonalizer at step 3 is found by a Jacobi technique. However, the plane rotations are applied not to the data (which are summarized in the cumulant matrices) but to the cumulant matrices themselves; the algorithm updates not data but matrix-valued statistics of the data. As with MaxKurt, the Givens angle at each step can be computed in closed form, even in the case of possibly complex matrices (Cardoso & Souloumiac, 1993). The explicit expression for the Givens angles is not particularly enlightening and is not reported here. (The interested reader is referred to Cardoso & Souloumiac, 1993, and may request a Matlab implementation from the author.)
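For readers who want to experiment anyway, here is a NumPy sketch of a Jacobi joint diagonalizer; the closed-form Givens angle below follows the construction of Cardoso and Souloumiac (1993) as we understand it and should be taken as an illustrative assumption, not the author's reference code:

```python
import numpy as np

def joint_diagonalize(Qs, n, theta_min=1e-8, n_sweeps=100):
    """Find an orthonormal V making the symmetric n x n matrices in Qs
    as diagonal as possible (Jacobi sweeps over all pairs)."""
    Q = [q.copy() for q in Qs]
    V = np.eye(n)
    for _ in range(n_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # collect the (p, q) statistics of every matrix in the set
                g = np.array([[A[p, p] - A[q, q], A[p, q] + A[q, p]]
                              for A in Q])
                ton = g[:, 0] @ g[:, 0] - g[:, 1] @ g[:, 1]
                toff = 2.0 * (g[:, 0] @ g[:, 1])
                # closed-form angle minimizing the joint off-diagonal norm
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                if abs(theta) > theta_min:
                    c, s = np.cos(theta), np.sin(theta)
                    G = np.array([[c, s], [-s, c]])
                    for A in Q:      # rotate rows and columns p, q
                        A[[p, q], :] = G @ A[[p, q], :]
                        A[:, [p, q]] = A[:, [p, q]] @ G.T
                    V[:, [p, q]] = V[:, [p, q]] @ G.T
                    rotated = True
        if not rotated:
            break
    return V
```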
A key issue is the selection of the cumulant matrices to be involved in the estimation. As explained in section 3.2, the joint diagonalization criterion Σ_i Off(V†Q^Z_i V) is made identical to the contrast function, equation 3.11, by using a maximal set of cumulant matrices. This is a bit surprising but very fortunate. We do not know of any other way of a priori selecting cumulant matrices that would offer such a property (but see the next section). In any case, it guarantees equivariant estimates because the algorithm, although operating on statistics of the sphered data, also implicitly optimizes a function of Y = V†Z only.

Before proceeding, we note that true cumulant matrices can be exactly jointly diagonalized when the model holds, but this is no longer the case when we process real data. First, only sample statistics are available; second, the model X = AS with independent entries in S cannot be expected to hold accurately in general. This is another reason that it is important to select cumulant matrices such that Σ_i Off(V†Q^Z_i V) is a contrast function. In this case, the impossibility of an exact joint diagonalization corresponds to the impossibility of finding Y = BX with independent entries. Making a maximal set of cumulant matrices as diagonal as possible coincides with making the entries of Y as independent as possible as measured by (the sample version of) criterion 3.11.
There are several options for estimating a maximal set of cumulant matrices. Recall that such a set is defined as {Q_Z(M_i) | i = 1, ..., n²} where {M_i | i = 1, ..., n²} is any basis for the n²-dimensional linear space of n × n matrices. A canonical basis for this space is {e_pe_q† | 1 ≤ p, q ≤ n}, where e_p is a column vector with a 1 in pth position and 0's elsewhere. It is readily checked that

[Q_Z(e_pe_q†)]_{ij} = Cum(Z_i, Z_j, Z_p, Z_q). (4.6)

In other words, the entries of the cumulant matrices for the canonical basis are just the cumulants of Z. A better choice is to consider a symmetric/skew-symmetric basis. Denote M^{pq} an n × n matrix defined as follows: M^{pq} = e_pe_p† if p = q, M^{pq} = 2^{−1/2}(e_pe_q† + e_qe_p†) if p < q, and M^{pq} = 2^{−1/2}(e_pe_q† − e_qe_p†) if p > q. This is an orthonormal basis of R^{n×n}. We note that, because of the symmetries of the cumulants, Q_Z(e_pe_q†) = Q_Z(e_qe_p†), so that Q_Z(M^{pq}) = 2^{1/2} Q_Z(e_pe_q†) if p < q and Q_Z(M^{pq}) = 0 if p > q. It follows that the cumulant matrices Q_Z(M^{pq}) for p > q do not even need to be computed. Being identically zero, they do not enter in the joint diagonalization criterion. It is therefore sufficient to estimate and to diagonalize n + n(n − 1)/2 (symmetric) cumulant matrices.
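A sketch of this estimation step (reusing the hypothetical cumulant_matrix estimator of equation 3.16 from the earlier sketch):

```python
import numpy as np

def maximal_symmetric_set(Z, cumulant_matrix):
    """The n + n(n-1)/2 potentially nonzero cumulant matrices of the
    symmetric basis M^{pq}; the skew-symmetric basis matrices yield
    identically zero cumulant matrices and are skipped."""
    T, n = Z.shape
    Qs = []
    for p in range(n):
        Mpp = np.zeros((n, n))
        Mpp[p, p] = 1.0                        # M^{pp} = e_p e_p†
        Qs.append(cumulant_matrix(Z, Mpp))
        for q in range(p):
            Mpq = np.zeros((n, n))
            Mpq[p, q] = Mpq[q, p] = 2 ** -0.5  # symmetric basis matrix
            Qs.append(cumulant_matrix(Z, Mpq))
    return Qs
```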
There is another idea to reduce the size of the statistics needed to represent exhaustively the fourth-order information. It is, however, applicable only when the model X = AS holds. In this case, the cumulant matrices do have the structure shown at equation 3.18, and their sample estimates are close to it for large enough T. Then the linear mapping M → Q_Z(M) has rank n (more precisely, its rank is equal to the number of components with nonzero kurtosis) because there are n linear degrees of freedom for matrices in the form U Δ U†, namely, the n diagonal entries of Δ. From this fact and from the symmetries of the cumulants, it follows that there exist n eigenmatrices E_1, ..., E_n, which are orthonormal and satisfy Q_Z(E_i) = µ_i E_i, where the scalar µ_i is the corresponding eigenvalue. These matrices E_1, ..., E_n span the range of the mapping M → Q_Z(M), and any matrix M orthogonal to them is in the kernel, that is, Q_Z(M) = 0. This shows that all the information contained in Q_Z can be summarized by the n eigenmatrices associated with the n nonzero eigenvalues. By inserting M = u_iu_i† in expression 3.18 and using the orthonormality of the columns of U (that is, u_i†u_j = δ_{ij}), it is readily checked that a set of eigenmatrices is {E_i = u_iu_i†}.

The JADE algorithm was originally introduced as performing ICA by a joint approximate diagonalization of eigenmatrices in Cardoso and Souloumiac (1993), where we advocated the joint diagonalization of only the n most significant eigenmatrices of Q_Z as a device to reduce the computational load (even though the eigenmatrices are obtained at the extra cost of the eigendecomposition of an n² × n² array containing all the fourth-order cumulants). The number of statistics is reduced from n⁴ cumulants, or n(n + 1)/2 symmetric cumulant matrices of size n × n, to a set of n eigenmatrices of size n × n. Such a reduction is achieved at no statistical loss (at least for large T) only when the model holds. Therefore, we do not recommend reduction to eigenmatrices when processing data sets for which it is not clear a priori whether the model X = AS actually holds to good accuracy. We still refer to JADE as the process of jointly diagonalizing a maximal set of cumulant matrices, even when it is not further reduced to the n most significant eigenmatrices. It should also be pointed out that the device of truncating the full cumulant set by reduction to the most significant matrices is expected to destroy the equivariance property when the model does not hold. The next section shows how these problems can be overcome in a technique borrowing from both the data-based approach and the statistic-based approach.
4.3 A Mixed Approach: SHIBBS. In the JADE algorithm, a maximal set of cumulant matrices is computed as a way to ensure equivariance from the joint diagonalization of a fixed set of cumulant matrices. As a benefit, cumulants are computed only once in a single pass through the data set, and the Jacobi updates are performed on these statistics rather than on the whole data set. This is a good thing for data sets with a large number T of samples. On the other hand, estimating a maximal set requires O(n⁴T) operations, and its storage requires O(n⁴) memory positions. These figures can become prohibitive when looking for a large number of components. In contrast, gradient-based techniques have to store and update nT samples. This section describes a technique standing between the two extreme positions represented by the all-statistic approach and the all-data approach.
Recall that an algorithm is equivariant as soon as its operation can be expressed only in terms of the extracted components Y (Cardoso, 1995). This suggests the following technique:

1. Initialization. Select a fixed set M = {M_1, ..., M_P} of n × n matrices. Estimate a whitening matrix W and set Y = WX.

2. Estimate a rotation. Estimate the set {Q_Y(M_p) | 1 ≤ p ≤ P} of P cumulant matrices and find a joint diagonalizer V of it.

3. Update. If V is close enough to the identity transform, stop. Otherwise, rotate the data: Y ← V†Y and go to 2.

Such an algorithm is equivariant thanks to the reestimation of the cumulants of Y after updating. It is in some sense data based, since the updating in step 3 is on the data themselves. However, the rotation matrix to be applied to the data is computed in step 2 as in a statistic-based procedure.
What would be a good choice for the set M? The set of n matrices M = {e_1e_1†, ..., e_ne_n†} seems a natural choice: it is an order of magnitude smaller than the maximal set, which contains O(n²) matrices. The kth cumulant matrix in such a set is Q_Y(e_ke_k†), and its (i, j)th entry is Cum(Y_i, Y_j, Y_k, Y_k), which is just an n × n square block of cumulants of Y. We call the set of n cumulant matrices obtained in this way when k is shifted from 1 to n the set of SHIfted Blocks for Blind Separation (SHIBBS), and we use the same name for the ICA algorithm that determines the rotation V by an iterative joint diagonalization of the SHIBBS set.
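Combining the pieces, here is a minimal sketch of the SHIBBS iteration (all helper names assumed; it reuses cumulant_matrix and joint_diagonalize from the sketches above):

```python
import numpy as np

def shibbs(X, n_iter=100, eps=1e-6):
    """SHIBBS sketch: sphere X (T x n), then alternate between estimating
    the n shifted-block cumulant matrices Q_Y(e_k e_k†) and jointly
    diagonalizing them, rotating the data until V is near the identity."""
    T, n = X.shape
    Xc = X - X.mean(axis=0)
    d, E = np.linalg.eigh(Xc.T @ Xc / T)      # covariance eigendecomposition
    Y = Xc @ (E / np.sqrt(d)) @ E.T           # whitening: Y = WX
    for _ in range(n_iter):
        Qs = []
        for k in range(n):                    # the SHIBBS set: n matrices
            M = np.zeros((n, n))
            M[k, k] = 1.0                     # M = e_k e_k†
            Qs.append(cumulant_matrix(Y, M))
        V = joint_diagonalize(Qs, n, theta_min=1e-2 / np.sqrt(T))
        if np.linalg.norm(V - np.eye(n)) < eps:
            break
        Y = Y @ V                             # Y <- V† Y for row samples
    return Y
```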
Strikingly enough, the small SHIBBS set guarantees a performance identical to JADE when the model holds, for the following reason. Consider the final step of the algorithm, where Y is close to S if it holds that X = AS with S of independent components. Then the cumulant matrices Q_Y(e_pe_q†) are zero for p ≠ q because all the cross-cumulants of Y are zero. Therefore, the only nonzero cumulant matrices used in the maximal set of JADE are those corresponding to e_p = e_q, that is, precisely those included in SHIBBS. Thus the SHIBBS set actually tends to the set of "significant eigen-matrices" exhibited in the previous section. In this sense, SHIBBS implements the original program of JADE, the joint diagonalization of the significant eigen-matrices, but it does so without going through the estimation of the whole cumulant set and through the computation of its eigen-matrices.
Doesthe SHIBBSalgorithmcorrespondto the optimizationof a con-trastfunction?We cannotresortto the equivalenceof JADE andSHIBBSbecauseit is establishedonly whenthe modelholdsandwe are lookingfor a statementindependentof this later fact.Examinationof the joint di-agonalitycriterionfor theSHIBBSsetsuggeststhat theSHIBBStechniquesolvestheproblemof optimizing thecontrastfunctionφSH(Y) definedinequation3.12.As a matterof fact,theconditionfor a givenY to bea fixed
![Page 12: High-OrderContrasts for Independent Component Analysis · LETTER Communicated by Anthony Bell High-OrderContrasts for Independent Component Analysis Jean-Fran¸cois Cardoso Ecole](https://reader031.vdocuments.net/reader031/viewer/2022031007/5b8ba28d09d3f236638bb5b5/html5/thumbnails/12.jpg)
High-OrderContrastsfor IndependentComponentAnalysis 179
pointof theSHIBBSalgorithmis thatfor anypair1≤ i < j ≤ n:
∑
k
Cum(Yi, Yj, Yk, Yk) (Cum(Yi, Yi, Yk, Yk)
− Cum(Yj, Yj, Yk, Yk))
= 0, (4.7)
andwe canprovethatthis is alsothestationarityconditionof φSH(Y). Wedonot includetheproofsof thesestatements,whicharepurely technical.
4.4 Comparing Fourth-Order Orthogonal Contrasts. We have considered two approximations, φ_JADE(Y) and φ_SH(Y), to the minimum marginal entropy/mutual information contrast φ_MI(Y), which are based on fourth-order cumulants and can be optimized by Jacobi techniques. The approximation φ^{ME}_{24}(Y) proposed by Comon also belongs to this category. One may wonder about the relative statistical merits of these three approximations. The contrast φ^{ME}_{24}(Y) stems from an Edgeworth expansion for approximating φ_ME(Y), which in turn has been shown to derive from the ML principle (see section 2). Since ML estimation offers (asymptotic) optimality properties, one may be tempted to conclude to the superiority of φ^{ME}_{24}(Y). However, this is not the case, as discussed now.

First, when the ICA model holds, it can be shown that even though φ^{ME}_{24}(Y) and φ_JADE(Y) are different criteria, they have the same asymptotic performance when applied to sample statistics (Souloumiac & Cardoso, 1991). This is also true of φ_SH(Y), since we have seen that it is equivalent to JADE in this case (a more rigorous proof is possible, based on equation 4.7, but is not included).

Second, when the ICA model does not hold, the notion of identification accuracy does not make sense anymore, but one would certainly favor an orthogonal contrast reaching its minimum at a point as close as possible to the point where the "true" mutual information φ_ME(Y) is minimized. However, it seems difficult to find a simple contrast (such as those considered here) that would be a good approximation to φ_ME(Y) for any wide class of distributions of X. Note that the ML argument in favor of φ^{ME}_{24}(Y) is based on an Edgeworth expansion that is valid for "almost gaussian" distributions, those distributions that make ICA very difficult and of dubious significance: in practice, ICA should be restricted to data sets where the components show a significant amount of nongaussianity, in which case the Edgeworth expansions cannot be expected to be accurate.
There is another way than Edgeworth expansion for arriving at φ^{ME}_{24}(Y). Consider cumulant matching: the matching of the cumulants of Y to the corresponding cumulants of a hypothetical vector S with independent components. The orthogonal contrast functions φ_JADE(Y), φ_SH(Y), and φ^{ME}_{24}(Y) can be seen as matching criteria because they penalize the deviation of the cross-cumulants of Y from zero (which is the value of cross-cumulants of a vector S with independent components, indeed), and they do so under the constraint that Y is white, that is, by enforcing an exact match of the second-order cumulants of Y.

It is possible to devise an asymptotically optimal matching criterion by taking into account the variability of the sample estimates of the cumulants. Such a computation is reported in Cardoso et al. (1996) for the matching of all second- and fourth-order cumulants of complex-valued signals, but a similar computation is possible for real-valued problems. It shows that the optimal weighting of the cross-cumulants depends on the distributions of the components, so that the "flat weighting" of all the cross-cumulants, as in equation 3.10, is not the best one in general. However, in the limit of "almost gaussian" signals, the optimal weights tend to values corresponding precisely to the contrast K_24(Y|S) defined in equation 3.5. This is not unexpected and confirms that the crude cumulant expansion used in deriving equation 3.5 is sensible, though not optimal for significantly nongaussian components.
It seems from the definitions of $\phi^{\mathrm{JADE}}(Y)$, $\phi^{\mathrm{SH}}(Y)$, and $\phi^{\mathrm{ME}}_{24}(Y)$ that these different contrasts involve different types of cumulants. This is, however, an illusion because the compact definitions given above do not take into account the symmetries of the cumulants: the same cross-cumulant may be counted several times in each of these contrasts. For instance, the definition of JADE excludes the cross-cumulant Cum(Y1, Y1, Y2, Y3) but includes the cross-cumulant Cum(Y1, Y2, Y1, Y3), which is identical. Thus, in order to determine if any bit of fourth-order information is ignored by any particular contrast, a nonredundant description should be given. All the possible cross-cumulants come in four different patterns of indices: (ijkl), (iikl), (iijj), and (ijjj). Nonredundant expressions in terms of these patterns are in the form:

$$\phi[Y] = C_a \sum_{i<j<k<l} e_{ijkl} + C_b \sum_{i<k<l} \bigl(e_{iikl} + e_{kkil} + e_{llik}\bigr) + C_c \sum_{i<j} e_{iijj} + C_d \sum_{i<j} \bigl(e_{ijjj} + e_{jjji}\bigr),$$

where $e_{ijkl} \overset{\mathrm{def}}{=} \mathrm{Cum}(Y_i, Y_j, Y_k, Y_l)^2$ and the $C$'s are numerical constants. It remains to count how many times a unique cumulant appears in the redundant definitions of the three approximations to mutual information considered so far. We give only the result of this uninspiring task in Table 1, which shows that all the cross-cumulants are actually included in the three contrasts, which therefore differ only by the different scalar weights given to each particular type. It means that the three contrasts essentially do the same thing. In particular, when the number n of components is large enough, the number of cross-cumulants of type [ijkl] (all indices distinct) grows as $O(n^4)$, while the number of other types grows as $O(n^3)$ at most. Therefore, the [ijkl] type outnumbers all the other types for large n: one may conjecture the equivalence of the three contrasts in this limit.
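To make the combinatorial claim concrete, a small sketch (ours; n = 30 is an arbitrary illustrative value) tallies the number of distinct squared cross-cumulants of each pattern, matching the index sets of the sums above:

n = 30;
n_ijkl = nchoosek(n,4);          % all indices distinct: grows as O(n^4)
n_iikl = 3*nchoosek(n,3);        % (iikl), (kkil), (llik): grows as O(n^3)
n_iijj = nchoosek(n,2);          % (iijj): grows as O(n^2)
n_ijjj = 2*nchoosek(n,2);        % (ijjj), (jjji): grows as O(n^2)
[n_ijkl n_iikl n_iijj n_ijjj]    % returns 27405 12180 435 870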
Table 1: Number of Times a Cross-Cumulant of a Given Type Appears in a Given Contrast.

Constants    Ca      Cb      Cc      Cd
Pattern      ijkl    iikl    iijj    ijjj
-----------------------------------------
Comon ICA    24      12      6       4
JADE         24      10      4       2
SHIBBS       24      12      4       4
Unfortunately, it seems difficult to draw more conclusions. For instance, we have mentioned the asymptotic equivalence between Comon's contrast and the JADE contrast for any n, but it does not reveal itself directly in the weight table.
5 A Comparison on Biomedical Data
The performance of the algorithms presented above is illustrated using the averaged event-related potential (ERP) data recorded and processed by Makeig and coworkers. A detailed account of their analysis is in Makeig, Bell, Jung, and Sejnowski (1997). For our comparison, we use the data set and the "logistic ICA" algorithm provided with version 3.1 of Makeig's ICA toolbox.³ The data set contains 624 data points of averaged ERP sampled from 14 EEG electrodes. The implementation of the logistic ICA provided in the toolbox is somewhat intermediate between equation 1.1 and its off-line counterpart: H(Y) is averaged through subblocks of the data set. The nonlinear function is taken to be $\psi(y) = \frac{2}{1+e^{-y}} - 1 = \tanh(y/2)$. This is minus the log-derivative $\psi(y) = -\frac{r'(y)}{r(y)}$ of the density $r(y) = \beta \frac{1}{\cosh(y/2)}$ ($\beta$ is a normalization constant). Therefore, this method maximizes over A the likelihood of model X = AS under the assumption that S has independent components with densities equal to $\beta \frac{1}{\cosh(y/2)}$.
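For concreteness, here is a minimal sketch (ours) of the multiplicative update of equation 1.1 with this nonlinearity, assuming the common choice $H(y) = \psi(y)y^T - I$; the toolbox implementation averages over subblocks and may differ in its details:

T = 1000; mu = 0.01;               % illustrative sample size and step
S = randn(2,T).^3;                 % toy super-gaussian sources
A = [1 .5; .5 1];                  % toy mixing matrix
X = A*S;
B = eye(2);                        % separating matrix estimate
psi = @(y) tanh(y/2);              % the logistic score described above
for t = 1:T
  y = B*X(:,t);
  H = psi(y)*y' - eye(2);          % assumed form of H(y)
  B = (eye(2) - mu*H)*B;           % multiplicative update, equation 1.1
end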
Figure 1 shows the components $Y^{\mathrm{JADE}}$ produced by JADE (first column) and the components $Y^{\mathrm{LICA}}$ produced by the logistic ICA included in Makeig's toolbox, which was run with all the default options; the third column shows the difference between the components at the same scale. This direct comparison is made possible with the following postprocessing: the components $Y^{\mathrm{LICA}}$ were normalized to have unit variance and were sorted by increasing values of kurtosis. The components $Y^{\mathrm{JADE}}$ have unit variance by construction; they were sorted and their signs were changed to match $Y^{\mathrm{LICA}}$. Figure 1 shows that $Y^{\mathrm{JADE}}$ and $Y^{\mathrm{LICA}}$ essentially agree on 9 of 14 components.
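This postprocessing is simple enough to be spelled out. A sketch (ours), for components stored as the rows of Y and a same-size reference set Yref (a hypothetical variable):

T = size(Y,2);
Y = Y - mean(Y,2)*ones(1,T);          % zero-mean rows
Y = Y ./ (std(Y,1,2)*ones(1,T));      % unit-variance rows
kurt = mean(Y.^4, 2) - 3;             % kurtosis of unit-variance rows
[dummy, order] = sort(kurt);          % increasing kurtosis
Y = Y(order,:);
sgn = sign(sum(Y.*Yref, 2));          % flip signs to match the reference
Y = (sgn*ones(1,T)).*Y;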
³ Available from http://www.cnl.salk.edu/~scott/.
Figure 1: The source signals estimated by JADE and the logistic ICA and their differences (columns: JADE, Logistic ICA, Difference).
Another illustration of this fact is given by the first row of Figure 2. The left panel shows the magnitude $|C_{ij}|$ of the entries of the transfer matrix C such that $C\,Y^{\mathrm{LICA}} = Y^{\mathrm{JADE}}$. This matrix was computed after the postprocessing of the components described in the previous paragraph: it should be the identity matrix if the two methods agreed, even if only up to scales, signs, and permutations. The figure shows a strong diagonal structure in the northeast block, while the disagreement between the two methods is apparent in the gray zone of the southwest block. The right panel shows the kurtosis $k(Y_i^{\mathrm{JADE}})$ plotted against the kurtosis $k(Y_i^{\mathrm{LICA}})$. A key observation is that the two methods do agree about the most kurtic components; these also are the components where the time structure is the most visible. In other words, the two methods essentially agree wherever a human eye finds the most visible structures. Figure 2 also shows the results of SHIBBS and MaxKurt. The transfer matrix C for MaxKurt is seen to be more diagonal than the transfer matrix for JADE, while the transfer for SHIBBS is less diagonal. Thus, the logistic ICA and MaxKurt agree more on this data set. Another figure (not included) shows that JADE and SHIBBS are in very close agreement over all components.
These results are very encouraging because they show that various ICA algorithms agree wherever they find structure on this particular data set. This is very much in support of the ICA approach to the processing of signals for which it is not clear that the model holds. It leaves open the question of interpreting the disagreement between the various contrast functions in the swamp of the low-kurtosis domain.
It turns out that the disagreement between the methods on this data set is, in our view, an illusion. Consider the eigenvalues $\lambda_1, \ldots, \lambda_n$ of the covariance matrix $R_X$ of the observations. They are plotted on a dB scale (that is, $10\log_{10}\lambda_i$) in Figure 3. The two least significant eigenvalues stand rather clearly below the strongest ones, with a gap of 5.5 dB. We take this as an indication that one should look for 12 linear components in this data set rather than 14, as in the previous experiments. The result is rather striking: by running JADE and the logistic ICA on the first 12 principal components, an excellent agreement is found over all the 12 extracted components, as seen in Figure 4. This observation also holds for MaxKurt and SHIBBS, as shown by Figure 5.
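A sketch (ours) of this dimension screening, assuming the observations X are zero-mean:

[n, T] = size(X);
[U, D] = eig(X*X'/T);                  % covariance eigendecomposition
[lambda, idx] = sort(diag(D), 'descend');
disp(10*log10(lambda'));               % eigenvalues in dB, as in Figure 3
m = 12;                                % dimension suggested by the 5.5 dB gap
X12 = U(:, idx(1:m))' * X;             % first 12 principal components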
Table 2 lists the number of floating-point operations (as returned by Matlab) and the CPU time required to run the four algorithms on a SPARC 2 workstation. The MaxKurt technique was clearly the fastest here; however, it was applicable only because we were looking for components with positive kurtosis. The same is true for the version of logistic ICA considered in this experiment. It is not true of JADE or SHIBBS, which are consistent as soon as at most one source has a vanishing kurtosis, regardless of the sign of the nonzero kurtosis (Cardoso & Souloumiac, 1993). The logistic ICA required only about 50% more time than JADE. The SHIBBS algorithm is
Figure 2: (Left column) Absolute values of the coefficients $|C_{ij}|$ of a matrix relating the signals obtained by two different methods. A perfect agreement would be for C = I: deviation from diagonal indicates a disagreement. The signals are sorted by kurtosis, showing a good agreement for high kurtosis. (Right column) Comparing the kurtosis of the sources estimated by two different methods. From top to bottom: JADE versus logistic ICA, SHIBBS versus logistic ICA, and maxkurt versus logistic ICA.
Figure 3: Eigenvalues of the covariance matrix $R_X$ of the data in dB (i.e., $10\log_{10}(\lambda_i)$).
Table 2: Number of Floating-Point Operations and CPU Time.

                 14 Components            12 Components
Method           Flops      CPU Secs.     Flops      CPU Secs.
---------------------------------------------------------------
Logistic ICA     5.05e+07   3.98          3.51e+07   3.54
JADE             4.00e+07   2.55          2.19e+07   1.69
SHIBBS           5.61e+07   4.92          2.47e+07   2.35
MaxKurt          1.19e+07   1.09          5.91e+06   0.54
slower than JADE here because the data set is not large enough to give it an edge. These remarks are even more marked when comparing the figures obtained in the extraction of 12 components. It should be clear that these figures do not prove much because they are representative of only a particular data set and of particular implementations of the algorithms, as well as of the various parameters used for tuning the algorithms. However, they do disprove the claim that algebraic-cumulant methods are of no practical value.
6 Summary and Conclusions
The definitions of classic entropic contrasts for ICA can all be understood from an ML perspective. An approximation of the Kullback-Leibler divergence yields cumulant-based approximations of these contrasts.
Figure 4: The 12 source signals estimated by JADE and a logistic ICA out of the first 12 principal components of the original data (columns: JADE, Logistic ICA, Difference).
Figure 5: Same setting as for Figure 2, but the processing is restricted to the first 12 principal components, showing a better agreement among all the methods.
In the orthogonal approach to ICA, where decorrelation is enforced, the cumulant-based contrasts can be optimized with Jacobi techniques, operating on either the data or statistics of the data, namely, cumulant matrices. The structure of the cumulants in the ICA model can be easily exploited by algebraic identification techniques, but the simple versions of these techniques are not equivariant. One possibility for overcoming this problem is to exploit the joint algebraic structure of several cumulant matrices. In particular, the JADE algorithm bridges the gap between contrast-based approaches and algebraic techniques because the JADE objective is both a contrast function and the expression of the eigenstructure of the cumulants. More generally, the algebraic nature of the cumulants can be exploited to ease the optimization of cumulant-based contrast functions by Jacobi techniques. This can be done in a data-based or a statistic-based mode. The latter has an increasing relative advantage as the number of available samples increases, but it becomes impractical for a large number n of components since the number of fourth-order cumulants grows as $O(n^4)$. This can be overcome to a certain extent by resorting to SHIBBS, which iteratively recomputes a number $O(n^3)$ of cumulants.
An important objective of this article was to combat the prejudice that cumulant-based algebraic methods are impractical. We have shown that they compare very well to state-of-the-art implementations of adaptive techniques on a real data set.
More extensive comparisons remain to be done involving variants of the ideas presented here. A technique like JADE is likely to choke on a very large number of components, but the SHIBBS version is not as memory demanding. Similarly, the MaxKurt method can be extended to deal with components with mixed kurtosis signs. In this respect, it is worth underlining the analogy between the MaxKurt update and the relative gradient update, equation 1.1, when function H(·) is in the form of equation 1.5.
A comment on tuning the algorithms: in order to code an all-purpose ICA algorithm based on gradient descent, it is necessary to devise a smart learning schedule. This is usually based on heuristics and requires the tuning of some parameters. In contrast, Jacobi algorithms do not need to be tuned in their basic versions. However, one may think of improving on the regular Jacobi sweep through all the pairs in prespecified order by devising more sophisticated updating schedules. Heuristics would be needed then, as in the case of gradient descent methods.
We conclude with a negative point about the fourth-order techniques described in this article. By nature, they optimize contrasts corresponding somehow to using linear-cubic nonlinear functions in gradient-based algorithms. Therefore, they lack the flexibility of adapting the activation functions to the distributions of the underlying components, as one would ideally do and as is possible in algorithms like equation 1.1. Even worse, this very type of nonlinear function (linear-cubic) has one major drawback: potential sensitivity to outliers. This effect did not manifest itself in the
examples presented in this article, but it could indeed show up in other data sets.
Appendix: Derivation and Implementation of MaxKurt
A.1 Givens Angles for MaxKurt. An explicit form of the MaxKurt contrast as a function of the Givens angles is derived. For conciseness, we denote $[ijkl] = Q^Y_{ijkl}$ and we define

$$a_{ij} = \frac{[iiii] + [jjjj] - 6[iijj]}{4}, \qquad b_{ij} = [iiij] - [jjji], \qquad \lambda_{ij} = \sqrt{a_{ij}^2 + b_{ij}^2}. \qquad (\mathrm{A.1})$$
The sum of the kurtosis for the pair of variables $Y_i$ and $Y_j$ after they have been rotated by an angle $\theta$ depends on $\theta$ as follows (where we set $c = \cos\theta$ and $s = \sin\theta$):

$$
\begin{aligned}
k(\cos(\theta)&Y_i + \sin(\theta)Y_j) + k(-\sin(\theta)Y_i + \cos(\theta)Y_j) &(\mathrm{A.2})\\
&= c^4[iiii] + 4c^3s[iiij] + 6c^2s^2[iijj] + 4cs^3[ijjj] + s^4[jjjj] &(\mathrm{A.3})\\
&\quad + s^4[iiii] - 4s^3c[iiij] + 6s^2c^2[iijj] - 4sc^3[ijjj] + c^4[jjjj] &(\mathrm{A.4})\\
&= (c^4+s^4)([iiii]+[jjjj]) + 12c^2s^2[iijj] + 4cs(c^2-s^2)([iiij]-[jjji]) &(\mathrm{A.5})\\
&\overset{c}{=} -8c^2s^2\,\frac{[iiii]+[jjjj]-6[iijj]}{4} + 4cs(c^2-s^2)([iiij]-[jjji]) &(\mathrm{A.6})\\
&= -2\sin^2(2\theta)\,a_{ij} + 2\sin(2\theta)\cos(2\theta)\,b_{ij} \overset{c}{=} \cos(4\theta)\,a_{ij} + \sin(4\theta)\,b_{ij} &(\mathrm{A.7})\\
&= \lambda_{ij}\bigl(\cos(4\theta)\cos(4\Omega_{ij}) + \sin(4\theta)\sin(4\Omega_{ij})\bigr) = \lambda_{ij}\cos(4(\theta - \Omega_{ij})), &(\mathrm{A.8})
\end{aligned}
$$

where $\overset{c}{=}$ denotes equality up to a term that does not depend on $\theta$ and the angle $4\Omega_{ij}$ is defined by

$$\cos(4\Omega_{ij}) = \frac{a_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}, \qquad \sin(4\Omega_{ij}) = \frac{b_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}. \qquad (\mathrm{A.9})$$
This is obtained by using the multilinearity and the symmetries of the cumulants at lines A.3 and A.4, followed by elementary trigonometric identities.
If $Y_i$ and $Y_j$ are zero-mean and sphered, $\mathrm{E}Y_iY_j = \delta_{ij}$, we have $[iiii] = Q^Y_{iiii} = \mathrm{E}Y_i^4 - 3\mathrm{E}^2Y_i^2 = \mathrm{E}Y_i^4 - 3$ and, for $i \neq j$, $[iiij] = Q^Y_{iiij} = \mathrm{E}Y_i^3Y_j$ as well as $[iijj] = Q^Y_{iijj} = \mathrm{E}Y_i^2Y_j^2 - 1$. Hence an alternate expression for $a_{ij}$ and $b_{ij}$ is:

$$a_{ij} = \frac{1}{4}\,\mathrm{E}\bigl(Y_i^4 + Y_j^4 - 6Y_i^2Y_j^2\bigr), \qquad b_{ij} = \mathrm{E}\bigl(Y_i^3Y_j - Y_iY_j^3\bigr). \qquad (\mathrm{A.10})$$
It may be interesting to note that all the moments required to determine the Givens angle for a given pair (i, j) can be expressed in terms of the two
variables $\xi_{ij} = Y_iY_j$ and $\eta_{ij} = Y_i^2 - Y_j^2$. Indeed, it is easily checked that for a zero-mean sphered pair $(Y_i, Y_j)$, one has

$$a_{ij} = \frac{1}{4}\,\mathrm{E}\bigl(\eta_{ij}^2 - 4\xi_{ij}^2\bigr), \qquad b_{ij} = \mathrm{E}\bigl(\eta_{ij}\,\xi_{ij}\bigr). \qquad (\mathrm{A.11})$$
A.2 A Simple Matlab Implementation of MaxKurt. A Matlab implementation could be as follows, where we have tried to maximize readability but not numerical efficiency:
function Y = maxkurt(X)
[n T] = size(X) ;
Y = X - mean(X,2)*ones(1,T) ;       % Remove the mean
Y = inv(sqrtm(Y*Y'/T))*Y ;          % Sphere the centered data
encore = 1 ;                        % Go for first sweep
while encore, encore = 0 ;
  for p = 1:n-1,                    % These two loops go
    for q = p+1:n,                  % through all pairs
      xi  = Y(p,:).*Y(q,:) ;
      eta = Y(p,:).*Y(p,:) - Y(q,:).*Y(q,:) ;
      Omega = atan2( 4*(eta*xi'), eta*eta' - 4*(xi*xi') ) ; % 4x Givens angle
      if abs(Omega) > 0.1/sqrt(T)   % A `statistically small' angle
        encore = 1 ;                % This will not be the last sweep
        c = cos(Omega/4) ;
        s = sin(Omega/4) ;
        Y([p q],:) = [ c s ; -s c ] * Y([p q],:) ; % Plane rotation
      end
    end
  end
end
return
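As a quick synthetic check (ours, not from the article), the function can be exercised on a toy mixture of positive-kurtosis sources, the case where MaxKurt applies:

T = 5000;
S = randn(2,T).^3 ;                % two super-gaussian (kurtic) sources
A = [1 0.6 ; -0.4 1] ;             % an arbitrary mixing matrix
Y = maxkurt(A*S) ;                 % Y should recover S up to scale,
                                   % sign, and permutation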
Acknowledgments
We express our thanks to Scott Makeig and to his coworkers for releasing to the public domain some of their codes and the ERP data set. We also thank the anonymous reviewers whose constructive comments greatly helped in revising the first version of this article.
References
Amari, S.-I. (1996). Neural learning in structured parameter spaces: Natural Riemannian gradient. In Proc. NIPS.

Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1004–1034.

Cardoso, J.-F. (1992). Fourth-order cumulant structure forcing. Application to blind array processing. In Proc. 6th SSAP workshop on statistical signal and array processing (pp. 136–139).

Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc. EUSIPCO (pp. 776–779). Edinburgh.

Cardoso, J.-F. (1995). The equivariant approach to source separation. In Proc. NOLTA (pp. 55–60).

Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proc. of the IEEE. Special issue on blind identification and estimation.

Cardoso, J.-F., Bose, S., & Friedlander, B. (1996). On optimal source separation based on second and fourth order cumulants. In Proc. IEEE Workshop on SSAP. Corfu, Greece.

Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Sig. Proc., 44, 3017–3030.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 140, 362–370.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.

Comon, P. (1997). Cumulant tensors. In Proc. IEEE SP Workshop on HOS. Banff, Canada.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

De Lathauwer, L., De Moor, B., & Vandewalle, J. (1996). Independent component analysis based on higher-order statistics only. In Proc. IEEE SSAP Workshop (pp. 356–359).

Gaeta, M., & Lacoume, J. L. (1990). Source separation without a priori knowledge: The maximum likelihood solution. In Proc. EUSIPCO (pp. 621–624).

Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins University Press.

MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript.

Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Nat. Acad. Sci. USA, 94, 10979–10984.

McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.

Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 19–46.

Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP'97 (pp. 3617–3620).

Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. NETWORK, 5, 565–581.

Obradovic, D., & Deco, G. (1997). Unsupervised learning for blind source separation: An information-theoretic approach. In Proc. ICASSP (pp. 127–130).

Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (Hong Kong).

Pham, D.-T. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. on Sig. Proc., 44, 2768–2779.

Pham, D.-T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Sig. Proc., 45, 1712–1725.

Souloumiac, A., & Cardoso, J.-F. (1991). Comparaison de méthodes de séparation de sources. In Proc. GRETSI (pp. 661–664). Juan les Pins, France.
Received February 7, 1998; accepted June 25, 1998.