
Neural Networks 21 (2008) 170–181, www.elsevier.com/locate/neunet

2008 Special Issue

A regularized kernel CCA contrast function for ICA

Carlos Alzate∗, Johan A.K. Suykens

Department of Electrical Engineering ESAT-SCD-SISTA, Katholieke Universiteit Leuven, B-3001 Leuven, Belgium

Received 27 August 2007; received in revised form 30 November 2007; accepted 28 December 2007

Abstract

A new kernel based contrast function for independent component analysis (ICA) is proposed. This criterion corresponds to a regularized correlation measure in high dimensional feature spaces induced by kernels. The formulation is a multivariate extension of the least squares support vector machine (LS-SVM) formulation to kernel canonical correlation analysis (CCA). The regularization is incorporated naturally in the primal problem, leading to a dual generalized eigenvalue problem. The smallest generalized eigenvalue is a measure of correlation in the feature space and a measure of independence in the input space. Due to the primal-dual nature of the proposed approach, the measure of independence can also be extended to out-of-sample points, which is important for model selection ensuring statistical reliability of the proposed measure. Computational issues due to the large size of the matrices involved in the eigendecomposition are tackled via the incomplete Cholesky factorization. Simulations with toy data, images and speech signals show improved performance on the estimation of independent components compared with existing kernel-based contrast functions.
© 2008 Elsevier Ltd. All rights reserved.

Keywords: Independent component analysis; Kernel canonical correlation analysis; Least squares support vector machines

1. Introduction

Canonical correlation analysis (CCA) finds linear relationships between two sets of variables (Hotelling, 1936). The objective is to find basis vectors on which the projected variables are maximally correlated. The CCA problem can be solved by means of a generalized eigenvalue problem. Generalized CCA is an extension of CCA to more than two sets of variables (Kettenring, 1971). Kernel CCA, as a nonlinear extension of CCA, first maps the data into a high dimensional feature space induced by a kernel and then performs linear CCA. In this way, nonlinear relationships can be found (Akaho, 2001; Bach & Jordan, 2002; Lai & Fyfe, 2000; Melzer, Reiter, & Bischof, 2001). Without regularization, kernel CCA does not characterize the canonical correlation of the variables, leading to a generalized eigenvalue problem with all eigenvalues equal to ±1 and to ill-conditioning. Several ad-hoc regularization schemes have been proposed in order to obtain useful estimates of the correlation measure (Bach & Jordan, 2002; Gretton, Herbrich, Smola, Bousquet, & Schölkopf, 2005; Lai & Fyfe, 2000).

An abbreviated version of some portions of this article appeared in Alzate and Suykens (2007) as part of the IJCNN 2007 Conference Proceedings, published under IEEE copyright.

* Corresponding author.
E-mail addresses: [email protected] (C. Alzate), [email protected] (J.A.K. Suykens).

0893-6080/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.neunet.2007.12.047

Independent component analysis (ICA) corresponds to a class of methods with the objective of recovering underlying latent factors present in the data. The observed variables are linear mixtures of the components, which are assumed to be mutually independent (Cardoso, 1999; Comon, 1994; Hyvärinen, Karhunen, & Oja, 2001; Hyvärinen & Pajunen, 1999). Different approaches to ICA have been proposed in the context of neural networks (Jutten & Herault, 1991), tensor-based algorithms (Comon, 1994) and contrast functions (Hyvärinen et al., 2001). There is a wide range of applications of ICA such as exploratory data analysis, signal processing, feature extraction, noise reduction and blind source separation. A link between independence in the input space and kernel canonical correlation using universal kernels was presented in Bach and Jordan (2002). This link is called the F-correlation and establishes that if two variables in the feature space induced by an RBF kernel are completely uncorrelated, then the original variables are independent in the input space. In the case of more than two variables, the F-correlation characterizes pairwise independence. Therefore, independent components can be estimated by minimizing a measure of correlation in the feature space. Different links have also been studied in the literature, such as the constrained covariance (COCO) (Gretton et al., 2005), the kernel generalized variance (KGV) (Bach & Jordan, 2002) and the kernel mutual information (KMI) (Gretton et al., 2005).

In this paper, a kernel regularized correlation measure (KRC) is proposed. The KRC is an extended version of the method introduced in Alzate and Suykens (2007). This measure can be used as a contrast function for ICA. The formulation arises from a clear optimization framework where regularization is incorporated naturally in the primal problem, leading to a generalized eigenvalue problem as the dual. The proposed method is a multivariate extension of the least squares support vector machines (LS-SVM) approach to kernel CCA (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002). The KRC can also be extended to out-of-sample data, which is important for parameter selection.

This paper is organized as follows. In Section 2, we review linear CCA, kernel CCA and the LS-SVM interpretation of kernel CCA. In Section 3, we present a multivariate kernel CCA extension to more than two sets of variables. Section 4 contains a review of the link between kernel CCA and ICA together with the proposed contrast function. Section 5 discusses an approximation of the contrast function using the incomplete Cholesky decomposition together with an algorithm. In Section 6, we propose a model selection criterion. Section 7 contains the empirical results and in Section 8 we give conclusions.

2. LS-SVMs and kernel CCA

Least squares support vector machine (LS-SVM) formulations to different problems were discussed in Suykens et al. (2002). This class of kernel machines emphasizes primal-dual interpretations in the context of constrained optimization problems. LS-SVMs have been studied with respect to kernel Fisher discriminant analysis, kernel ridge regression, kernel PCA, kernel canonical correlation analysis, kernel partial least-squares, recurrent networks and control (Suykens et al., 2002). In this section we discuss LS-SVM formulations to kernel CCA which are relevant for the sequel of the paper.

2.1. Linear CCA

The problem of CCA consists of measuring the linear relationship between two sets of variables (Gittins, 1985; Hotelling, 1936). Let $\{x^{(1)}_i\}_{i=1}^N$ and $\{x^{(2)}_i\}_{i=1}^N$ denote $N$ observations of $x^{(1)}$ and $x^{(2)}$. Assume that the observations have zero mean. Then, the objective of CCA is to find basis vectors $w^{(1)}$ and $w^{(2)}$ such that the projected variables $w^{(1)T}x^{(1)}$ and $w^{(2)T}x^{(2)}$ are maximally correlated:

$$\max_{w^{(1)},w^{(2)}} \; \rho = \frac{w^{(1)T} C_{12} w^{(2)}}{\sqrt{w^{(1)T} C_{11} w^{(1)}}\sqrt{w^{(2)T} C_{22} w^{(2)}}} \tag{1}$$

where $C_{11} = (1/N)\sum_{k=1}^N x^{(1)}_k x^{(1)T}_k$, $C_{22} = (1/N)\sum_{k=1}^N x^{(2)}_k x^{(2)T}_k$ and $C_{12} = (1/N)\sum_{k=1}^N x^{(1)}_k x^{(2)T}_k$. This is typically formulated as the optimization problem:

$$\max_{w^{(1)},w^{(2)}} \; w^{(1)T} C_{12} w^{(2)} \tag{2}$$

such that

$$w^{(1)T} C_{11} w^{(1)} = 1, \quad w^{(2)T} C_{22} w^{(2)} = 1 \tag{3}$$

which leads to the generalized eigenvalue problem:

$$\begin{bmatrix} 0 & C_{12} \\ C_{21} & 0 \end{bmatrix}\begin{bmatrix} w^{(1)} \\ w^{(2)} \end{bmatrix} = \rho \begin{bmatrix} C_{11} & 0 \\ 0 & C_{22} \end{bmatrix}\begin{bmatrix} w^{(1)} \\ w^{(2)} \end{bmatrix}$$

where $\rho$ is the correlation coefficient. This problem has $d_1 + d_2$ eigenvalues $\rho_1, -\rho_1, \ldots, \rho_p, -\rho_p, 0, \ldots, 0$, where $p = \min(d_1, d_2)$.

An alternative scheme is to find the smallest eigenvalue of the following generalized eigenvalue problem:

$$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}\begin{bmatrix} w^{(1)} \\ w^{(2)} \end{bmatrix} = (1+\rho) \begin{bmatrix} C_{11} & 0 \\ 0 & C_{22} \end{bmatrix}\begin{bmatrix} w^{(1)} \\ w^{(2)} \end{bmatrix}$$

which provides a bounded correlation measure ranging from zero to one.
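As an illustrative aside (not from the paper), the bounded formulation above maps directly onto a standard symmetric-definite eigensolver. The following sketch, written in NumPy/SciPy with variable names of our own choosing, builds the covariance blocks from two zero-mean samples and recovers the largest canonical correlation as one minus the smallest generalized eigenvalue.

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca_bounded(X1, X2):
    """Largest canonical correlation via the bounded generalized eigenvalue problem.

    X1: (N, d1), X2: (N, d2) zero-mean observations. Solves
    [C11 C12; C21 C22] w = (1 + rho) [C11 0; 0 C22] w and returns
    rho_max = 1 - smallest eigenvalue, plus the corresponding basis vectors.
    """
    N, d1 = X1.shape
    C11, C22, C12 = X1.T @ X1 / N, X2.T @ X2 / N, X1.T @ X2 / N
    A = np.block([[C11, C12], [C12.T, C22]])
    B = np.block([[C11, np.zeros((d1, X2.shape[1]))],
                  [np.zeros((X2.shape[1], d1)), C22]])
    vals, vecs = eigh(A, B)          # symmetric-definite pencil, eigenvalues ascending
    return 1.0 - vals[0], vecs[:d1, 0], vecs[d1:, 0]

# toy check: two noisy linear views of the same latent signal
rng = np.random.default_rng(0)
s = rng.standard_normal((500, 1))
X1 = np.hstack([s, rng.standard_normal((500, 2))]); X1 -= X1.mean(0)
X2 = np.hstack([-2.0 * s + 0.1 * rng.standard_normal((500, 1)),
                rng.standard_normal((500, 1))]); X2 -= X2.mean(0)
rho, w1, w2 = linear_cca_bounded(X1, X2)
print(round(rho, 3))                 # close to 1 for strongly correlated views
```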

2.2. Kernel CCA

A nonlinear extension of CCA using kernels was introduced in Lai and Fyfe (2000) and Bach and Jordan (2002). The data are first embedded into a high dimensional feature space induced by a kernel and then linear CCA is applied. This way, nonlinear relations between variables can be found. Consider the feature maps $\varphi^{(1)}(\cdot): \mathbb{R}^{d_1} \to \mathbb{R}^{n_1}$ and $\varphi^{(2)}(\cdot): \mathbb{R}^{d_2} \to \mathbb{R}^{n_2}$ to high dimensional feature spaces of dimension $n_1$ and $n_2$ respectively. The centred feature matrices $\Phi^{(1)}_c \in \mathbb{R}^{N \times n_1}$, $\Phi^{(2)}_c \in \mathbb{R}^{N \times n_2}$ become

$$\Phi^{(1)}_c = [\varphi^{(1)}(x^{(1)}_1)^T - \mu^T_{\varphi^{(1)}}; \ldots; \varphi^{(1)}(x^{(1)}_N)^T - \mu^T_{\varphi^{(1)}}]$$
$$\Phi^{(2)}_c = [\varphi^{(2)}(x^{(2)}_1)^T - \mu^T_{\varphi^{(2)}}; \ldots; \varphi^{(2)}(x^{(2)}_N)^T - \mu^T_{\varphi^{(2)}}],$$

where $\mu_{\varphi^{(l)}} = (1/N)\sum_{i=1}^N \varphi^{(l)}(x^{(l)}_i)$, $l = 1, 2$. The canonical correlation in the feature space can then be written as:

$$\max_{w^{(1)}_K, w^{(2)}_K} \; w^{(1)T}_K \Phi^{(1)T}_c \Phi^{(2)}_c w^{(2)}_K \tag{4}$$

such that

$$w^{(1)T}_K \Phi^{(1)T}_c \Phi^{(1)}_c w^{(1)}_K = 1, \quad w^{(2)T}_K \Phi^{(2)T}_c \Phi^{(2)}_c w^{(2)}_K = 1,$$

where the projection vectors $w^{(1)}_K$ and $w^{(2)}_K$ are assumed to lie in the span of the mapped data:

$$w^{(1)}_K = \Phi^{(1)T}_c \beta^{(1)}, \quad w^{(2)}_K = \Phi^{(2)T}_c \beta^{(2)}.$$

Applying the kernel trick results in the following optimization problem:

$$\max_{\beta^{(1)},\beta^{(2)}} \; \beta^{(1)T} \Omega^{(1)}_c \Omega^{(2)}_c \beta^{(2)} \tag{5}$$

such that

$$\beta^{(1)T} \Omega^{(1)}_c \Omega^{(1)}_c \beta^{(1)} = 1, \quad \beta^{(2)T} \Omega^{(2)}_c \Omega^{(2)}_c \beta^{(2)} = 1$$

where $\Omega^{(l)}_c$ denotes the $l$-th centred kernel matrix $\Omega^{(l)}_c = P\Omega^{(l)}P$, $P = I_N - (1/N)1_N 1_N^T$ is the centring matrix, $I_N$ is the $N \times N$ identity matrix, $1_N$ is a vector of $N$ ones, $\Omega^{(l)}$ is the kernel matrix with $ij$-entry $\Omega^{(l)}_{ij} = K^{(l)}(x^{(l)}_i, x^{(l)}_j)$, and $K^{(l)}(x_i, x_j) = \varphi^{(l)}(x_i)^T \varphi^{(l)}(x_j)$ is a Mercer kernel, for $l = 1, 2$. The problem in (5) can be written as the generalized eigenvalue problem:

$$\begin{bmatrix} 0 & \Omega^{(1)}_c \Omega^{(2)}_c \\ \Omega^{(2)}_c \Omega^{(1)}_c & 0 \end{bmatrix}\begin{bmatrix} \beta^{(1)} \\ \beta^{(2)} \end{bmatrix} = \rho \begin{bmatrix} \Omega^{(1)}_c \Omega^{(1)}_c & 0 \\ 0 & \Omega^{(2)}_c \Omega^{(2)}_c \end{bmatrix}\begin{bmatrix} \beta^{(1)} \\ \beta^{(2)} \end{bmatrix}.$$

The nonzero solutions to this generalized eigenvalue problem are always $\rho = \pm 1$. This is due to the fact that the angle between the subspaces spanned by the columns of the kernel matrices $\Omega^{(1)}_c$ and $\Omega^{(2)}_c$ is always equal to zero, independent of the data. Hence, the canonical correlation estimate is always equal to one (Bach & Jordan, 2002). This issue can also be interpreted as ill-conditioning, since the kernel matrices can be singular or near-singular. Several ad-hoc regularization schemes have been proposed in order to obtain useful estimators of the canonical correlation in high dimensional feature spaces (Bach & Jordan, 2002; Gretton et al., 2005).

The regularized method proposed in Bach and Jordan (2002) consists of finding the smallest eigenvalue of the following problem:

$$\begin{bmatrix} (\Omega^{(1)}_c + \tfrac{Nc}{2} I_N)^2 & \Omega^{(1)}_c \Omega^{(2)}_c \\ \Omega^{(2)}_c \Omega^{(1)}_c & (\Omega^{(2)}_c + \tfrac{Nc}{2} I_N)^2 \end{bmatrix}\begin{bmatrix} \beta^{(1)} \\ \beta^{(2)} \end{bmatrix} = \xi \begin{bmatrix} (\Omega^{(1)}_c + \tfrac{Nc}{2} I_N)^2 & 0 \\ 0 & (\Omega^{(2)}_c + \tfrac{Nc}{2} I_N)^2 \end{bmatrix}\begin{bmatrix} \beta^{(1)} \\ \beta^{(2)} \end{bmatrix}$$

where $c$ is a small positive regularization constant depending on the sample size.
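For reference, a small NumPy/SciPy sketch of this regularized two-set problem is given below. It is our own illustration, not code from the paper: it centres two RBF kernel matrices with $P = I_N - (1/N)1_N 1_N^T$ and returns the smallest generalized eigenvalue of the system above; the RBF width and the constant $c$ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_kernel(X, Z, sigma2):
    """RBF kernel matrix with entries exp(-||x - z||^2 / (2 * sigma2))."""
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2.0 * sigma2))

def regularized_kcca(x1, x2, sigma2=1.0, c=1e-3):
    """Smallest eigenvalue of the regularized kernel CCA problem stated above
    (two one-dimensional signals given as arrays of length N)."""
    N = len(x1)
    P = np.eye(N) - np.ones((N, N)) / N                     # centring matrix
    O1 = P @ rbf_kernel(x1[:, None], x1[:, None], sigma2) @ P
    O2 = P @ rbf_kernel(x2[:, None], x2[:, None], sigma2) @ P
    R1 = np.linalg.matrix_power(O1 + 0.5 * N * c * np.eye(N), 2)
    R2 = np.linalg.matrix_power(O2 + 0.5 * N * c * np.eye(N), 2)
    A = np.block([[R1, O1 @ O2], [O2 @ O1, R2]])
    B = np.block([[R1, np.zeros((N, N))], [np.zeros((N, N)), R2]])
    return eigh(A, B, eigvals_only=True)[0]                 # xi_min, close to 1 under independence

rng = np.random.default_rng(1)
x1 = rng.uniform(-1, 1, 200)
print(round(regularized_kcca(x1, x1 ** 2), 3))                      # dependent pair: smaller xi_min
print(round(regularized_kcca(x1, rng.uniform(-1, 1, 200)), 3))      # independent pair: closer to 1
```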

2.3. LS-SVM approach to kernel CCA

An LS-SVM formulation to kernel CCA was introduced in Suykens et al. (2002) with the following primal form:

$$\max_{w,v,e,r} \; \gamma e^T r - \nu_1 \frac{1}{2} e^T e - \nu_2 \frac{1}{2} r^T r - \frac{1}{2} w^T w - \frac{1}{2} v^T v \tag{7}$$

such that

$$e = \Phi^{(1)}_c w, \quad r = \Phi^{(2)}_c v.$$

Proposition 1 (Suykens et al., 2002). Given a positive definite kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K(x, z) = \varphi(x)^T \varphi(z)$ and regularization constants $\gamma, \nu_1, \nu_2 \in \mathbb{R}^+$, the dual problem to (7) is given by the following eigenvalue problem:

$$\begin{bmatrix} 0 & \Omega^{(2)}_c \\ \Omega^{(1)}_c & 0 \end{bmatrix}\begin{bmatrix} \xi \\ \tau \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega^{(1)}_c + I_N & 0 \\ 0 & \nu_2 \Omega^{(2)}_c + I_N \end{bmatrix}\begin{bmatrix} \xi \\ \tau \end{bmatrix}. \tag{8}$$

Proof. Consider the Lagrangian of (7):

$$\mathcal{L}(w, v, e, r; \xi, \tau) = \gamma e^T r - \nu_1 \frac{1}{2} e^T e - \nu_2 \frac{1}{2} r^T r - \frac{1}{2} w^T w - \frac{1}{2} v^T v - \xi^T (e - \Phi^{(1)}_c w) - \tau^T (r - \Phi^{(2)}_c v)$$

where $\xi, \tau$ are Lagrange multiplier vectors. The Karush-Kuhn-Tucker (KKT) conditions for optimality are

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\to\; w = \Phi^{(1)T}_c \xi$$
$$\frac{\partial \mathcal{L}}{\partial v} = 0 \;\to\; v = \Phi^{(2)T}_c \tau$$
$$\frac{\partial \mathcal{L}}{\partial e} = 0 \;\to\; \gamma r = \nu_1 e + \xi$$
$$\frac{\partial \mathcal{L}}{\partial r} = 0 \;\to\; \gamma e = \nu_2 r + \tau$$
$$\frac{\partial \mathcal{L}}{\partial \xi} = 0 \;\to\; e = \Phi^{(1)}_c w$$
$$\frac{\partial \mathcal{L}}{\partial \tau} = 0 \;\to\; r = \Phi^{(2)}_c v.$$

Eliminating the primal variables $w, v, r, e$ gives:

$$\gamma \Phi^{(2)}_c \Phi^{(2)T}_c \tau = \nu_1 \Phi^{(1)}_c \Phi^{(1)T}_c \xi + \xi$$
$$\gamma \Phi^{(1)}_c \Phi^{(1)T}_c \xi = \nu_2 \Phi^{(2)}_c \Phi^{(2)T}_c \tau + \tau.$$

Defining $\lambda = 1/\gamma$ and applying the definition of a positive definite kernel function leads to the following set of equations:

$$\Omega^{(2)}_c \tau = \lambda(\nu_1 \Omega^{(1)}_c \xi + \xi)$$
$$\Omega^{(1)}_c \xi = \lambda(\nu_2 \Omega^{(2)}_c \tau + \tau)$$

which can be written as the generalized eigenvalue problem in (8).

Remark 1. Note that the primal objective function (7) does not have the same expression as the correlation coefficient, but takes the numerator with a positive sign and the denominator with a negative sign. Moreover, it can be considered as a generalization of the problem $\min_{w,v} \|e - r\|_2^2$ which is known in CCA (Gittins, 1985).

Remark 2. Note that the regularization is incorporated in the primal optimization problem in a natural way.

Remark 3. The projections of the mapped training data onto the $i$-th eigenvector (also called the score variables) become

$$z^{(i)}_e = \Phi^{(1)}_c w = \Omega^{(1)}_c \xi^{(i)}, \tag{9}$$
$$z^{(i)}_r = \Phi^{(2)}_c v = \Omega^{(2)}_c \tau^{(i)}, \tag{10}$$

$i = 1, \ldots, 2N$.

Remark 4. The optimal $\lambda^\star, \xi^\star, \tau^\star$ can be selected such that the correlation coefficient is maximized:

$$\max_{i \in \{1, \ldots, 2N\}} \frac{z^{(i)T}_e z^{(i)}_r}{\|z^{(i)}_e\|_2 \|z^{(i)}_r\|_2}.$$
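To make Proposition 1 and Remarks 3 and 4 concrete, the sketch below (our own, in NumPy/SciPy; the kernels and the $\nu$ values are illustrative) assembles the generalized eigenvalue problem (8) from two centred kernel matrices, computes the score variables (9) and (10) for every eigenvector and returns the maximal correlation coefficient of Remark 4.

```python
import numpy as np
from scipy.linalg import eig

def lssvm_kcca_two_sets(O1c, O2c, nu1=1.0, nu2=1.0):
    """Dual LS-SVM kernel CCA, Eq. (8):
    [0 O2c; O1c 0] u = lambda [nu1*O1c + I, 0; 0, nu2*O2c + I] u.

    O1c, O2c are centred N x N kernel matrices. Returns the maximal correlation
    of the score variables z_e = O1c @ xi and z_r = O2c @ tau (Remark 4)."""
    N = O1c.shape[0]
    Z = np.zeros((N, N))
    A = np.block([[Z, O2c], [O1c, Z]])
    B = np.block([[nu1 * O1c + np.eye(N), Z], [Z, nu2 * O2c + np.eye(N)]])
    vals, vecs = eig(A, B)                      # non-symmetric pencil: use eig
    best = -np.inf
    for i in range(2 * N):
        xi, tau = np.real(vecs[:N, i]), np.real(vecs[N:, i])   # sketch: keep real parts
        z_e, z_r = O1c @ xi, O2c @ tau                          # score variables (9)-(10)
        denom = np.linalg.norm(z_e) * np.linalg.norm(z_r)
        if denom > 1e-12:
            best = max(best, float(z_e @ z_r) / denom)
    return best

# tiny demo with centred linear kernels (illustrative only)
rng = np.random.default_rng(2)
X1 = rng.standard_normal((60, 2)); X2 = X1 @ rng.standard_normal((2, 2))
P = np.eye(60) - np.ones((60, 60)) / 60
print(round(lssvm_kcca_two_sets(P @ X1 @ X1.T @ P, P @ X2 @ X2.T @ P), 3))  # near 1: linearly related sets
```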

3. Multivariate kernel CCA

In this section, we extend the LS-SVM formulation to kernel CCA to $m$ sets of variables. Given training data $x^{(l)}_i$, $i = 1, \ldots, N$, $l = 1, \ldots, m$, the multivariate kernel CCA problem can be formulated in the primal as:

$$\max_{w^{(l)}, e^{(l)}} \; \frac{\gamma}{2}\sum_{l=1}^{m}\sum_{k \neq l} e^{(l)T} e^{(k)} - \frac{1}{2}\sum_{l=1}^{m} \nu_l e^{(l)T} e^{(l)} - \frac{1}{2}\sum_{l=1}^{m} w^{(l)T} w^{(l)} \tag{11}$$

such that $e^{(l)} = \Phi^{(l)}_c w^{(l)}$, $l = 1, \ldots, m$, where $e^{(l)}$ is the compact form of the $l$-th projected variables

$$e^{(l)}_i = w^{(l)T}\big(\varphi^{(l)}(x^{(l)}_i) - \mu_{\varphi^{(l)}}\big), \quad i = 1, \ldots, N, \; l = 1, \ldots, m,$$

the $l$-th centred feature matrix $\Phi^{(l)}_c \in \mathbb{R}^{N \times n_h}$ is

$$\Phi^{(l)}_c = \begin{bmatrix} \varphi^{(l)}(x^{(l)}_1)^T - \mu^T_{\varphi^{(l)}} \\ \varphi^{(l)}(x^{(l)}_2)^T - \mu^T_{\varphi^{(l)}} \\ \vdots \\ \varphi^{(l)}(x^{(l)}_N)^T - \mu^T_{\varphi^{(l)}} \end{bmatrix},$$

$\gamma, \nu_l$ are regularization parameters, $\varphi^{(l)}(\cdot)$ are the mappings to high dimensional spaces and

$$\mu_{\varphi^{(l)}} = (1/N)\sum_{i=1}^{N} \varphi^{(l)}(x^{(l)}_i), \quad l = 1, \ldots, m.$$

Lemma 1. Given $m$ positive definite kernel functions $K^{(l)}: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $K^{(l)}(x, z) = \varphi^{(l)}(x)^T \varphi^{(l)}(z)$ and regularization constants $\gamma, \nu_l \in \mathbb{R}^+$, $l = 1, \ldots, m$, the dual problem to (11) is given by the following generalized eigenvalue problem:

$$K\alpha = \lambda R\alpha \tag{12}$$

with $K$, $R$ and $\alpha$ defined as follows:

$$K = \begin{bmatrix} 0 & \Omega^{(2)}_c & \ldots & \Omega^{(m)}_c \\ \Omega^{(1)}_c & 0 & \ldots & \Omega^{(m)}_c \\ \vdots & \vdots & \ddots & \vdots \\ \Omega^{(1)}_c & \Omega^{(2)}_c & \ldots & 0 \end{bmatrix},$$

$$R = \begin{bmatrix} R^{(1)} & 0 & \ldots & 0 \\ 0 & R^{(2)} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & R^{(m)} \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha^{(1)} \\ \alpha^{(2)} \\ \vdots \\ \alpha^{(m)} \end{bmatrix},$$

$R^{(l)} = I_N + \nu_l \Omega^{(l)}_c$, $l = 1, \ldots, m$, and $\lambda = 1/\gamma$.

Proof. The Lagrangian for this constrained optimization problem is:

$$\mathcal{L}(w^{(l)}, e^{(l)}; \alpha^{(l)}) = \frac{\gamma}{2}\sum_{l=1}^{m}\sum_{k \neq l} e^{(l)T} e^{(k)} - \frac{1}{2}\sum_{l=1}^{m} \nu_l e^{(l)T} e^{(l)} - \frac{1}{2}\sum_{l=1}^{m} w^{(l)T} w^{(l)} - \sum_{l=1}^{m} \alpha^{(l)T}\big(e^{(l)} - \Phi^{(l)}_c w^{(l)}\big)$$

with conditions for optimality given by

$$\frac{\partial \mathcal{L}}{\partial w^{(l)}} = 0 \;\to\; w^{(l)} = \Phi^{(l)T}_c \alpha^{(l)}$$
$$\frac{\partial \mathcal{L}}{\partial e^{(l)}} = 0 \;\to\; \gamma \sum_{k \neq l} e^{(k)} = \nu_l e^{(l)} + \alpha^{(l)}$$
$$\frac{\partial \mathcal{L}}{\partial \alpha^{(l)}} = 0 \;\to\; e^{(l)} = \Phi^{(l)}_c w^{(l)},$$

for $l = 1, \ldots, m$. Eliminating the primal variables $w^{(l)}, e^{(l)}$ gives:

$$\gamma \sum_{k \neq l} \Phi^{(k)}_c \Phi^{(k)T}_c \alpha^{(k)} = (\nu_l \Phi^{(l)}_c \Phi^{(l)T}_c + I_N)\alpha^{(l)}, \quad l = 1, \ldots, m.$$

Defining $\lambda = 1/\gamma$ and applying the definition of a positive definite kernel function leads to the following set of equations:

$$\sum_{k \neq l} \Omega^{(k)}_c \alpha^{(k)} = \lambda(\nu_l \Omega^{(l)}_c + I_N)\alpha^{(l)}, \quad l = 1, \ldots, m$$

which can be written as the generalized eigenvalue problem in (12).

Remark 5. Note that the regularized matrix $R$ differs from the schemes proposed in Bach and Jordan (2002), Gretton et al. (2005) and Yamanishi, Vert, Nakaya, and Kanehisa (2003). These schemes start from an unregularized kernel CCA problem and perform regularization afterwards in an ad-hoc manner. As already discussed, the proposed regularization arises naturally from the primal optimization problem, providing insights and leading to better numerically conditioned eigenvalue problems.

Remark 6. The score variables for training data can be expressed as

$$z^{(l)}_i = \Phi^{(l)}_c w^{(l)} = \Omega^{(l)}_c \alpha^{(l)}_i,$$

where $\alpha^{(l)}_i$ is the $\alpha^{(l)}$ corresponding to the $i$-th eigenvector, $i = 1, \ldots, Nm$, $l = 1, \ldots, m$.

Remark 7. The optimal $\lambda^\star, \alpha^\star$ can be selected such that the sum of the correlation coefficients is maximized:

$$\max_{i \in \{1, \ldots, Nm\}} \sum_{l=1}^{m}\sum_{k \neq l} \frac{(z^{(l)}_i)^T z^{(k)}_i}{\|z^{(l)}_i\|_2 \|z^{(k)}_i\|_2}.$$

Remark 8. Given new test data $x^{(l)}_j$, $j = 1, \ldots, N_{\mathrm{test}}$, $l = 1, \ldots, m$, the projections $z^{(l)}_{\mathrm{test}}$ can be calculated as:

$$z^{(l)}_{\mathrm{test}} = \Omega^{(l)}_{\mathrm{test}} \alpha^{(l)\star}, \tag{13}$$

where $\Omega^{(l)}_{\mathrm{test}} \in \mathbb{R}^{N_{\mathrm{test}} \times N}$ is the kernel matrix evaluated using the test points, with $ji$-entry $\Omega^{(l)}_{\mathrm{test},ji} = K^{(l)}(x^{(l)}_j, x^{(l)}_i)$, $j = 1, \ldots, N_{\mathrm{test}}$, $i = 1, \ldots, N$.
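A dense version of (12) and of the out-of-sample projection (13) takes only a few lines. The sketch below is our own illustration (helper names and the RBF width are assumptions, and for large $N$ the low-rank reduction of Section 5 would be used instead): it assembles $K$ and $R$ from $m$ centred kernel matrices and projects test points with given dual vectors $\alpha^{(l)\star}$.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X, Z, sigma2):
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2.0 * sigma2))

def build_dual_problem(signals, sigma2=1.0, nu=1.0):
    """Assemble K and R of Eq. (12) from m signals, each of shape (N, d)."""
    N = signals[0].shape[0]
    P = np.eye(N) - np.ones((N, N)) / N
    Oc = [P @ rbf(s, s, sigma2) @ P for s in signals]        # centred kernel matrices
    m = len(Oc)
    K = np.block([[np.zeros((N, N)) if k == l else Oc[k] for k in range(m)]
                  for l in range(m)])                        # zero diagonal, Omega_c^(k) off-diagonal
    R = np.block([[np.eye(N) + nu * Oc[l] if k == l else np.zeros((N, N))
                   for k in range(m)] for l in range(m)])    # blkdiag(I + nu * Omega_c^(l))
    return K, R, Oc

def test_projection(x_train, x_test, alpha_star, sigma2=1.0):
    """Out-of-sample score variables, Eq. (13): z_test = Omega_test @ alpha_star."""
    return rbf(x_test, x_train, sigma2) @ alpha_star

rng = np.random.default_rng(3)
sigs = [rng.standard_normal((50, 1)) for _ in range(3)]
K, R, Oc = build_dual_problem(sigs)
print(K.shape, R.shape)    # (150, 150) (150, 150), i.e. Nm x Nm
```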

4. ICA and kernel CCA

4.1. The F-correlation

Consider the following statistical model:

$$Y = MS,$$

where $S$ is a $p \times N$ matrix representing $N$ identically distributed observations of $p$ independent components, $Y$ is the $p \times N$ matrix of mixtures and $M$ is a $p \times p$ mixing matrix assumed to be invertible. The ICA problem (Amari, Cichocki, & Yang, 1996; Cardoso, 1999; Comon, 1994; Hyvärinen et al., 2001; Hyvärinen & Pajunen, 1999) consists of estimating $S$ based on $Y$. If the distributions of the components are specified, maximum likelihood can be used to estimate the ICA parametric model (Cardoso, 1999). However, in practice, the distributions of the components are usually not known and the ICA problem is tackled using an estimate of the demixing matrix $W = M^{-1}$, allowing the estimation of independent components via $S = WY$ (Hyvärinen et al., 2001). A typical approach to ICA consists of an approximation of the mutual information between the components of $S$ (Amari et al., 1996; Cardoso, 1999). However, alternative contrast functions based on higher-order moments have also been proposed.

The matrix of mixtures $Y$ is often preprocessed by making its covariance matrix equal to the identity matrix. This preprocessing step is called whitening and reduces the number of parameters to be estimated, because the demixing matrix $W$ is then necessarily orthogonal (Hyvärinen et al., 2001). Therefore, the ICA problem can be reduced to finding an orthogonal matrix $W$ such that the components of $S$ are independent. A measure of independence based on kernel canonical correlation was introduced in Bach and Jordan (2002) with the F-correlation functional as contrast function. This functional establishes a link between canonical correlation in feature spaces induced by RBF kernels and independence in the original input space, and it is defined as:

$$\rho_{\mathcal{F}}\big(f^{(1)}(x^{(1)}), f^{(2)}(x^{(2)})\big) = \max_{f^{(1)} \in \mathcal{F}^{(1)}, f^{(2)} \in \mathcal{F}^{(2)}} \mathrm{corr}\big(f^{(1)}(x^{(1)}), f^{(2)}(x^{(2)})\big)$$

where $f^{(l)}(x^{(l)}) = \langle \varphi^{(l)}(x^{(l)}), f^{(l)}\rangle$, $l = 1, 2$, by the reproducing property of the RKHS. In the case of RBF kernels, this functional is zero if and only if $x^{(1)}$ and $x^{(2)}$ are independent (Bach & Jordan, 2002). In the multivariate case, the F-correlation characterizes pairwise independence. Different kernel-based measures of independence have been proposed, such as the constrained covariance (COCO) (Gretton et al., 2005), the kernel mutual information (KMI) (Gretton et al., 2005) and the kernel generalized variance (KGV) (Bach & Jordan, 2002).
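The whitening step mentioned above is standard; a minimal sketch of it (our own, via an eigendecomposition of the sample covariance) is given here for completeness.

```python
import numpy as np

def whiten(Y):
    """Whiten a p x N mixture matrix Y: zero mean and (approximately) identity covariance.

    After whitening, the remaining demixing matrix can be restricted to be orthogonal,
    as discussed above."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    C = Yc @ Yc.T / Yc.shape[1]                   # p x p sample covariance
    d, E = np.linalg.eigh(C)
    return (E @ np.diag(1.0 / np.sqrt(d)) @ E.T) @ Yc

rng = np.random.default_rng(4)
S = np.vstack([rng.laplace(size=1000), rng.uniform(-1, 1, 1000)])   # two independent sources
Y = np.array([[1.0, 0.5], [0.3, 2.0]]) @ S                           # mixtures Y = M S
print(np.round(whiten(Y) @ whiten(Y).T / 1000, 3))                   # approximately the identity
```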

4.2. The kernel regularized correlation (KRC) contrast function

The largest generalized eigenvalue of the problem (12) provides a measure of correlation among the sets of data. Even though this measure may not correspond to the correlation coefficient, it can be used as a contrast function for ICA. Consider the following modification to (12):

$$(K + R)\alpha = \zeta R\alpha, \tag{14}$$

where $\zeta = 1 + \lambda$. The largest eigenvalue of (12) is related to the smallest eigenvalue $\zeta_{\min}$ of (14), and $\zeta_{\min}$ is bounded between zero and one. Therefore, the proposed kernel regularized correlation (KRC) contrast function for ICA is defined as follows for training data:

$$\rho(\Omega^{(1)}, \ldots, \Omega^{(m)}) = 1 - \zeta_{\min}. \tag{15}$$

The KRC for test data is the sum of the correlation coefficients of the projections:

$$\rho(\Omega^{(1)}_{\mathrm{test}}, \ldots, \Omega^{(m)}_{\mathrm{test}}) = \sum_{l=1}^{m}\sum_{k \neq l} \frac{z^{(l)T}_{\mathrm{test}} z^{(k)}_{\mathrm{test}}}{\|z^{(l)}_{\mathrm{test}}\|_2 \|z^{(k)}_{\mathrm{test}}\|_2}. \tag{16}$$
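Putting (14) through (16) together, a direct dense evaluation of the KRC can be sketched as below (our own code; the RBF width and the single $\nu$ are illustrative, and the incomplete Cholesky reduction of Section 5 is what one would use for large $N$).

```python
import numpy as np
from scipy.linalg import eig
from scipy.spatial.distance import cdist

def rbf(X, Z, sigma2):
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2.0 * sigma2))

def krc_train(signals, sigma2=1.0, nu=1.0):
    """KRC contrast (15) on training data: 1 - zeta_min of (K + R) alpha = zeta R alpha."""
    N = signals[0].shape[0]
    P = np.eye(N) - np.ones((N, N)) / N
    Oc = [P @ rbf(s, s, sigma2) @ P for s in signals]
    m = len(Oc)
    K = np.block([[np.zeros((N, N)) if k == l else Oc[k] for k in range(m)]
                  for l in range(m)])
    R = np.block([[np.eye(N) + nu * Oc[l] if k == l else np.zeros((N, N))
                   for k in range(m)] for l in range(m)])
    zeta = np.real(eig(K + R, R, right=False))
    return 1.0 - zeta.min()

def krc_test(z):
    """KRC for test data, Eq. (16): sum of pairwise correlations of the projections z[l]."""
    total = 0.0
    for l in range(len(z)):
        for k in range(len(z)):
            if k != l:
                total += float(z[l] @ z[k]) / (np.linalg.norm(z[l]) * np.linalg.norm(z[k]))
    return total

rng = np.random.default_rng(5)
s = rng.uniform(-1, 1, (100, 1))
print(round(krc_train([s, s ** 3]), 3))                           # dependent signals: clearly above 0
print(round(krc_train([s, rng.uniform(-1, 1, (100, 1))]), 3))     # independent signals: closer to 0
```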

5. Optimization on the Stiefel manifold

5.1. Computational issues

For real-life problems, the cost of storing (14) and computing the smallest eigenvalue can be prohibitive due to the size of the matrices involved in the eigendecomposition. However, low-rank approximations of the kernel matrices can be found by means of the incomplete Cholesky decomposition. An $N \times N$ kernel matrix $\Omega$ can be factorized as $\Omega = GG^T$ where $G \in \mathbb{R}^{N \times N}$ is lower triangular. If the eigenvalues of $\Omega$ decay sufficiently rapidly, then it is possible to find $G \in \mathbb{R}^{N \times M}$ such that $\|\Omega - GG^T\|_2 \leq \eta$ with $M \ll N$. This method is known as the incomplete Cholesky decomposition with symmetric pivoting and was used in Bach and Jordan (2002) and Gretton et al. (2005) to reduce the computational burden of the corresponding eigenvalue problem.

Making use of the incomplete Cholesky decomposition, the centred kernel matrices in (14) can be approximated as $\Omega^{(l)}_c \approx G^{(l)}_c G^{(l)T}_c$. From the SVD of $G^{(l)}_c$ we obtain $G^{(l)}_c G^{(l)T}_c = U^{(l)}\Lambda^{(l)}U^{(l)T}$, where $U^{(l)}$ is an $N \times M_l$ matrix with orthonormal columns and $\Lambda^{(l)}$ is an $M_l \times M_l$ diagonal matrix. Therefore we can approximate (14) with

$$K\alpha = \zeta R\alpha \tag{17}$$

where

$$K = \begin{bmatrix} U^{(1)}\Lambda^{(1)}_{\nu}U^{(1)T} & U^{(2)}\Lambda^{(2)}U^{(2)T} & \ldots & U^{(m)}\Lambda^{(m)}U^{(m)T} \\ U^{(1)}\Lambda^{(1)}U^{(1)T} & U^{(2)}\Lambda^{(2)}_{\nu}U^{(2)T} & \ldots & U^{(m)}\Lambda^{(m)}U^{(m)T} \\ \vdots & \vdots & \ddots & \vdots \\ U^{(1)}\Lambda^{(1)}U^{(1)T} & U^{(2)}\Lambda^{(2)}U^{(2)T} & \ldots & U^{(m)}\Lambda^{(m)}_{\nu}U^{(m)T} \end{bmatrix},$$

$$R = \begin{bmatrix} U^{(1)}\Lambda^{(1)}_{\nu}U^{(1)T} & 0 & \ldots & 0 \\ 0 & U^{(2)}\Lambda^{(2)}_{\nu}U^{(2)T} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & U^{(m)}\Lambda^{(m)}_{\nu}U^{(m)T} \end{bmatrix},$$

and $\Lambda^{(l)}_{\nu}$ is a diagonal matrix with $nn$-entry

$$\Lambda^{(l)}_{\nu,nn} = 1 + \nu_l \Lambda^{(l)}_{nn}, \quad n = 1, \ldots, M_l.$$

Substituting $\tilde{\alpha}^{(l)} = U^{(l)T}\alpha^{(l)}$, $l = 1, \ldots, m$, and premultiplying both sides of (17) by $\mathrm{blkdiag}(U^{(1)T}, \ldots, U^{(m)T})$ leads to the following reduced problem:

$$K_R\tilde{\alpha} = \zeta\tilde{\alpha} \tag{18}$$

where

$$K_R = \begin{bmatrix} I_{M_1} & T^{(1)}_U U^{(2)}_{\Lambda} & \ldots & T^{(1)}_U U^{(m)}_{\Lambda} \\ T^{(2)}_U U^{(1)}_{\Lambda} & I_{M_2} & \ldots & T^{(2)}_U U^{(m)}_{\Lambda} \\ \vdots & \vdots & \ddots & \vdots \\ T^{(m)}_U U^{(1)}_{\Lambda} & T^{(m)}_U U^{(2)}_{\Lambda} & \ldots & I_{M_m} \end{bmatrix}, \quad \tilde{\alpha} = \begin{bmatrix} \tilde{\alpha}^{(1)} \\ \tilde{\alpha}^{(2)} \\ \vdots \\ \tilde{\alpha}^{(m)} \end{bmatrix},$$

$T^{(l)}_U = T^{(l)}U^{(l)T}$, $T^{(l)}$ is a diagonal matrix with $T^{(l)}_{nn} = 1/\Lambda^{(l)}_{\nu,nn}$, $U^{(l)}_{\Lambda} = U^{(l)}\Lambda^{(l)}$, $l = 1, \ldots, m$, $n = 1, \ldots, M_l$. Note that the size of $K_R$ is $(\sum_{l=1}^{m} M_l) \times (\sum_{l=1}^{m} M_l)$, which is much smaller than the original $mN \times mN$ problem in (14).
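The reduction above can be reproduced with a pivoted incomplete Cholesky factorization followed by a small dense eigenvalue problem. The sketch below is our own reading of Eqs. (17) and (18); the helper names and the trace-based stopping rule are assumptions, and a single $\nu$ is used for all sets.

```python
import numpy as np

def incomplete_cholesky(K, eta):
    """Pivoted incomplete Cholesky: returns G (N x M) with K approx G @ G.T.

    Stops when the trace of the PSD residual drops below eta; the trace bounds
    the spectral norm, so ||K - G G^T||_2 <= eta as well."""
    N = K.shape[0]
    d = np.diag(K).astype(float).copy()
    cols = []
    while (not cols or d.sum() > eta) and len(cols) < N:
        i = int(np.argmax(d))
        g = K[:, i].astype(float).copy()
        for c in cols:
            g -= c * c[i]
        g /= np.sqrt(d[i])
        d = np.maximum(d - g ** 2, 0.0)
        cols.append(g)
    return np.column_stack(cols)

def krc_lowrank(Oc_list, nu, eta=1e-4):
    """KRC value 1 - zeta_min from the reduced problem K_R alpha~ = zeta alpha~, Eq. (18)."""
    U, Lam = [], []
    for Oc in Oc_list:
        G = incomplete_cholesky(Oc, eta)
        Ul, sv, _ = np.linalg.svd(G, full_matrices=False)    # Oc approx Ul diag(sv^2) Ul^T
        U.append(Ul)
        Lam.append(sv ** 2)
    m = len(U)
    blocks = []
    for l in range(m):
        Tl = 1.0 / (1.0 + nu * Lam[l])                       # diagonal of (Lambda_nu^(l))^{-1}
        row = []
        for k in range(m):
            if k == l:
                row.append(np.eye(len(Lam[l])))
            else:
                row.append((Tl[:, None] * (U[l].T @ U[k])) * Lam[k][None, :])
        blocks.append(row)
    KR = np.block(blocks)
    return 1.0 - np.linalg.eigvals(KR).real.min()

# small demo with two dependent one-dimensional signals
def centred_rbf(s, sigma2=1.0):
    D = ((s[:, None, :] - s[None, :, :]) ** 2).sum(-1)
    P = np.eye(len(s)) - np.ones((len(s), len(s))) / len(s)
    return P @ np.exp(-D / (2.0 * sigma2)) @ P

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, (200, 1))
print(round(krc_lowrank([centred_rbf(x), centred_rbf(np.sin(3 * x))], nu=1.0), 3))
```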

5.2. The Stiefel manifold

The Stiefel manifold is the set of all $m \times q$ matrices $W$ with orthonormal columns (i.e. $W^T W = I_q$). The geometry of this manifold was studied in Edelman, Arias, and Smith (1999), together with gradient descent algorithms for optimization on the manifold. The tangent space corresponds to the space of all matrices $A$ such that $W^T A$ is skew-symmetric. A geodesic is defined as the shortest path between two points on the manifold and is determined as $G_W(t) = W\exp(tW^T A)$, where $W$ is the starting point, $A$ is the direction and $t$ is the displacement. The direction $A$ can be determined as $A = D - WD^T W$, where $D$ is the derivative of the contrast function with respect to $W$. Gradient descent can then be used along the geodesic in the direction $A$ to obtain orthogonal matrices $W$ such that the contrast function is minimized. The kernel-based ICA approaches described in Bach and Jordan (2002) and Gretton et al. (2005) apply gradient descent with golden section search on the Stiefel manifold in order to obtain demixing matrices leading to independent components. We also use this approach to minimize the KRC. The computation of the KRC and the corresponding minimization on the Stiefel manifold are shown in Algorithms 1 and 2.

Algorithm 1 Kernel Regularized Correlation (KRC)
Input: Mixture set of $m$ sources $\{x^{(l)}_i\}_{i=1}^{N}$, $l = 1, \ldots, m$, RBF kernel parameter $\sigma^2$, regularization parameter $\nu$, Cholesky error threshold $\eta$.
Output: $\rho(\Omega^{(1)}_c, \ldots, \Omega^{(m)}_c)$.
1: Compute the Cholesky factor $G^{(l)}_c \in \mathbb{R}^{N \times M_l}$ such that $\|\Omega^{(l)}_c - G^{(l)}_c G^{(l)T}_c\|_2 \leq \eta$, $l = 1, \ldots, m$.
2: Compute the SVD of $G^{(l)}_c$ to obtain $U^{(l)}, \Lambda^{(l)}$ with $G^{(l)}_c G^{(l)T}_c = U^{(l)}\Lambda^{(l)}U^{(l)T}$, $l = 1, \ldots, m$, and build the matrix $K_R$ in (18).
3: Compute the smallest eigenvalue $\zeta_{\min}$ of the problem $K_R\tilde{\alpha} = \zeta\tilde{\alpha}$.
4: Return $1 - \zeta_{\min}$.

Algorithm 2 Gradient descent on the Stiefel manifold
Input: Whitened mixture set of $m$ sources $\{x^{(l)}_i\}_{i=1}^{N}$, $l = 1, \ldots, m$, RBF kernel parameter $\sigma^2$, regularization parameter $\nu$, Cholesky error threshold $\eta$, initial demixing matrix $W_0$.
Output: Demixing matrix $W$.
1: $W \leftarrow W_0$
2: repeat
3: Compute $D$, the derivative of the KRC with respect to $W$, using first-order differences.
4: Compute the direction $A = D - WD^T W$.
5: Find a displacement $t$ along the geodesic in the direction $A$ using golden section search, such that the KRC decreases.
6: Update $W \leftarrow W\exp(tW^T A)$.
7: until the change of $W$ in two consecutive iterations is less than some given threshold.

The computational complexity of the KRC is $O((mM)^2 N)$, where $M$ is the maximal rank considered by the incomplete Cholesky decompositions. However, as noted by Bach and Jordan (2002), the bottleneck of the computation is the incomplete Cholesky decomposition itself. Therefore, the empirical computational complexity of the KRC is $O(mM^2 N)$.
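A minimal version of the geodesic update in Algorithm 2 is sketched below (our own code, in NumPy/SciPy). The contrast is passed in as a black-box function of $W$, and the numerical gradient, the fixed step grid and the stopping rule are simplified stand-ins for the first-order differences and the golden section search described above.

```python
import numpy as np
from scipy.linalg import expm

def stiefel_descent(contrast, W0, n_iter=50, fd_eps=1e-4,
                    steps=(1.0, 0.3, 0.1, 0.03, 0.01)):
    """Gradient descent of a contrast function over square orthogonal matrices W."""
    W = W0.copy()
    for _ in range(n_iter):
        f0 = contrast(W)
        D = np.zeros_like(W)                      # numerical derivative of the contrast w.r.t. W
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                Wp = W.copy(); Wp[i, j] += fd_eps
                D[i, j] = (contrast(Wp) - f0) / fd_eps
        A = D - W @ D.T @ W                       # direction in the tangent space
        best_f, best_W = f0, W
        for t in steps:                           # crude search along the geodesic W exp(-t W^T A)
            W_t = W @ expm(-t * (W.T @ A))        # W^T A is skew-symmetric, so W_t stays orthogonal
            f_t = contrast(W_t)
            if f_t < best_f:
                best_f, best_W = f_t, W_t
        if best_f >= f0 - 1e-8:
            break                                 # no further decrease: stop
        W = best_W
    return W

# demo with a kurtosis-style surrogate contrast (a stand-in for the KRC, which
# would be plugged in the same way) on a whitened 2 x N orthogonal mixture
rng = np.random.default_rng(7)
S = np.vstack([np.sign(rng.standard_normal(500)),
               np.sqrt(3.0) * rng.uniform(-1, 1, 500)])        # unit-variance sub-Gaussian sources
theta = 0.7
M = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Y = M @ S
contrast = lambda W: np.sum(np.mean((W @ Y) ** 4, axis=1))
W_hat = stiefel_descent(contrast, np.eye(2))
print(bool(np.abs(W_hat @ W_hat.T - np.eye(2)).max() < 1e-8))  # stays orthogonal: True
print(bool(contrast(W_hat) <= contrast(np.eye(2))))            # contrast did not increase: True
```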

6. Model selection

Typically, the free parameters in kernel-based approaches to ICA are set heuristically to values that show good performance on a number of experiments. However, overfitting may occur, especially when few data are available, because the estimation of the correlation coefficient is done solely on training data. One of the main advantages of our proposed approach is a clear constrained optimization problem for kernel CCA and the corresponding measure of independence. This allows the computation of projected variables for out-of-sample (test) data points that can be used to select values for the free parameters. The parameter values are chosen such that the normalized correlation coefficient is minimal for validation data. This ensures statistical reliability of the estimated canonical correlation.

We assume that the RBF kernel parameters are identical, $(\sigma^{(l)})^2 = \sigma^2$, $l = 1, \ldots, m$, which means considering one feature map $\varphi^{(l)} = \varphi$. The regularization parameters are also considered to be identical, $\nu_l = \nu$, and fixed to some value.


Fig. 1. Contour plots of the optimization surfaces for several kernel-based contrast functions for ICA. Three sources were mixed using a product of known Jacobi rotations. The optimal set of angles is shown as a black dot. Top left: Amari divergence. Top right: Proposed KRC using $\sigma^2 = 5$, $\nu = 2$. Bottom left: KCCA criterion (Bach & Jordan, 2002) using $\sigma^2 = 1$, $\kappa = 2 \times 10^{-2}$. Bottom right: KGV criterion (Bach & Jordan, 2002) using the same set of parameters as KCCA. Note that all contrast functions are smooth and the black dot corresponds to a local minimum.

Given validation data $x^{(l)}_h$, $h = 1, \ldots, N_{\mathrm{val}}$, $l = 1, \ldots, m$, the following model selection criterion is proposed:

$$\rho(\Omega^{(1)}_{\mathrm{val}}, \ldots, \Omega^{(m)}_{\mathrm{val}}) = \sum_{l=1}^{m}\sum_{k \neq l} \frac{z^{(l)T}_{\mathrm{val}} z^{(k)}_{\mathrm{val}}}{\|z^{(l)}_{\mathrm{val}}\|_2 \|z^{(k)}_{\mathrm{val}}\|_2 \|\Omega_{\mathrm{val}}\|_F} \tag{19}$$

where $z^{(l)}_{\mathrm{val}} = \Omega^{(l)}_{\mathrm{val}}\alpha^{(l)\star}$, $\Omega^{(l)}_{\mathrm{val}}$ is the kernel matrix evaluated using the validation points with $hi$-entry $\Omega^{(l)}_{\mathrm{val},hi} = K^{(l)}(x^{(l)}_h, x^{(l)}_i)$, $h = 1, \ldots, N_{\mathrm{val}}$, $i = 1, \ldots, N$, and $\Omega_{\mathrm{val}} = \mathrm{blkdiag}(\Omega^{(1)}_{\mathrm{val}}, \ldots, \Omega^{(m)}_{\mathrm{val}})$. Note that, for RBF kernels, the correlation measure decreases as $\sigma^2$ increases (Bach & Jordan, 2002). Therefore, $\|\Omega_{\mathrm{val}}\|_F$ in (19) fixes the scaling introduced by $\sigma^2$.

7. Empirical results

In this section, we present empirical results. In all experiments, we assume that the number of sources $m$ is given, and the incomplete Cholesky parameter was fixed to $\eta = 1 \times 10^{-4}$. We use the Amari divergence (Amari et al., 1996) to assess the performance of the proposed approach when the optimal demixing matrix $W$ is known. This error measure compares two demixing matrices and is equal to zero if and only if the matrices represent the same components. Moreover, the Amari divergence is invariant to permutation and scaling, which makes it suitable for ICA. We compared the proposed KRC with four existing ICA algorithms: FastICA (Hyvärinen & Oja, 1997), the KCCA contrast (Bach & Jordan, 2002), the KGV (Bach & Jordan, 2002) and Jade (Cardoso, 1999). The kernel-based algorithms were initialized with the solution of the KGV using a Hermite polynomial of degree $d = 3$ and width $\sigma = 1.5$. This solution does not characterize independence but it was shown to be a good starting point (Bach & Jordan, 2002). The computation time of the eigenvalue decomposition involved in the calculation of the KRC for training data was 0.98 s for $N = 28{,}000$, $m = 3$, $\eta = 10^{-4}$ on a Pentium 4, 2.8 GHz, 1 GB RAM, using the Lanczos algorithm in Matlab. In the case of test and validation data, the computation time increased to 23.56 s because the projected variables have to be calculated in order to compute the correlation coefficient.

7.1. Toy examples—simple signals

In this experiment we used three simple signals: a sine function, a sawtooth function and a sinc function. The sample size was $N = 400$. The sources were mixed using a product of known Jacobi rotations with angles $\theta_3 = -\pi/3$, $\theta_2 = -\pi/2$, $\theta_1 = -\pi/5$. In this way, the optimal demixing matrix is a 3-D rotation matrix with angles $-\theta_1, -\theta_2, -\theta_3$. The first angle was fixed to $\pi/5$ and an estimate of the demixing matrix was made using a grid in the range $[0, \pi]$ for the remaining two angles. Fig. 1 shows the contour plots of the Amari divergence, the KCCA and KGV criteria described in Bach and Jordan (2002), and the proposed KRC criterion. The KCCA and KGV parameters were set to the values recommended by the authors. The KRC parameters were set to $\sigma^2 = 5$, $\nu = 2$. The black dot shows the location of the optimal set of angles, which corresponds to a local minimum for all criteria. The effect of the KRC parameters $\sigma^2, \nu$ on the optimization surface can be seen in Fig. 2. Note that the optimal set of angles corresponds to a local minimum of the KRC for a wide range of parameter values.

Fig. 2. Effect of the KRC parameters on the optimization surface. Top row: $\sigma^2 = 0.1$. Middle row: $\sigma^2 = 1$. Bottom row: $\sigma^2 = 10$. Left column: $\nu = 0.1$. Middle column: $\nu = 1$. Right column: $\nu = 10$. The black dot indicates the optimal set of angles. Note that for most parameter values the black dot corresponds to a local minimum of the KRC.
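The mixing used in this toy example is a product of planar (Jacobi) rotations. A helper of the kind sketched below (our own construction; the plane ordering and sign conventions of the angles are assumptions, and the contrast used in the demo is a cheap stand-in for the KRC) can reproduce the grid over the two free angles.

```python
import numpy as np

def jacobi_rotation(m, i, j, theta):
    """m x m identity with a planar rotation by theta in the (i, j) coordinate plane."""
    R = np.eye(m)
    R[i, i] = R[j, j] = np.cos(theta)
    R[i, j], R[j, i] = -np.sin(theta), np.sin(theta)
    return R

def rotation_3d(t1, t2, t3):
    """Product of three Jacobi rotations, used here as a 3-D (de)mixing matrix."""
    return (jacobi_rotation(3, 0, 1, t1) @ jacobi_rotation(3, 0, 2, t2)
            @ jacobi_rotation(3, 1, 2, t3))

M = rotation_3d(-np.pi / 5, -np.pi / 2, -np.pi / 3)     # mixing built from the quoted angles
print(bool(np.allclose(M @ M.T, np.eye(3))))            # orthogonal: True

# grid over the two free angles, with the first angle fixed to pi/5 as in the experiment;
# contrast(...) stands for any of the criteria discussed in this paper, e.g. the KRC
def contrast(W, Y):
    return np.sum(np.mean((W @ Y) ** 4, axis=1))        # cheap kurtosis-style stand-in

rng = np.random.default_rng(8)
Y = M @ rng.uniform(-1, 1, (3, 400))                     # toy mixtures
grid = np.linspace(0.0, np.pi, 20)
surface = np.array([[contrast(rotation_3d(np.pi / 5, t2, t3), Y) for t3 in grid]
                    for t2 in grid])
print(surface.shape)                                     # (20, 20): one value per angle pair
```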

7.2. Toy examples—source distributions

This experiment consisted of demixing six different source distributions. The sample size was $N = 900$. The distributions include a multimodal asymmetrical mixture of two Gaussians, a Student's t with three degrees of freedom, a uniform distribution, a mixture of a Gaussian and a uniform, a beta distribution and a gamma distribution. We performed 20 runs with different random mixing matrices and computed the mean Amari error for the KCCA, the KGV, the FastICA algorithm, Jade and the proposed KRC. As in the previous toy example, the parameters of the KCCA and KGV were set to the default values. The FastICA parameters were set to the default values and the kurtosis-based contrast function was used. Table 1 and Fig. 3 show the results. In all cases the KRC obtained the smallest Amari error. Mean computation times for the KCCA, the KGV and the KRC are shown in Fig. 4.

Table 1. Mean Amari error (multiplied by 100) of the source distributions experiment after 20 runs.

# sources  KCCA  KGV   KRC   FastICA  Jade
2          4.02  2.77  2.75  12.92    4.95
3          7.43  2.89  2.36   9.37    6.98
4          6.12  3.55  2.19   9.88    8.30
5          6.39  5.75  4.48  11.03    8.88
6          8.42  5.57  4.06   9.23    7.49

The KCCA and KGV parameters were set to the values recommended by the authors ($\sigma^2 = 1$, $\kappa = 2 \times 10^{-2}$) in Bach and Jordan (2002). The KRC parameters were set to $\sigma^2 = 0.5$, $\nu = 1$. The FastICA algorithm was set to the kurtosis-based contrast and the default parameters. In all cases the KRC obtained the smallest Amari error.

Fig. 3. Amari errors of the source distributions experiment for the KCCA, KGV and the proposed KRC contrast after 20 runs with different mixing matrices. Top left: three sources. Top right: four sources. Bottom left: five sources. Bottom right: six sources. The KRC performs better with respect to the KCCA and KGV.

Fig. 4. Mean computation times of the source distributions experiment. The KRC compares favourably with respect to the KCCA contrast.

7.3. Speech signals

We used three fragments of 6.25 s of speech sampled at 8 kHz with 8 bit quantization. We performed 10 runs, each with a random window of N = 10,000 consecutive data points. The fragments were combined with a randomly generated mixing matrix. The KCCA, KGV and the proposed KRC were initialized with Jade and the parameters were fixed to the defaults. Fig. 5 shows the performance and computation times. The KRC performs better than the KCCA and KGV in terms of the Amari error, with computation time comparable to the KGV. The Jade algorithm provides a fast and efficient way to initialize the KRC. Even though speech signals violate the i.i.d. data assumption, ICA has been applied to speech demixing with good results.

Fig. 5. Speech demixing results after 10 runs with different segments. Left: Amari errors. Right: Computation times. The KRC outperforms the KCCA and KGV in terms of Amari error, with computation times comparable to the KGV.

7.4. Image demixing

We used three 140 × 200 pixel images from the Berkeley image dataset (Martin, Fowlkes, Tal, & Malik, 2001). The mixed images were divided into four parts. The upper left part was used as training data and the lower right part as validation data. The two remaining regions were considered as test data. The RBF kernel parameter $\sigma^2 = 0.57$ was tuned using the proposed contrast function on validation data (19). Fig. 6 shows the model selection curve. The regularization parameter was fixed to $\nu = 1.00$. Fig. 7 shows the images together with the results. Note that the estimated demixing matrix computed for training data generalizes to unseen data. The use of validation points to tune the kernel parameter ensures reliability of the proposed measure of independence. The size of the matrix involved in the eigendecomposition was reduced from 84,000 × 84,000 to 56 × 56 using the incomplete Cholesky decomposition with $\eta = 1 \times 10^{-4}$. The mean Amari errors after 10 runs are 0.34 for the KCCA, 0.16 for the KGV, 0.30 for the FastICA algorithm and 0.155 for the proposed KRC.

Fig. 6. Image demixing experiment. Model selection curve for a fixed regularization parameter $\nu = 1.00$. The plot shows the proposed contrast function evaluated on validation data. The obtained tuned RBF kernel parameter is $\sigma^2 = 0.57$.

Fig. 7. Image demixing experiment. Top row: Original independent components. Middle row: Observed mixtures. The black squares show the regions used for training. Validation regions are shown as white squares. All remaining pixels are considered as test points. Bottom row: Estimated independent components using the KRC with tuned RBF kernel width $\sigma^2 = 0.57$. The regularization parameter was fixed to $\nu = 1.00$. Note that the estimated components are visually appealing, with an Amari error of 0.15.

8. Conclusions

A new kernel-based contrast function for ICA is proposed. This function, called the kernel regularized correlation (KRC), is a measure of independence in the input space and a measure of regularized correlation in the feature space induced by RBF kernels. The formulation is based on LS-SVMs and follows from a clear constrained optimization framework, providing primal-dual insights and leading to a naturally regularized measure of independence. This approach also provides out-of-sample extensions for test points, which is important for model selection ensuring statistical reliability of the measure of independence. Simulations with toy datasets show improved performance compared to existing ICA approaches. The computational complexity of the KRC compares favourably to that of similar kernel-based methods such as the KCCA and the KGV. Experiments with images and speech, using the incomplete Cholesky decomposition to solve a reduced generalized eigenvalue problem, also delivered good results.

Acknowledgments

This work was supported by grants and projects for the Research Council K.U. Leuven (GOA-Mefisto 666, GOA-Ambiorics, several Ph.D./Postdoc and fellow grants), the Flemish Government (FWO: Ph.D./Postdoc grants, projects G.0240.99, G.0211.05, G.0407.02, G.0197.02, G.0080.01, G.0141.03, G.0491.03, G.0120.03, G.0452.04, G.0499.04, G.0226.06, G.0302.07, ICCoS, ANMMM; AWI; IWT: Ph.D. grants, GBOU (McKnow), Soft4s), the Belgian Federal Government (Belgian Federal Science Policy Office: IUAP V-22; PODO-II (CP/01/40)), the EU (FP5-Quprodis, ERNSI, Eureka 2063-Impact; Eureka 2419-FLiTE) and Contract Research/Agreements (ISMC/IPCOS, Data4s, TML, Elia, LMS, IPCOS, Mastercard). Johan Suykens is a professor at the K.U. Leuven, Belgium. The scientific responsibility is assumed by its authors.

References

Akaho, S. (2001). A kernel method for canonical correlation analysis. In Proceedings of the international meeting of the psychometric society. Tokyo: Springer-Verlag.

Alzate, C., & Suykens, J. A. K. (2007). ICA through an LS-SVM based kernel CCA measure for independence. In Proceedings of the 2007 international joint conference on neural networks (pp. 2920–2925).

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems: Vol. 8 (pp. 757–763). Cambridge, MA: MIT Press.

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Cardoso, J. F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.

Edelman, A., Arias, T. A., & Smith, S. T. (1999). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 303–353.

Gittins, R. (1985). Canonical analysis—a review with applications in ecology. Berlin: Springer-Verlag.

Gretton, A., Herbrich, R., Smola, A. J., Bousquet, O., & Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6, 2075–2129.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: John Wiley and Sons.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.

Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439.

Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.

Kettenring, J. R. (1971). Canonical analysis of several sets of variables. Biometrika, 58(3), 433–451.

Lai, P. L., & Fyfe, C. (2000). Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(10), 365–377.

Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th int'l conf. computer vision: Vol. 2 (pp. 416–423).

Melzer, T., Reiter, M., & Bischof, H. (2001). Nonlinear feature extraction using generalized canonical correlation analysis. In Proceedings of the international conference on artificial neural networks. New York: Springer-Verlag.

Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least squares support vector machines. Singapore: World Scientific.

Yamanishi, Y., Vert, J., Nakaya, A., & Kanehisa, M. (2003). Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 19, i323–i330.