
Kernel Method: Data Analysis with Positive Definite Kernels. 8. Dependence analysis with covariance on RKHS. Kenji Fukumizu, The Institute of Statistical Mathematics / Graduate University for Advanced Studies / Tokyo Institute of Technology. Nov. 17-26, 2010, Intensive Course at Tokyo Institute of Technology.


Page 1:

Kernel Method: Data Analysis with Positive Definite Kernels

8. Dependence analysis with covariance on RKHS

Kenji Fukumizu

The Institute of Statistical Mathematics Graduate University for Advanced Studies /

Tokyo Institute of Technology


Nov. 17-26, 2010. Intensive Course at Tokyo Institute of Technology

Page 2:

Outline

1. Covariance operators on RKHS

2. Independence and dependence with kernels

3. Conditional independence with kernels

4. Kernel dimension reduction


Page 3:

1. Covariance operators on RKHS

2. Independence and dependence with kernels

3. Conditional independence with kernels

4. Kernel dimension reduction


Page 4:

Covariance on RKHS

(X, Y): random variable taking values in Ω_X × Ω_Y.
(H_X, k_X), (H_Y, k_Y): RKHSs with kernels on Ω_X and Ω_Y, respectively.
Assume E[k_X(X, X)] < ∞ and E[k_Y(Y, Y)] < ∞.

Cross-covariance operator:

Proposition. There is a unique operator Σ_YX : H_X → H_Y such that

  ⟨g, Σ_YX f⟩ = E[f(X)g(Y)] − E[f(X)]E[g(Y)] = Cov[f(X), g(Y)]   for all f ∈ H_X, g ∈ H_Y.

– Note: a linear map is a (1,1)-tensor. Regarded as an element of H_Y ⊗ H_X,
  Σ_YX ≡ E[Φ_Y(Y) ⊗ Φ_X(X)] − m_Y ⊗ m_X = m_{P_XY} − m_{P_X ⊗ P_Y} ∈ H_Y ⊗ H_X.
– c.f. Euclidean case: V_YX = E[YX^T] − E[Y]E[X]^T (covariance matrix), with b^T V_YX a = Cov[b^T Y, a^T X].

Page 5:

– Fact: Σ_YX is a Hilbert-Schmidt operator, and for ONBs {φ_i} of H_X and {ψ_j} of H_Y,

  ||Σ_YX||²_HS = Σ_{i=1}^∞ Σ_{j=1}^∞ ⟨ψ_j, Σ_YX φ_i⟩² = || m_{P_XY} − m_X ⊗ m_Y ||²_{H_Y ⊗ H_X}.

– Integral expression:

  (Σ_YX f)(y) = ∫ ( k_Y(y, ỹ) − E[k_Y(y, Y)] ) f(x̃) dP_XY(x̃, ỹ).

  ∵) Plug g = k_Y(·, y) into the Proposition.

Page 6:

Characterization of Independence

• Independence and the cross-covariance operator

Theorem. If the product kernel k_X k_Y is characteristic on Ω_X × Ω_Y, then

  X ⫫ Y  ⇔  Σ_XY = O.

proof)
  Σ_XY = O ⇔ m_{P_XY} = m_{P_X ⊗ P_Y} ⇔ (by the characteristic assumption) P_XY = P_X ⊗ P_Y.

– c.f. for Gaussian variables: X ⫫ Y ⇔ V_XY = O, i.e. uncorrelated.
– c.f. characteristic function: X ⫫ Y ⇔ E[e^{√−1(uX + vY)}] = E[e^{√−1 uX}] E[e^{√−1 vY}].

Page 7:

• Intuition: high-order moments

Suppose X and Y are R-valued, and k(x, u) admits the expansion

  k(x, u) = 1 + c_1 xu + c_2 x²u² + c_3 x³u³ + ⋯   (e.g. k(x, u) = exp(xu)).

W.r.t. the basis 1, u, u², u³, …, the random variables on the RKHS are expressed by

  Φ(X) = k(·, X) ~ (1, c_1 X, c_2 X², c_3 X³, …)^T,
  Φ(Y) = k(·, Y) ~ (1, c_1 Y, c_2 Y², c_3 Y³, …)^T,

and Σ_YX is expressed by the (infinite) matrix

  Σ_YX ~ ⎡ 0        0                  0                  0                 ⋯ ⎤
         ⎢ 0  c_1² Cov[X, Y]     c_2 c_1 Cov[X², Y]   c_3 c_1 Cov[X³, Y]   ⋯ ⎥
         ⎢ 0  c_1 c_2 Cov[X, Y²]  c_2² Cov[X², Y²]    c_3 c_2 Cov[X³, Y²]  ⋯ ⎥
         ⎢ 0  c_1 c_3 Cov[X, Y³]  c_2 c_3 Cov[X², Y³]  c_3² Cov[X³, Y³]    ⋯ ⎥
         ⎣ ⋮                                                                 ⎦

The operator Σ_YX contains all the high-order moments between X and Y.

Page 8:

Estimation of the Cross-covariance Operator

(X_1, Y_1), …, (X_N, Y_N): i.i.d. sample on Ω_X × Ω_Y.

An estimator of Σ_YX is defined by

  Σ̂_YX^(N) = (1/N) Σ_{i=1}^N ( k_Y(·, Y_i) − m̂_Y^(N) ) ⊗ ( k_X(·, X_i) − m̂_X^(N) ).

Theorem.

  || Σ̂_YX^(N) − Σ_YX ||_HS = O_p(1/√N)   (N → ∞).

This is a corollary to the √N-consistency of the empirical mean, because the norm in H_Y ⊗ H_X is equal to the Hilbert-Schmidt norm of the corresponding operator H_X → H_Y.
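As an illustrative check of the O_p(1/√N) rate, one can evaluate ||Σ̂_YX^(N)||_HS = √(Tr[G_X G_Y]) / N for independent X and Y, where the true operator is O. This is a minimal numpy sketch, not from the slides; the Gaussian kernel, its bandwidth, and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

def emp_cross_cov_hs_norm(x, y):
    """||Sigma_hat_YX||_HS = sqrt(Tr[G_X G_Y]) / N with centred Gaussian Gram matrices."""
    n = len(x)
    q = np.eye(n) - np.ones((n, n)) / n          # centring matrix Q_N
    d2x = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    d2y = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    gx = q @ np.exp(-d2x / 2) @ q                # G_X = Q_N K_X Q_N
    gy = q @ np.exp(-d2y / 2) @ q
    return np.sqrt(np.trace(gx @ gy)) / n

rng = np.random.default_rng(0)
norms = []
for n in (50, 200, 800):                         # N quadruples, norm should roughly halve
    x = rng.normal(size=(n, 1))
    y = rng.normal(size=(n, 1))                  # independent of x, so Sigma_YX = O
    norms.append(emp_cross_cov_hs_norm(x, y))
```

Since the population operator is O here, the empirical norm itself should shrink roughly like 1/√N as N grows.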

Page 9:

1. Covariance operators on RKHS

2. Independence and dependence with kernels

3. Conditional independence with kernels

4. Kernel dimension reduction


Page 10:

Measuring Dependence

• (In)dependence measure (HSIC, Hilbert-Schmidt Independence Criterion; Gretton et al. 2005):

  M_YX = || Σ_YX ||²_HS

• Empirical dependence measure:

  M̂_YX^(N) = || Σ̂_YX^(N) ||²_HS

M_YX = 0 ⇔ X ⫫ Y when k_X k_Y is characteristic, so M_YX and M̂_YX^(N) can be used as measures of dependence.

Page 11:

HS-norm of the Cross-covariance Operator

• Empirical estimator: Gram matrix expression

  M̂_YX^(N) = || Σ̂_YX^(N) ||²_HS
            = (1/N²) Σ_{i,j=1}^N k_X(X_i, X_j) k_Y(Y_i, Y_j)
              − (2/N³) Σ_{i,j,k=1}^N k_X(X_i, X_j) k_Y(Y_i, Y_k)
              + (1/N⁴) Σ_{i,j=1}^N k_X(X_i, X_j) Σ_{k,l=1}^N k_Y(Y_k, Y_l).

Or equivalently,

  M̂_YX^(N) = (1/N²) Tr[ G_X G_Y ],

where G_X = Q_N K_X Q_N with the centering matrix Q_N = I_N − (1/N) 1_N 1_N^T (and similarly for G_Y).

The HS-norm can be evaluated only in the subspaces Span{ k_X(·, X_i) − m̂_X^(N) } and Span{ k_Y(·, Y_i) − m̂_Y^(N) }.
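The Gram-matrix expression M̂_YX^(N) = (1/N²) Tr[G_X G_Y] can be sketched in a few lines of numpy (the Gaussian kernel and bandwidth below are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Empirical HSIC: (1/N^2) Tr[G_X G_Y] with G = Q_N K Q_N (centred Gram matrix)."""
    n = x.shape[0]
    q = np.eye(n) - np.ones((n, n)) / n          # centring matrix Q_N
    gx = q @ gaussian_gram(x, sigma) @ q
    gy = q @ gaussian_gram(y, sigma) @ q
    return np.trace(gx @ gy) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = x + 0.1 * rng.normal(size=(200, 1))   # strongly dependent on x
y_ind = rng.normal(size=(200, 1))             # independent of x
```

For a dependent pair the statistic is clearly larger than for an independent one; the centred Gram matrices are positive semidefinite, so the statistic is nonnegative.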

Page 12:

Normalized Covariance Operator

• Normalized cross-covariance operator (NOCCO):

  W_YX = Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2}

• Characterization of independence: with characteristic kernels,

  W_YX = O ⇔ X ⫫ Y.

Assume W_XY etc. are Hilbert-Schmidt.
– Dependence measure: NOCCO = || W_YX ||²_HS.

Page 13:

Kernel-free Integral Expression

Theorem (Fukumizu et al. NIPS 21, 2008)
Assume P_XY has a density p_XY(x, y), the kernels are characteristic, and W_YX is Hilbert-Schmidt. Then

  || W_YX ||²_HS = ∫∫ ( p_XY(x, y) / (p_X(x) p_Y(y)) − 1 )² p_X(x) p_Y(y) dx dy.

– A kernel-free expression, though the definitions are given by kernels!
– The RHS is the χ²-divergence (mean square contingency), a well-known dependence measure.

Page 14:

Empirical Estimator

  W_YX = Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2}   (dependence measure)

– Empirical estimation is straightforward with the empirical cross-covariance operator Σ̂_YX^(N).
– Inversion regularization: Σ̂_XX^{−1} is replaced by ( Σ̂_XX^(N) + ε_N I )^{−1}.
– Replace the covariances in W_YX by the empirical ones given by the data Φ_X(X_1), …, Φ_X(X_N) and Φ_Y(Y_1), …, Φ_Y(Y_N):

  NOCCO_emp = Tr[ R_X R_Y ],   where R_X ≡ G_X ( G_X + N ε_N I_N )^{−1},
  G_X = ( I_N − (1/N) 1_N 1_N^T ) K_X ( I_N − (1/N) 1_N 1_N^T ),   K_X = ( k_X(X_i, X_j) )_{i,j=1}^N.

– NOCCO_emp gives a new kernel estimator for the χ²-divergence. Consistency is known.
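A minimal sketch of NOCCO_emp = Tr[R_X R_Y] with R_X = G_X (G_X + N ε_N I)⁻¹; the Gaussian kernel and the value of ε_N below are illustrative assumptions:

```python
import numpy as np

def centred_gram(x, sigma=1.0):
    """Q_N K Q_N for a Gaussian kernel."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    n = len(x)
    q = np.eye(n) - np.ones((n, n)) / n
    return q @ np.exp(-d2 / (2 * sigma ** 2)) @ q

def nocco_emp(x, y, eps=1e-3):
    """Tr[R_X R_Y] with R_X = G_X (G_X + N*eps*I)^{-1} (regularised inversion)."""
    n = len(x)
    gx, gy = centred_gram(x), centred_gram(y)
    rx = gx @ np.linalg.inv(gx + n * eps * np.eye(n))
    ry = gy @ np.linalg.inv(gy + n * eps * np.eye(n))
    return np.trace(rx @ ry)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y_dep = x + 0.1 * rng.normal(size=(100, 1))   # dependent pair
y_ind = rng.normal(size=(100, 1))             # independent pair
```

R_X shares eigenvectors with G_X and has eigenvalues λ/(λ + Nε) ∈ [0, 1), so the measure is bounded and nonnegative.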

Page 15:

Independence Test with Kernels I

• Independence test with positive definite kernels
– Null hypothesis H0: X and Y are independent.
– Alternative H1: X and Y are not independent.

  M̂_YX^(N) = (1/N²) Σ_{i,j} k_X(X_i, X_j) k_Y(Y_i, Y_j) − (2/N³) Σ_{i,j,k} k_X(X_i, X_j) k_Y(Y_i, Y_k) + (1/N⁴) Σ_{i,j} k_X(X_i, X_j) Σ_{k,l} k_Y(Y_k, Y_l)

and NOCCO_emp can be used as test statistics.
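One common way to calibrate the null distribution of such a statistic is a permutation test: permuting the Y sample breaks any dependence with X. This sketch uses the HSIC statistic; the kernel, sample size, and permutation count are illustrative assumptions.

```python
import numpy as np

def hsic_stat(x, y, sigma=1.0):
    """(1/N^2) Tr[G_X G_Y] with centred Gaussian Gram matrices."""
    n = len(x)
    q = np.eye(n) - np.ones((n, n)) / n
    def gram(v):
        d2 = np.sum((v[:, None, :] - v[None, :, :]) ** 2, axis=-1)
        return q @ np.exp(-d2 / (2 * sigma ** 2)) @ q
    return np.trace(gram(x) @ gram(y)) / n ** 2

def hsic_perm_test(x, y, n_perm=100, seed=0):
    """Permutation p-value: fraction of permuted statistics exceeding the observed one."""
    rng = np.random.default_rng(seed)
    t0 = hsic_stat(x, y)
    exceed = sum(hsic_stat(x, y[rng.permutation(len(y))]) >= t0
                 for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
x = rng.normal(size=(60, 1))
y_dep = x ** 2 + 0.1 * rng.normal(size=(60, 1))   # uncorrelated with x, yet dependent
p_dep = hsic_perm_test(x, y_dep)
```

Here y_dep = x² + noise is (nearly) uncorrelated with x but strongly dependent, so the test should reject at the 5% level.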

Page 16:

Independence Test with Kernels II

• Asymptotic distribution under the null hypothesis

Theorem (Gretton et al. 2008)
If X and Y are independent, then

  N M̂_YX^(N) ⇒ Σ_{i=1}^∞ λ_i Z_i²   in law (N → ∞),

where Z_i are i.i.d. ~ N(0, 1) and {λ_i}_{i=1}^∞ are the eigenvalues of the integral operator

  ∫ h(u_a, u_b, u_c, u_d) φ_i(u_b) dP_U(u_b) dP_U(u_c) dP_U(u_d) = λ_i φ_i(u_a),

with U_a = (X_a, Y_a), k_{X,ab} = k_X(X_a, X_b), and

  h(u_a, u_b, u_c, u_d) = (1/4!) Σ_{(a',b',c',d')} ( k_{X,a'b'} k_{Y,a'b'} + k_{X,a'b'} k_{Y,c'd'} − 2 k_{X,a'b'} k_{Y,a'c'} ),

the sum running over all permutations (a', b', c', d') of (a, b, c, d).

– The proof is standard by the theory of degenerate U- (or V-) statistics (see e.g. Serfling 1980, Chapter 5).

Page 17:

Independence Test with Kernels III

• Consistency of the test

Theorem (Gretton et al. 2008)
If M_YX is not zero, then

  √N ( M̂_YX^(N) − M_YX ) ⇒ N(0, σ²)   in law (N → ∞),

where

  σ² = 16 ( E_a[ E_{b,c,d}[ h(U_a, U_b, U_c, U_d) ]² ] − M_YX² ).

Page 18:

Choice of Kernel

• How to choose a kernel?
– No definitive solutions have been proposed yet.
– For statistical tests, comparison of power or efficiency would be desirable.
– Other suggestions:
  • Make a relevant supervised problem, and use cross-validation.
  • Some heuristics:
    – Heuristic for Gaussian kernels (Gretton et al. 2007):
        σ = median{ ||X_i − X_j|| : i ≠ j }.
    – Speed of asymptotic convergence (Fukumizu et al. 2008):
        lim_{N→∞} Var[ N × HSIC_emp ] = 2 ||Σ_XX||²_HS ||Σ_YY||²_HS   under independence.
      Compare the bootstrapped variance with the theoretical one, and choose the parameter that gives the minimum discrepancy.
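The median heuristic for the Gaussian kernel bandwidth can be sketched as follows (assuming Euclidean data stored row-wise):

```python
import numpy as np

def median_heuristic(x):
    """sigma = median of pairwise distances ||x_i - x_j|| over i < j."""
    d = np.sqrt(np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1))
    iu = np.triu_indices(len(x), k=1)   # strict upper triangle: exclude the zero diagonal
    return float(np.median(d[iu]))

# e.g. points 0, 1, 2 on the line have pairwise distances {1, 1, 2}
sigma = median_heuristic(np.array([[0.0], [1.0], [2.0]]))
```

Excluding the diagonal matters: including the N zero distances would bias the median downward.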

Page 19:

Application to Independence Test

• Toy example

[Figure: three scatter plots of (X1, Y1), (X2, Y2), (X3, Y3), each over [−3, 3] × [−3, 3], for rotation angles θ = 0 (independent), θ = π/4 (dependent), and θ = π/2 (independent).]

They are all uncorrelated, but dependent for 0 < θ < π/2
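The rotation construction behind this toy example can be sketched as follows. The bimodal source distribution and noise level are illustrative assumptions; any pair of independent, sign-symmetric, non-Gaussian coordinates with equal variances works, since equal variances keep the rotated pair uncorrelated for every θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# independent, sign-symmetric, non-Gaussian coordinates with equal variances
x0 = rng.choice([-1.0, 1.0], size=n) + 0.3 * rng.normal(size=n)
y0 = rng.choice([-1.0, 1.0], size=n) + 0.3 * rng.normal(size=n)

def rotate(x, y, theta):
    c, s = np.cos(theta), np.sin(theta)
    return c * x - s * y, s * x + c * y

# theta = pi/4: the pair stays uncorrelated but becomes dependent
x1, y1 = rotate(x0, y0, np.pi / 4)
corr = np.corrcoef(x1, y1)[0, 1]               # approximately 0
corr_sq = np.corrcoef(x1 ** 2, y1 ** 2)[0, 1]  # clearly negative: dependence shows up
```

The dependence is invisible to linear correlation but visible in the squares: after the 45° rotation the four cluster centers sit on the axes, so a large |x1| forces a small |y1|.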

Page 20:

# acceptances of independence out of 100 tests (significance level = 5%). N = 200. A permutation test is used for each independence test except the contingency table. MI: mutual information estimated by the nearest-neighbor method.

Angle                      0.0  4.5  9.0  13.5  18.0  22.5   (indep. → more dependent)
HSIC (Median)               93   92   63    5     0     0
HSIC (Asymp. Var.)          93   44    1    0     0     0
NOCCO (ε = 10⁻⁴, Median)    94   23    0    0     0     0
NOCCO (ε = 10⁻⁶, Median)    92   20    1    0     0     0
NOCCO (ε = 10⁻⁸, Median)    93   15    0    0     0     0
NOCCO (Asymp. Var.)         94   11    0    0     0     0
MI (#NN = 1)                93   62   11    0     0     0
MI (#NN = 3)                96   43    0    0     0     0
MI (#NN = 5)                97   49    0    0     0     0
Power Diverg. (#Bins = 3)   96   92   43    9     1     0
Power Diverg. (#Bins = 4)   98   29    0    0     0     0
Power Diverg. (#Bins = 5)   94   60    2    0     0     0

Page 21:

• Power divergence (Ku & Fine 2005; Read & Cressie)
– Make a partition {A_j}_{j∈J}: each dimension is divided into q parts so that each bin contains almost the same number of data points.
– Power divergence:

  T_N^(λ) = (2/(λ(λ+1))) Σ_{j∈J} N p̂_j [ ( p̂_j / Π_k p̂^(k)_{j_k} )^λ − 1 ],

  where p̂_j is the relative frequency in A_j and p̂^(k)_r is the marginal frequency in the r-th interval of the k-th dimension.
  ( I_2 = mean square contingency, I_0 = MI. )
– Null distribution under independence:

  T_N^(λ) ⇒ χ² with q² − 2q + 1 (= (q − 1)²) degrees of freedom.

Page 22:

Independence Test on Text
– Data: official records of the Canadian Parliament in English and French.
  • Dependent data: 5-line-long parts from English texts and their French translations.
  • Independent data: 5-line-long parts from English texts and random 5-line parts from French texts.
– Kernels: bag-of-words and spectral kernel.

Results of permutation tests with the HS measure: acceptance rate (α = 5%) (Gretton et al. 2007)

Topic        Match    BOW(N=10)  Spec(N=10)  BOW(N=50)  Spec(N=50)
Agriculture  Random     0.94       0.95        0.93       0.95
             Same       0.18       0.00        0.00       0.00
Fishery      Random     0.94       0.94        0.93       0.95
             Same       0.20       0.00        0.00       0.00
Immigration  Random     0.96       0.91        0.94       0.95
             Same       0.09       0.00        0.00       0.00

Page 23:

Independence Test: Comparison
– Brownian distance covariance (Székely & Rizzo, AOAS 2010)
  • Independence with characteristic functions:

    X ⫫ Y ⇔ φ_XY = φ_X φ_Y,

    where φ_X(ω) = E[e^{√−1 ω^T X}], φ_Y(ξ) = E[e^{√−1 ξ^T Y}], φ_XY(ω, ξ) = E[e^{√−1 (ω^T X + ξ^T Y)}].
  • Independence measure with a weighted integral:

    ∫ | φ_XY(ω, ξ) − φ_X(ω) φ_Y(ξ) |² w(ω, ξ) dω dξ.

  • With a clever choice of the weight w, the integral reduces to an HSIC-like measure with

    k(x₁, x₂) = ||x₁|| + ||x₂|| − ||x₁ − x₂||.

Page 24:

% of acceptance of independence in permutation tests (α = 5%). HS: Hilbert-Schmidt norm with Gaussian kernel, σ = median ||X_i − X_j||. SR: Székely & Rizzo. (Gretton, F, Sriperumbudur. 2010. AOAS Discussion.)

angle:                       0     π/12   π/6   π/4   (indep. → more dependent)
dX = dY = 2, N = 128   HS   0.94   0.77   0.48  0.42
                       SR   0.95   0.83   0.66  0.65
dX = dY = 2, N = 512   HS   0.92   0.47   0.17  0.12
                       SR   0.93   0.49   0.38  0.33
dX = dY = 4, N = 1024  HS   0.92   0.60   0.35  0.23
                       SR   0.93   0.68   0.48  0.47
dX = dY = 4, N = 2048  HS   0.92   0.44   0.15  0.12
                       SR   0.94   0.46   0.29  0.27

Page 25:

Application: ICA

• Independent Component Analysis (ICA)
– Assumption:
  • m independent source signals s_1(t), …, s_m(t);
  • m observations of linearly mixed signals x_1(t), …, x_m(t):

    X(t) = A S(t),   A: m × m invertible matrix.

– Problem: restore the independent signals S from the observations X:

    Ŝ = B X,   B: m × m orthogonal matrix.

Page 26:

• ICA with the HS independence measure

X^(1), …, X^(N): i.i.d. observations (m-dimensional).

Minimize

  L(B) = Σ_{a=1}^m Σ_{b>a} M̂(Y_a, Y_b),   Y = B X.

The pairwise-independence criterion is applicable.

The objective function is non-convex, so optimization is not easy. An approximate Newton method has been proposed: Fast Kernel ICA (FastKICA, Shen et al. 2007). (Software downloadable at Arthur Gretton's homepage.)

• Other methods for ICA: see, for example, Hyvärinen et al. (2001).
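The pairwise objective L(B) = Σ_{a<b} M̂(Y_a, Y_b) can be sketched as below. The Gaussian kernel, toy sources, and mixing matrix are illustrative assumptions; a real FastKICA implementation additionally performs the approximate Newton steps over orthogonal B, which are omitted here.

```python
import numpy as np

def _centred_gram_1d(v, sigma=1.0):
    """Centred Gaussian Gram matrix of a 1-d signal."""
    d2 = (v[:, None] - v[None, :]) ** 2
    n = len(v)
    q = np.eye(n) - np.ones((n, n)) / n
    return q @ np.exp(-d2 / (2 * sigma ** 2)) @ q

def pairwise_hsic(Y):
    """L = sum_{a<b} (1/N^2) Tr[G_a G_b] over the columns of Y (candidate demixed signals)."""
    n, m = Y.shape
    gs = [_centred_gram_1d(Y[:, a]) for a in range(m)]
    return sum(np.trace(gs[a] @ gs[b]) / n ** 2
               for a in range(m) for b in range(a + 1, m))

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(300, 2))   # independent sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                   # mixing matrix
X = S @ A.T                                  # observed mixtures
```

The objective is (near) zero at the independent sources and larger at the mixtures, which is what the minimization over B exploits.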

Page 27:

• Experiments (speech signals)

[Figure: three speech signals s_1(t), s_2(t), s_3(t) are mixed by a randomly generated matrix A into observations x_1(t), x_2(t), x_3(t); FastKICA estimates the demixing matrix B.]

Page 28:

1. Covariance operators on RKHS

2. Independence and dependence with kernels

3. Conditional independence with kernels

4. Kernel dimension reduction


Page 29:

Re: Statistics on RKHS

• Linear statistics on RKHS

– Basic statistics on Euclidean space vs. on RKHS:

  Mean                    →  Kernel mean
  Covariance              →  Cross-covariance operator Σ_YX
  Conditional covariance  →  Conditional cross-covariance operator

– Plan: define the basic statistics on the RKHS and derive nonlinear/nonparametric statistical methods in the original space.

Feature map: Φ : Ω (original space) → H (RKHS), X ↦ Φ(X) = k(·, X).

Page 30:

Conditional Independence

• Definition
X, Y, Z: random variables with joint p.d.f. p_XYZ(x, y, z).
X and Y are conditionally independent given Z (X ⫫ Y | Z) if

  p_{Y|XZ}(y | x, z) = p_{Y|Z}(y | z),

or

  p_{XY|Z}(x, y | z) = p_{X|Z}(x | z) p_{Y|Z}(y | z).

With Z known, the information on X is unnecessary for inference on Y.

• Applications
– Graphical models
– Causal inference, etc.

Page 31:

Conditional Independence for Gaussian Variables

• Two characterizations (X, Y, Z jointly Gaussian):

– Conditional covariance:

  X ⫫ Y | Z ⇔ V_{XY|Z} = O,  i.e.  V_YX − V_YZ V_ZZ^{−1} V_ZX = O.

– Comparison of conditional variances:

  X ⫫ Y | Z ⇔ V_{YY|[X,Z]} = V_{YY|Z}.

Page 32:

Linear Regression and Conditional Covariance

• Review: linear regression
– X, Y: random vectors (not necessarily Gaussian) of dim p and q. Write X̃ = X − E[X], Ỹ = Y − E[Y].
– Linear regression: predict Y using a linear combination of X. Minimize the mean square error:

  min_{A: q×p matrix} E|| Ỹ − A X̃ ||².

– The residual error is given by the conditional covariance matrix:

  min_{A: q×p matrix} E|| Ỹ − A X̃ ||² = Tr[ V_{YY|X} ].

– For Gaussian variables,

  X ⫫ Y | Z ⇔ V_{YY|[X,Z]} = V_{YY|Z}

can be interpreted as: "If Z is known, X is not necessary for the linear prediction of Y."

Page 33:

Review: Conditional Covariance

• Conditional covariance of Gaussian variables
– Jointly Gaussian variable:
  Z = (X, Y), X = (X_1, …, X_p), Y = (Y_1, …, Y_q): m (= p + q)-dimensional Gaussian variable,

  Z ~ N(μ, V),   μ = (μ_X, μ_Y),   V = ⎡ V_XX  V_XY ⎤
                                       ⎣ V_YX  V_YY ⎦.

– The conditional probability of Y given X is again Gaussian:

  Y | X = x ~ N( μ_{Y|X}, V_{YY|X} ),

  with conditional mean   μ_{Y|X} ≡ E[Y | X = x] = μ_Y + V_YX V_XX^{−1} (x − μ_X),
  and conditional covariance   V_{YY|X} ≡ Var[Y | X = x] = V_YY − V_YX V_XX^{−1} V_XY.

  Note: V_{YY|X} does not depend on x; it is the Schur complement of V_XX in V.
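A quick numeric check of the Schur-complement formula; the joint covariance matrix below is a hypothetical example.

```python
import numpy as np

# hypothetical joint covariance of Z = (X, Y), with X 2-dim and Y 1-dim
V = np.array([[2.0, 0.5, 0.8],
              [0.5, 1.5, 0.3],
              [0.8, 0.3, 1.0]])
V_xx, V_xy = V[:2, :2], V[:2, 2:]
V_yx, V_yy = V[2:, :2], V[2:, 2:]

# V_{YY|X} = V_YY - V_YX V_XX^{-1} V_XY  (Schur complement of V_XX in V)
V_yy_x = V_yy - V_yx @ np.linalg.inv(V_xx) @ V_xy
```

Conditioning can only shrink the variance: here 0 < V_{YY|X} < V_YY, and the value does not depend on the observed x, matching the note above.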

Page 34:

Conditional Covariance on RKHS

• Conditional cross-covariance operator
X, Y, Z: random variables on Ω_X, Ω_Y, Ω_Z (resp.).
(H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHSs defined on Ω_X, Ω_Y, Ω_Z (resp.).

– Conditional cross-covariance operator:

  Σ_{YX|Z} ≡ Σ_YX − Σ_YZ Σ_ZZ^{−1} Σ_ZX : H_X → H_Y.

– Conditional covariance operator:

  Σ_{YY|Z} ≡ Σ_YY − Σ_YZ Σ_ZZ^{−1} Σ_ZY : H_Y → H_Y.

– Σ_ZZ^{−1} may not exist as a bounded operator, but we can justify the definitions.

Page 35:

• Decomposition of the covariance operator

  Σ_YX = Σ_YY^{1/2} W_YX Σ_XX^{1/2},

such that W_YX (the 'correlation' operator) is bounded with ||W_YX|| ≤ 1 and

  Range(W_YX) ⊂ Range(Σ_YY)‾,   Ker(W_YX)^⊥ ⊂ Range(Σ_XX)‾.

(Σ_XX^{1/2} is defined by the eigendecomposition.)

• Rigorous definitions:

  Σ_{YX|Z} ≡ Σ_YX − Σ_YY^{1/2} W_YZ W_ZX Σ_XX^{1/2},
  Σ_{YY|Z} ≡ Σ_YY − Σ_YY^{1/2} W_YZ W_ZY Σ_YY^{1/2}.

Page 36:

Conditional Covariance

• Conditional covariance is expressed by the operators.

Proposition (FBJ 2004, 2008+)
Assume k_Z is characteristic. Then

  ⟨g, Σ_{YX|Z} f⟩ = E[ Cov[f(X), g(Y) | Z] ]   (∀ f ∈ H_X, g ∈ H_Y).

In particular,

  ⟨g, Σ_{YY|Z} g⟩ = E[ Var[g(Y) | Z] ]   (∀ g ∈ H_Y).

Proof omitted. Analogy to Gaussian variables:

  b^T ( V_YX − V_YZ V_ZZ^{−1} V_ZX ) a = Cov[b^T Y, a^T X | Z],
  b^T ( V_YY − V_YZ V_ZZ^{−1} V_ZY ) b = Var[b^T Y | Z].

Page 37:

Mean Square Error Interpretation

Proposition (FBJ 2004, 2009)
Assume k_Z is characteristic. Then for all g ∈ H_Y,

  ⟨g, Σ_{YY|Z} g⟩ = E[ Var[g(Y) | Z] ] = inf_{f ∈ H_Z} E[ ( g̃(Y) − f̃(Z) )² ],

where f̃(Z) = f(Z) − E[f(Z)] and g̃(Y) = g(Y) − E[g(Y)].

c.f. for Gaussian variables,

  b^T V_{YY|Z} b = Var[b^T Y | Z] = min_a E| b^T Ỹ − a^T Z̃ |².

Page 38:

– Proof (left = right):

  E[ ( g̃(Y) − f̃(Z) )² ]
  = ⟨f, Σ_ZZ f⟩ − 2⟨g, Σ_YZ f⟩ + ⟨g, Σ_YY g⟩
  = || Σ_ZZ^{1/2} f ||² − 2⟨ Σ_YY^{1/2} g, W_YZ Σ_ZZ^{1/2} f ⟩ + || Σ_YY^{1/2} g ||²
  = || Σ_ZZ^{1/2} f − W_ZY Σ_YY^{1/2} g ||² + || Σ_YY^{1/2} g ||² − || W_ZY Σ_YY^{1/2} g ||²
  = || Σ_ZZ^{1/2} f − W_ZY Σ_YY^{1/2} g ||² + ⟨ Σ_YY^{1/2} g, ( I − W_YZ W_ZY ) Σ_YY^{1/2} g ⟩.

The first term can be made arbitrarily small by choosing f, because Range(W_ZY) ⊂ Range(Σ_ZZ)‾. The second term equals ⟨g, Σ_{YY|Z} g⟩.

Page 39:

Conditional Independence with Kernels

Theorem (FBJ 2004, 2008+)
Let Ÿ = (Y, Z).
– Assume k_Z and the product kernel k_X k_Y k_Z are characteristic. Then

  X ⫫ Y | Z ⇔ Σ_{ŸX|Z} = O.

– Assume k_Z, k_Y, and the product kernel k_X k_Z are characteristic. Then

  X ⫫ Y | Z ⇔ Σ_{YY|[X,Z]} = Σ_{YY|Z}.

– c.f. Gaussian variables:

  X ⫫ Y | Z ⇔ V_{XY|Z} = O,
  X ⫫ Y | Z ⇔ V_{YY|[X,Z]} = V_{YY|Z}.

Page 40:

– Intuition of the condition Σ_{YY|[X,Z]} = Σ_{YY|Z}:

  E[ Var[g(Y) | X, Z] ] = E[ Var[g(Y) | Z] ],

i.e. if we already know Z, the mean square error in predicting Y does not decrease when the information X is added.

In general,

  Σ_{YY|[X,Z]} ≤ Σ_{YY|Z}.

Page 41:

Empirical Estimator of the Conditional Covariance Operator

(X_1, Y_1, Z_1), …, (X_N, Y_N, Z_N)

– Empirical conditional covariance operator:

  Σ̂_{YX|Z}^(N) := Σ̂_YX^(N) − Σ̂_YZ^(N) ( Σ̂_ZZ^(N) + ε_N I )^{−1} Σ̂_ZX^(N)

  (finite-rank operators; Σ̂_YZ^(N) → Σ_YZ etc.; ( Σ̂_ZZ^(N) + ε_N I )^{−1}: regularization for the inversion).

– Estimator of the Hilbert-Schmidt norm:

  || Σ̂_{YX|Z}^(N) ||²_HS = Tr[ G_X S_Z G_Y S_Z ],

where G_X = Q_N K_X Q_N (centered Gram matrix), Q_N = I_N − (1/N) 1_N 1_N^T, and

  S_Z = I_N − G_Z ( G_Z + N ε_N I_N )^{−1} = N ε_N ( G_Z + N ε_N I_N )^{−1}.
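A sketch of the estimator Tr[G_X S_Z G_Y S_Z]. The Gaussian kernels, the value of ε_N, and the toy variables are illustrative assumptions; the point is only that the measure is larger when X and Y share variation beyond what Z explains.

```python
import numpy as np

def centred_gram(x, sigma=1.0):
    """Q_N K Q_N for a Gaussian kernel."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    n = len(x)
    q = np.eye(n) - np.ones((n, n)) / n
    return q @ np.exp(-d2 / (2 * sigma ** 2)) @ q

def cond_cov_hs(x, y, z, eps=1e-2):
    """Tr[G_X S_Z G_Y S_Z], with S_Z = N*eps*(G_Z + N*eps*I)^{-1}."""
    n = len(x)
    gx, gy, gz = centred_gram(x), centred_gram(y), centred_gram(z)
    sz = n * eps * np.linalg.inv(gz + n * eps * np.eye(n))
    return np.trace(gx @ sz @ gy @ sz)

rng = np.random.default_rng(0)
z = rng.normal(size=(150, 1))
x = z + 0.1 * rng.normal(size=(150, 1))    # X depends on Z only
y1 = z + 0.1 * rng.normal(size=(150, 1))   # Y1 is independent of X given Z
y2 = x + 0.1 * rng.normal(size=(150, 1))   # Y2 keeps a direct link to X
```

S_Z acts as a (regularized) residualizer with respect to Z, so the pair with a direct X-link scores higher than the conditionally independent one.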

Page 42:

Consistency

• Consistency of the conditional covariance operator estimator

Theorem (FBJ08, Sun et al. 07)
Assume ε_N → 0 and N ε_N → ∞. Then

  || Σ̂_{YX|Z}^(N) − Σ_{YX|Z} ||_HS → 0   (N → ∞).

In particular,

  || Σ̂_{YX|Z}^(N) ||_HS → || Σ_{YX|Z} ||_HS   (N → ∞).

Page 43:

Applications of Conditional Independence

• Conditional independence test (Fukumizu et al. 2008)
– Estimation of graphical models from data.

• Causality
– Causal relations among variables can be formulated in terms of conditional independence or Markov networks (Sun et al. 2007).
– Granger causality for time series: (X_t) is not a cause of (Y_t) if

  p( Y_t | Y_{t−1}, …, Y_{t−p}, X_{t−1}, …, X_{t−p} ) = p( Y_t | Y_{t−1}, …, Y_{t−p} ),

  i.e. Y_t ⫫ X_{t−1}, …, X_{t−p} | Y_{t−1}, …, Y_{t−p}.

• And more.

Page 44:

1. Covariance operators on RKHS

2. Independence and dependence with kernels

3. Conditional independence with kernels

4. Kernel dimension reduction


Page 45:

Dimension Reduction for Regression

– Regression: Y: response variable; X = (X_1, …, X_m): m-dimensional explanatory variable.
– Goal of dimension reduction for regression: find an effective direction for regression (EDR space):

  p(Y | X) = p̃(Y | B^T X) = p̃(Y | b_1^T X, …, b_d^T X),

  B = (b_1, …, b_d): m × d matrix (d is fixed), equivalently

  X ⫫ Y | B^T X,   with U = B^T X.

Page 46:

Kernel Dimension Reduction
(Fukumizu, Bach, Jordan JMLR 2004, AS 2009)

Use characteristic kernels for B^T X and Y. Then

  Σ_{YY|B^T X} ≥ Σ_{YY|X},   and   Σ_{YY|B^T X} = Σ_{YY|X} ⇔ X ⫫ Y | B^T X   (EDR space).

– KDR objective function:

  min_{B: B^T B = I_d} Tr[ Σ_{YY|B^T X} ].

– KDR empirical objective function:

  min_{B: B^T B = I_d} Tr[ G_Y ( G_{B^T X} + N ε_N I_N )^{−1} ].

– Equivalently, with {ψ_i}_{i=1}^∞ an ONB of H_Y,

  min_{B: B^T B = I_d} Σ_{i=1}^∞ inf_{f_i} E[ ( ψ_i(Y) − E[ψ_i(Y)] − ( f_i(B^T X) − E[f_i(B^T X)] ) )² ].
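Evaluating the empirical KDR objective Tr[G_Y (G_{B^T X} + N ε_N I)⁻¹] for candidate projections can be sketched as below. The Gaussian kernels, ε_N, and the toy regression are illustrative assumptions; a full KDR implementation minimizes this quantity over the constraint set B^T B = I_d, which is omitted here.

```python
import numpy as np

def centred_gram(x, sigma=1.0):
    """Q_N K Q_N for a Gaussian kernel."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    n = len(x)
    q = np.eye(n) - np.ones((n, n)) / n
    return q @ np.exp(-d2 / (2 * sigma ** 2)) @ q

def kdr_objective(B, X, Y, eps=1e-2):
    """Tr[ G_Y (G_{B^T X} + N*eps*I)^{-1} ]; smaller means B^T X explains Y better."""
    n = len(X)
    g_u = centred_gram(X @ B)    # Gram matrix of the projected data U = B^T X
    g_y = centred_gram(Y)
    return np.trace(g_y @ np.linalg.inv(g_u + n * eps * np.eye(n)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(100, 1))   # Y depends on X only through X_1
B_true = np.array([[1.0], [0.0], [0.0]])              # the true EDR direction
B_bad = np.array([[0.0], [0.0], [1.0]])               # an irrelevant direction
```

The objective is smaller at the true direction than at an irrelevant one, reflecting the smaller residual conditional variance of Y.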

Page 47:

KDR method

• Wide applicability of KDR
– The most general approach to dimension reduction:
  • no strong model is assumed for p(Y|X) or p(X);
  • no strong assumptions on the distribution of X and Y, or on the dimensionality/type of Y.
– Most conventional methods have some restrictions, such as the elliptic assumption on p(X) for SIR.

• Computational issues
– Non-convex objective function, possibly with local minima. → Gradient method with an annealing technique, starting from a large σ in the Gaussian RBF kernel.
– Computational cost with matrices of the sample size. → Low-rank approximation.

Page 48:

Consistency of KDR

Theorem (FBJ 2009)
Suppose k_d is bounded and continuous, and

  ε_N → 0,   N^{1/2} ε_N → ∞   (N → ∞).

Let S_0 be the set of optimal parameters:

  S_0 = { B | B^T B = I_d, Tr[ Σ_{YY|B^T X} ] = min_{B'} Tr[ Σ_{YY|B'^T X} ] }.

Estimator:

  B̂^(N) = arg min_{B: B^T B = I_d} Tr[ G_Y ( G_{B^T X} + N ε_N I_N )^{−1} ].

Then, under some conditions, for any open set U ⊃ S_0,

  Pr( B̂^(N) ∈ U ) → 1   (N → ∞).

Page 49:

Numerical Results with KDR

• Synthetic data

  Y = X_1 / ( 0.5 + (X_2 + 1.5)² ) + W,

where X ~ N(0, I_4) (dim 4) and W ~ N(0, τ²), τ = 0.1, 0.4, 0.8.

Frobenius norms between the estimator and the true one over 100 samples (means and standard deviations); sample size N = 100.

  τ     KDR          SIR          SAVE         pHd
  0.1   0.11 ± 0.07  0.55 ± 0.28  0.77 ± 0.35  1.04 ± 0.34
  0.4   0.17 ± 0.09  0.60 ± 0.27  0.82 ± 0.34  1.03 ± 0.33
  0.8   0.34 ± 0.22  0.69 ± 0.25  0.94 ± 0.35  1.06 ± 0.33

Page 50:

• Wine data
13 dim., 178 data points. Y = 3 class labels; 2-dim. projections.

[Figure: 2-dimensional projections of the wine data found by CCA, Partial Least Squares, Sliced Inverse Regression, and KDR; each plot ranges over [−20, 20] × [−20, 20].]

Gaussian kernel: k(z_1, z_2) = exp( −||z_1 − z_2||² / σ² ).

Page 51:

Summary

• Dependence analysis with RKHS
– Covariance and conditional covariance operators on RKHS can capture the (in)dependence and conditional (in)dependence of random variables.
– Simple estimators are available for the Hilbert-Schmidt norms of the operators.
– If the normalized covariance operator is used, the Hilbert-Schmidt norm is independent of the kernel (it equals the χ²-divergence), provided the kernel is characteristic.
– Statistical tests of independence and conditional independence are possible with the kernel measures.
– Applications: dimension reduction for regression (FBJ04, FBJ09), causal inference (Sun et al. 2007).

Page 52:

References

Fukumizu, K., F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics 37(4), pp.1871-1905 (2009).

Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf: Kernel Measures of Conditional Dependence. Advances in Neural Information Processing Systems 21, 489-496, MIT Press (2008).

Fukumizu, K., Bach, F.R., and Jordan, M.I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research. 5(Jan):73-99, 2004.

Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and Alex Smola. A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems 20. 585-592. MIT Press 2008.

Gretton, A., K. M. Borgwardt, M. Rasch, B. Schölkopf and A. Smola: A Kernel Method for the Two-Sample-Problem. Advances in Neural Information Processing Systems 19, 513-520. 2007.

Gretton, A., O. Bousquet, A. Smola and B. Schölkopf. Measuring Statistical Dependence with Hilbert-Schmidt Norms. Proc. Algorithmic Learning Theory (ALT2005), 63-78. 2005.


Page 53:

Shen, H., S. Jegelka and A. Gretton: Fast Kernel ICA using an Approximate Newton Method. AISTATS 2007.

Serfling, R. J. Approximation Theorems of Mathematical Statistics. Wiley-Interscience 1980.

Sun, X., Janzing, D. Schölkopf, B., and Fukumizu, K.: A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML2007), 855-862 (2007)
