Kernel Method: Data Analysis with Positive Definite Kernels
8. Dependence analysis with covariance on RKHS
Kenji Fukumizu
The Institute of Statistical Mathematics / Graduate University for Advanced Studies / Tokyo Institute of Technology
Nov. 17-26, 2010. Intensive Course at Tokyo Institute of Technology
Outline
1. Covariance operators on RKHS
2. Independence and dependence with kernels
3. Conditional independence with kernels
4. Kernel dimension reduction
1. Covariance operators on RKHS
2. Independence and dependence with kernels
3. Conditional independence with kernels
4. Kernel dimension reduction
Covariance on RKHS

(X, Y): random variable taking values on Ω_X × Ω_Y.
(H_X, k_X), (H_Y, k_Y): RKHS with kernels on Ω_X and Ω_Y, resp.
Assume E[k_X(X, X)] < ∞ and E[k_Y(Y, Y)] < ∞.

Cross-covariance operator: Σ_YX : H_X → H_Y,

  Σ_YX ≡ E[Φ_Y(Y) ⊗ Φ_X(X)] − m_Y ⊗ m_X ∈ H_Y ⊗ H_X,   i.e.  Σ_YX = m_{P_XY} − m_{P_X ⊗ P_Y}.

– Note: a linear map is a (1,1)-tensor.
– c.f. Euclidean case: V_YX = E[YX^T] − E[Y]E[X]^T : covariance matrix, with b^T V_YX a = Cov[a^T X, b^T Y].

Proposition
  ⟨g, Σ_YX f⟩ = E[g(Y)f(X)] − E[g(Y)]E[f(X)] = Cov[f(X), g(Y)]   for all f ∈ H_X, g ∈ H_Y.

– Fact: Σ_YX is a Hilbert-Schmidt operator, and
    ‖Σ_YX‖²_HS = Σ_{i=1}^∞ Σ_{j=1}^∞ ⟨ψ_j, Σ_YX φ_i⟩² = ‖m_XY − m_X ⊗ m_Y‖²_{H_X ⊗ H_Y}.
– Integral expression:
    (Σ_YX f)(y) = ∫ ( k_Y(y, ỹ) − E[k_Y(y, Y)] ) f(x̃) dP_XY(x̃, ỹ).
  ∵) Plug g = k_Y(y, ·) into the Proposition.
Characterization of Independence

• Independence and cross-covariance operator

Theorem
If the product kernel k_X k_Y is characteristic on Ω_X × Ω_Y, then
  X ⊥ Y  ⇔  Σ_XY = O.

proof)
  Σ_XY = O ⇔ m_{P_XY} = m_{P_X ⊗ P_Y} ⇔ (by the characteristic assumption) P_XY = P_X ⊗ P_Y.

– c.f. for Gaussian variables: X ⊥ Y ⇔ V_XY = O, i.e. uncorrelated.
– c.f. characteristic function:
    X ⊥ Y ⇔ E_XY[e^{√−1(uX + vY)}] = E_X[e^{√−1 uX}] E_Y[e^{√−1 vY}].
• Intuition: high-order moments
Suppose X and Y are R-valued, and k(x, u) admits the expansion
  k(x, u) = 1 + c_1 xu + c_2 x²u² + c_3 x³u³ + ⋯   (e.g. k(x, u) = exp(xu)).
W.r.t. the basis 1, u, u², u³, …, the random variables on the RKHS are expressed by
  Φ_X(X) = k(u, X) ~ (1, c_1 X, c_2 X², c_3 X³, …)^T,
  Φ_Y(Y) = k(u, Y) ~ (1, c_1 Y, c_2 Y², c_3 Y³, …)^T,
and
  Σ_YX ~ [ 0    0                 0                  0                  ⋯ ]
         [ 0    c_1c_1 Cov[X,Y]   c_2c_1 Cov[X²,Y]   c_3c_1 Cov[X³,Y]   ⋯ ]
         [ 0    c_1c_2 Cov[X,Y²]  c_2c_2 Cov[X²,Y²]  c_3c_2 Cov[X³,Y²]  ⋯ ]
         [ 0    c_1c_3 Cov[X,Y³]  c_2c_3 Cov[X²,Y³]  c_3c_3 Cov[X³,Y³]  ⋯ ]
         [ ⋮ ]
The operator Σ_YX contains all the high-order moments between X and Y.
Estimation of Cross-covariance Operator

(X_1, Y_1), …, (X_N, Y_N): i.i.d. sample on Ω_X × Ω_Y.

An estimator of Σ_YX is defined by
  Σ̂_YX^(N) = (1/N) Σ_{i=1}^N ( k_Y(·, Y_i) − m̂_Y^(N) ) ⊗ ( k_X(·, X_i) − m̂_X^(N) ).

Theorem
  ‖Σ̂_YX^(N) − Σ_YX‖_HS = O_p(1/√N)   (N → ∞).

Corollary to the √N-consistency of the empirical mean, because the norm in H_X ⊗ H_Y is equal to the Hilbert-Schmidt norm of the corresponding operator H_X → H_Y.
1. Covariance operators on RKHS
2. Independence and dependence with kernels
3. Conditional independence with kernels
4. Kernel dimension reduction
Measuring Dependence

• (In)dependence measure (HSIC, Hilbert-Schmidt Independence Criterion; Gretton et al. 2005):
    M_YX = ‖Σ_YX‖²_HS.
• Empirical dependence measure:
    M̂_YX^(N) = ‖Σ̂_YX^(N)‖²_HS.

M_YX and M̂_YX^(N) can be used as measures of dependence:
  M_YX = 0 ⇔ X ⊥ Y,   with k_X k_Y characteristic.
HS-norm of Cross-covariance Operator

• Empirical estimator:
    M̂_YX^(N) = ‖Σ̂_YX^(N)‖²_HS
              = (1/N²) Σ_{i,j=1}^N k_X(X_i, X_j) k_Y(Y_i, Y_j)
                − (2/N³) Σ_{i,j,k=1}^N k_X(X_i, X_j) k_Y(Y_i, Y_k)
                + (1/N⁴) Σ_{i,j=1}^N k_X(X_i, X_j) Σ_{k,l=1}^N k_Y(Y_k, Y_l).

  The HS-norm can be evaluated only in the subspaces
    Span{ k_X(·, X_i) − m̂_X^(N) }_{i=1}^N   and   Span{ k_Y(·, Y_i) − m̂_Y^(N) }_{i=1}^N.

• Gram matrix expression: equivalently,
    M̂_YX^(N) = (1/N²) Tr[ G_X G_Y ],
  where G_X = Q_N K_X Q_N,  G_Y = Q_N K_Y Q_N,  Q_N = I_N − (1/N) 1_N 1_N^T.
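The Gram matrix expression translates directly into code. Below is a minimal sketch, not from the slides: the helper name gram_gauss, the Gaussian RBF kernel choice, and the bandwidths are illustrative assumptions.

```python
import numpy as np

def gram_gauss(Z, sigma):
    """Gaussian RBF Gram matrix: k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum(Z**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # squared pairwise distances
    return np.exp(-D2 / (2.0 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC: M_hat = Tr[G_X G_Y] / N^2 with centered Gram matrices."""
    N = X.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N              # centering matrix Q_N
    G_X = Q @ gram_gauss(X, sigma_x) @ Q
    G_Y = Q @ gram_gauss(Y, sigma_y) @ Q
    return np.trace(G_X @ G_Y) / N**2

# A dependent pair gives a clearly larger value than an independent pair:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
print(hsic(X, X**2 + 0.1 * rng.normal(size=(200, 1))))  # dependent
print(hsic(X, rng.normal(size=(200, 1))))               # independent, near 0
```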
Normalized Covariance Operator

• Normalized cross-covariance operator (NOCCO):
    W_YX = Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2}.
• Characterization of independence
  Assume W_XY etc. are Hilbert-Schmidt. With characteristic kernels,
    W_YX = O ⇔ X ⊥ Y.
– Dependence measure:
    NOCCO = ‖W_YX‖²_HS.
Kernel-free Integral Expression

Theorem (Fukumizu et al. NIPS 21, 2008)
Assume P_XY has a density p_XY(x, y), the kernel for H_X ⊗ H_Y is characteristic, and W_YX is Hilbert-Schmidt. Then,
  ‖W_YX‖²_HS = ∫∫ ( p_XY(x, y) / (p_X(x) p_Y(y)) − 1 )² p_X(x) p_Y(y) dx dy.

– Kernel-free expression, though the definitions are given by kernels!
– The RHS is the χ²-divergence (mean square contingency), a well-known dependence measure.
Empirical Estimator

– Empirical estimation is straightforward with the empirical cross-covariance operator Σ̂_YX^(N).
– Inversion regularization:
    Σ̂_XX^{−1} → ( Σ̂_XX^(N) + ε_N I )^{−1}.
– Replace the covariances in W_YX = Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2} (dependence measure) by the empirical ones given by the data Φ_X(X_1), …, Φ_X(X_N) and Φ_Y(Y_1), …, Φ_Y(Y_N):
    NOCCO_emp = Tr[ R_X R_Y ],
  where
    R_X ≡ G_X ( G_X + Nε_N I )^{−1}   (and R_Y analogously),
    G_X = ( I_N − (1/N) 1_N 1_N^T ) K_X ( I_N − (1/N) 1_N 1_N^T ),   K_X = ( k_X(X_i, X_j) )_{i,j=1}^N.
– NOCCO_emp gives a new kernel estimator for the χ²-divergence. Consistency is known.
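As a minimal sketch of NOCCO_emp = Tr[R_X R_Y], reusing the hypothetical gram_gauss helper from the HSIC sketch above; eps stands for the regularization coefficient ε_N.

```python
import numpy as np

def nocco(X, Y, eps=1e-4, sigma_x=1.0, sigma_y=1.0):
    """Empirical NOCCO: Tr[R_X R_Y], with R = G (G + N*eps*I)^(-1)."""
    N = X.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N
    G_X = Q @ gram_gauss(X, sigma_x) @ Q
    G_Y = Q @ gram_gauss(Y, sigma_y) @ Q
    R_X = G_X @ np.linalg.inv(G_X + N * eps * np.eye(N))
    R_Y = G_Y @ np.linalg.inv(G_Y + N * eps * np.eye(N))
    return np.trace(R_X @ R_Y)
```

For large N, np.linalg.solve or a Cholesky factorization would be preferable to the explicit inverse; the inverse is kept here only to mirror the formula.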
Independence Test with Kernels I

• Independence test with positive definite kernels
  – Null hypothesis H0: X and Y are independent.
  – Alternative H1: X and Y are not independent.

  M̂_YX^(N) = ‖Σ̂_YX^(N)‖²_HS
            = (1/N²) Σ_{i,j=1}^N k_X(X_i, X_j) k_Y(Y_i, Y_j)
              − (2/N³) Σ_{i,j,k=1}^N k_X(X_i, X_j) k_Y(Y_i, Y_k)
              + (1/N⁴) Σ_{i,j=1}^N k_X(X_i, X_j) Σ_{k,l=1}^N k_Y(Y_k, Y_l)

  and NOCCO_emp can be used as test statistics.
Independence Test with Kernels II

• Asymptotic distribution under the null hypothesis

Theorem (Gretton et al. 2008)
If X and Y are independent, then
  N M̂_YX^(N) ⇒ Σ_{i=1}^∞ λ_i Z_i²   in law (N → ∞),
where Z_i: i.i.d. ~ N(0, 1), and (λ_i)_{i=1}^∞ are the eigenvalues of the integral operator
  ∫ h(u_a, u_b, u_c, u_d) φ_i(u_b) dP_U(u_b) dP_U(u_c) dP_U(u_d) = λ_i φ_i(u_a),
with U_a = (X_a, Y_a), k_X,ab = k_X(X_a, X_b), and
  h(U_a, U_b, U_c, U_d) = (1/4!) Σ_{(a,b,c,d)} ( k_X,ab k_Y,ab + k_X,ab k_Y,cd − 2 k_X,ab k_Y,ac ),
where the sum runs over the permutations of the four indices.

– The proof is standard by the theory of degenerate U- (or V-) statistics (see e.g. Serfling 1980, Chapter 5).
Independence Test with Kernels III

• Consistency of the test

Theorem (Gretton et al. 2008)
If M_YX is not zero, then
  √N ( M̂_YX^(N) − M_YX ) ⇒ N(0, σ²)   in law (N → ∞),
where
  σ² = 16 ( E_{U_a}[ ( E_{U_b,U_c,U_d}[ h(U_a, U_b, U_c, U_d) ] )² ] − M_YX² ).
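Since the null distribution depends on the unknown eigenvalues λ_i, a permutation test is a common practical alternative (it is also what the experiments below use). A minimal sketch with the hsic function from the earlier sketch; n_perm and alpha are illustrative choices.

```python
import numpy as np

def hsic_perm_test(X, Y, n_perm=200, alpha=0.05, seed=0):
    """Permutation test: permuting Y breaks the pairing, simulating H0."""
    rng = np.random.default_rng(seed)
    stat = hsic(X, Y)
    null = np.array([hsic(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= stat)) / (1 + n_perm)
    return stat, p_value, bool(p_value < alpha)   # True = reject independence
```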
Choice of Kernel

• How to choose a kernel?
  – No definitive solutions have been proposed yet.
  – For statistical tests, comparison of power or efficiency would be desirable.
  – Other suggestions:
    • Make a relevant supervised problem and use cross-validation.
    • Some heuristics:
      – Median heuristic for Gaussian kernels (Gretton et al. 2007):
          σ = median{ ‖X_i − X_j‖ : i ≠ j }.
      – Speed of asymptotic convergence (Fukumizu et al. 2008):
          lim_{N→∞} N² Var[ HSIC_emp^(N) ] = 2 ‖Σ_XX‖²_HS ‖Σ_YY‖²_HS   under independence.
        Compare the bootstrapped variance and the theoretical one, and choose the parameter giving the minimum discrepancy.
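A sketch of the median heuristic (the function name is ours):

```python
import numpy as np

def median_heuristic(X):
    """sigma = median of pairwise distances ||X_i - X_j||, i != j."""
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    D = np.sqrt(D2)
    return np.median(D[np.triu_indices_from(D, k=1)])  # strictly upper triangle
```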
Application to Independence Test

• Toy example

[Figure: scatter plots of (X1, Y1), (X2, Y2), (X3, Y3): the same uncorrelated data rotated by θ = 0 (independent), θ = π/4 (dependent), and θ = π/2 (independent).]
They are all uncorrelated, but dependent for 0 < θ < π/2
Number of acceptances of independence out of 100 tests (significance level = 5%); N = 200. A permutation test is used for each method except the contingency table. MI: mutual information estimated by the nearest-neighbor method. Columns run from independent (angle 0.0) to more dependent.

Angle                        0.0   4.5   9.0  13.5  18.0  22.5
HSIC (Median)                 93    92    63     5     0     0
HSIC (Asymp. Var.)            93    44     1     0     0     0
NOCCO (ε = 10⁻⁴, Median)      94    23     0     0     0     0
NOCCO (ε = 10⁻⁶, Median)      92    20     1     0     0     0
NOCCO (ε = 10⁻⁸, Median)      93    15     0     0     0     0
NOCCO (Asymp. Var.)           94    11     0     0     0     0
MI (#NN = 1)                  93    62    11     0     0     0
MI (#NN = 3)                  96    43     0     0     0     0
MI (#NN = 5)                  97    49     0     0     0     0
Power Diverg. (#Bins = 3)     96    92    43     9     1     0
Power Diverg. (#Bins = 4)     98    29     0     0     0     0
Power Diverg. (#Bins = 5)     94    60     2     0     0     0
• Power divergence (Ku & Fine 2005; Read & Cressie)
  – Make a partition {A_j}_{j∈J}: each dimension is divided into q parts so that each bin contains almost the same number of data.
  – Power-divergence statistic:
      T_N^(λ) = 2/(λ(λ+1)) Σ_{j∈J} N p̂_j [ ( p̂_j / Π_k p̂_j^(k) )^λ − 1 ],
    where p̂_j is the frequency in A_j and p̂_r^(k) is the marginal frequency in the r-th interval of the k-th variable.
    I_2 = mean square contingency, I_0 = MI.
  – Null distribution under independence:
      T_N^(λ) ⇒ χ²_{q²−2q+1}   (N → ∞),   with q² − 2q + 1 = (q − 1)².
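A minimal sketch of this competitor for two scalar variables with quantile binning; the helper names are ours, and the χ² degrees of freedom (q − 1)² follow the reconstruction above (λ = 2 corresponds to the mean square contingency).

```python
import numpy as np
from scipy.stats import chi2

def power_divergence_test(x, y, q=4, lam=2.0):
    """Cressie-Read power divergence for independence of two scalar samples."""
    N = len(x)
    edges = np.linspace(0, 1, q + 1)[1:-1]           # interior quantile levels
    bx = np.searchsorted(np.quantile(x, edges), x)   # bin indices 0..q-1
    by = np.searchsorted(np.quantile(y, edges), y)
    joint = np.zeros((q, q))
    np.add.at(joint, (bx, by), 1.0)
    p = joint / N                                    # joint bin frequencies p_hat
    px = p.sum(axis=1, keepdims=True)                # marginal frequencies
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0                                     # lam must not be 0 or -1 here
    T = 2.0 / (lam * (lam + 1.0)) * np.sum(
        N * p[mask] * ((p[mask] / (px @ py)[mask]) ** lam - 1.0))
    return T, chi2.sf(T, df=(q - 1) ** 2)            # approximate p-value
```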
Independence Test on Text

– Data: official records of the Canadian Parliament in English and French.
  • Dependent data: 5-line-long parts from English texts and their French translations.
  • Independent data: 5-line-long parts from English texts and random 5-line parts from French texts.
– Kernel: bag-of-words and spectral kernel.

Results of permutation tests with the HS measure: acceptance rate (α = 5%) (Gretton et al. 2007).

Topic        Match    BOW(N=10)  Spec(N=10)  BOW(N=50)  Spec(N=50)
Agriculture  Random   0.94       0.95        0.93       0.95
             Same     0.18       0.00        0.00       0.00
Fishery      Random   0.94       0.94        0.93       0.95
             Same     0.20       0.00        0.00       0.00
Immigration  Random   0.96       0.91        0.94       0.95
             Same     0.09       0.00        0.00       0.00
Independence Test: Comparison

– Brownian distance covariance (Székely & Rizzo, AOAS 2010)
  • Independence with characteristic functions:
      X ⊥ Y ⇔ φ_XY = φ_X φ_Y,
    where φ_X(ω) = E[e^{√−1 ω^T X}], φ_Y(ξ) = E[e^{√−1 ξ^T Y}], φ_XY(ω, ξ) = E[e^{√−1 (ω^T X + ξ^T Y)}].
  • Independence measure with a weighted integral:
      ∫ | φ_XY(ω, ξ) − φ_X(ω) φ_Y(ξ) |² w(ω, ξ) dω dξ.
  • With a clever choice of the weight w, the integral is reduced to an HSIC-like measure with the distance-induced kernel
      k(x_1, x_2) = −‖x_1 − x_2‖.
% of acceptances of independence in permutation tests (α = 5%) (Gretton, F., Sriperumbudur. 2010. AOAS Discussion). HS: Hilbert-Schmidt norm with Gaussian kernel, σ = median‖X_i − X_j‖. SR: Székely & Rizzo. Columns run from independent (angle 0) to more dependent.

angle                       0     π/12   π/6   π/4
dX = dY = 2, N = 128    HS  0.94  0.77   0.48  0.42
                        SR  0.95  0.83   0.66  0.65
dX = dY = 2, N = 512    HS  0.92  0.47   0.17  0.12
                        SR  0.93  0.49   0.38  0.33
dX = dY = 4, N = 1024   HS  0.92  0.60   0.35  0.23
                        SR  0.93  0.68   0.48  0.47
dX = dY = 4, N = 2048   HS  0.92  0.44   0.15  0.12
                        SR  0.94  0.46   0.29  0.27
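The reduction to an HSIC-like measure can be checked directly: double-centering the negative distance matrices and taking the normalized trace yields a (squared) distance-covariance-type statistic. A minimal sketch (names ours):

```python
import numpy as np

def dist_gram(Z):
    """Distance-induced 'kernel' matrix: k(z_i, z_j) = -||z_i - z_j||."""
    sq = np.sum(Z**2, axis=1)
    return -np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))

def dcov_stat(X, Y):
    """HSIC-like statistic Tr[G_X G_Y] / N^2 with the distance kernel."""
    N = X.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N
    G_X = Q @ dist_gram(X) @ Q    # double-centered (negative) distance matrix
    G_Y = Q @ dist_gram(Y) @ Q
    return np.trace(G_X @ G_Y) / N**2
```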
Application: ICA

• Independent Component Analysis (ICA)
  – Assumption:
    • m independent source signals s_1(t), …, s_m(t);
    • m observations of linearly mixed signals x_1(t), …, x_m(t):
        X(t) = A S(t),   A: m × m invertible matrix.
  – Problem:
    • Restore the independent signals S from the observations X:
        Ŝ = BX,   B: m × m orthogonal matrix.
• ICA with the HS independence measure

  X^(1), …, X^(N): i.i.d. observations (m-dimensional).

  Minimize
    L(B) = Σ_{a=1}^m Σ_{b>a} M̂(Y_a, Y_b),   Y = BX.

  The pairwise-independence criterion is applicable.
  The objective function is non-convex, so optimization is not easy. An approximate Newton method has been proposed:
  Fast Kernel ICA (FastKICA, Shen et al. 2007). (Software downloadable at Arthur Gretton’s homepage.)

• Other methods for ICA: see, for example, Hyvärinen et al. (2001).
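A sketch of the pairwise contrast L(B) only (the approximate Newton optimization of FastKICA is not shown); hsic is the function from the earlier sketch, and the data layout X: (m, N) is an assumption.

```python
import numpy as np

def ica_contrast(B, X, sigma=1.0):
    """Pairwise HSIC contrast L(B) = sum_{a<b} M_hat(Y_a, Y_b), Y = B X."""
    Y = B @ X                     # recovered signals, one row per component
    m = Y.shape[0]
    return sum(hsic(Y[a][:, None], Y[b][:, None], sigma, sigma)
               for a in range(m) for b in range(a + 1, m))
```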
• Experiments (speech signal)

[Figure: three speech signals s_1(t), s_2(t), s_3(t) mixed by a randomly generated matrix A into x_1(t), x_2(t), x_3(t), then demixed by B estimated with FastKICA.]
1. Covariance operators on RKHS
2. Independence and dependence with kernels
3. Conditional independence with kernels
4. Kernel dimension reduction
Re: Statistics on RKHS

• Linear statistics on RKHS
  – Basic statistics:
      on Euclidean space         on RKHS
      Mean                       Kernel mean
      Covariance                 Cross-covariance operator
      Conditional covariance     Conditional cross-covariance operator
  – Plan: define the basic statistics on RKHS and derive nonlinear/nonparametric statistical methods in the original space.

[Figure: the feature map Φ: Ω (original space) → H (RKHS), X ↦ Φ(X) = k(·, X); the operator Σ_YX is defined on the RKHS.]
Conditional Independence

• Definition
  X, Y, Z: random variables with joint p.d.f. p_XYZ(x, y, z).
  X and Y are conditionally independent given Z, if
    p_{Y|XZ}(y | x, z) = p_{Y|Z}(y | z),
  or
    p_{XY|Z}(x, y | z) = p_{X|Z}(x | z) p_{Y|Z}(y | z).

  With Z known, the information of X is unnecessary for the inference on Y.

• Applications
  – Graphical models
  – Causal inference, etc.
Conditional Independence for Gaussian Variables

• Two characterizations (X, Y, Z are Gaussian)
  – Conditional covariance:
      X ⊥ Y | Z ⇔ V_XY|Z = O,   i.e.   V_YX − V_YZ V_ZZ^{−1} V_ZX = O.
  – Comparison of conditional variances:
      X ⊥ Y | Z ⇔ V_YY|[X,Z] = V_YY|Z.
Linear Regression and Conditional Covariance

• Review: linear regression
  – X, Y: random vectors (not necessarily Gaussian) of dim p and q.
      X̃ = X − E[X],   Ỹ = Y − E[Y].
  – Linear regression: predict Y using a linear combination of X.
    Minimize the mean square error:
      min_{A: q×p matrix} E‖Ỹ − AX̃‖².
  – The residual error is given by the conditional covariance matrix:
      min_{A: q×p matrix} E‖Ỹ − AX̃‖² = Tr[ V_YY|X ].
  – For Gaussian variables,
      X ⊥ Y | Z ⇔ V_YY|[X,Z] = V_YY|Z
    can be interpreted as: "If Z is known, X is not necessary for the linear prediction of Y."
Review: Conditional Covariance

• Conditional covariance of Gaussian variables
  – Jointly Gaussian variable:
      Z = (X, Y),   X = (X_1, …, X_p),   Y = (Y_1, …, Y_q):
    m (= p + q) dimensional Gaussian variable,
      Z ~ N(μ, V),   μ = (μ_X, μ_Y),   V = [ V_XX  V_XY ; V_YX  V_YY ].
  – The conditional probability of Y given X is again Gaussian:
      Y | X = x ~ N(μ_Y|X, V_YY|X),
    with
      cond. mean:        μ_Y|X ≡ E[Y | X = x] = μ_Y + V_YX V_XX^{−1} (x − μ_X),
      cond. covariance:  V_YY|X ≡ Var[Y | X = x] = V_YY − V_YX V_XX^{−1} V_XY.
    Note: V_YY|X does not depend on x; it is the Schur complement of V_XX in V.
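A quick numerical sanity check of the Schur-complement formula (a sketch; the random covariance construction is ours): the covariance of the residual of the best linear predictor of Y from X equals V_YY|X.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, N = 2, 2, 200_000
A = rng.normal(size=(p + q, p + q))
V = A @ A.T                                        # a random covariance matrix
Z = rng.multivariate_normal(np.zeros(p + q), V, size=N)
X, Y = Z[:, :p], Z[:, p:]

V_XX, V_XY = V[:p, :p], V[:p, p:]
V_YX, V_YY = V[p:, :p], V[p:, p:]
schur = V_YY - V_YX @ np.linalg.inv(V_XX) @ V_XY   # V_YY|X

resid = Y - X @ np.linalg.inv(V_XX) @ V_XY         # Y - E[Y|X] (zero means)
print(schur)              # matches the empirical residual covariance below
print(np.cov(resid.T))
```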
Conditional Covariance on RKHS

• Conditional cross-covariance operator
  X, Y, Z: random variables on Ω_X, Ω_Y, Ω_Z (resp.).
  (H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHS defined on Ω_X, Ω_Y, Ω_Z (resp.).
  – Conditional cross-covariance operator:
      Σ_YX|Z ≡ Σ_YX − Σ_YZ Σ_ZZ^{−1} Σ_ZX : H_X → H_Y.
  – Conditional covariance operator:
      Σ_YY|Z ≡ Σ_YY − Σ_YZ Σ_ZZ^{−1} Σ_ZY : H_Y → H_Y.
  – Σ_ZZ^{−1} may not exist as a bounded operator, but the definitions can be justified as follows.
• Decomposition of the covariance operator:
    Σ_YX = Σ_YY^{1/2} W_YX Σ_XX^{1/2},
  such that W_YX is a bounded operator with ‖W_YX‖ ≤ 1,
    Range(W_YX) ⊂ closure(Range(Σ_YY)),   Ker(W_YX)^⊥ ⊂ closure(Range(Σ_XX)).
  W_YX is the 'correlation' operator; Σ_XX^{1/2} is defined by the eigendecomposition.
• Rigorous definitions:
    Σ_YX|Z ≡ Σ_YY^{1/2} ( W_YX − W_YZ W_ZX ) Σ_XX^{1/2},
    Σ_YY|Z ≡ Σ_YY^{1/2} ( I − W_YZ W_ZY ) Σ_YY^{1/2}.
Conditional Covariance

• Conditional covariance is expressed by the operators.

Proposition (FBJ 2004, 2008+)
Assume k_Z is characteristic. Then
  ⟨g, Σ_YX|Z f⟩ = E[ Cov[f(X), g(Y) | Z] ]   (∀f ∈ H_X, g ∈ H_Y).
In particular,
  ⟨g, Σ_YY|Z g⟩ = E[ Var[g(Y) | Z] ]   (∀g ∈ H_Y).
Proof omitted. Analogy to Gaussian variables:
  b^T ( V_YX − V_YZ V_ZZ^{−1} V_ZX ) a = Cov[a^T X, b^T Y | Z],
  b^T ( V_YY − V_YZ V_ZZ^{−1} V_ZY ) b = Var[b^T Y | Z].
Mean Square Error Interpretation

Proposition (FBJ 2004, 2009)
Assume k_Z is characteristic. Then
  ⟨g, Σ_YY|Z g⟩ = E[ Var[g(Y) | Z] ] = inf_{f ∈ H_Z} E[ ( g̃(Y) − f̃(Z) )² ]   (∀g ∈ H_Y),
where
  f̃(Z) = f(Z) − E[f(Z)],   g̃(Y) = g(Y) − E[g(Y)].

c.f. for Gaussian variables:
  b^T V_YY|Z b = Var[b^T Y | Z] = min_a E( b^T Ỹ − a^T Z̃ )².
– Proof (left = right):
  E[ ( (g(Y) − E[g(Y)]) − (f(Z) − E[f(Z)]) )² ]
    = ⟨f, Σ_ZZ f⟩ − 2⟨f, Σ_ZY g⟩ + ⟨g, Σ_YY g⟩
    = ‖Σ_ZZ^{1/2} f‖² − 2⟨Σ_ZZ^{1/2} f, W_ZY Σ_YY^{1/2} g⟩ + ‖Σ_YY^{1/2} g‖²
    = ‖Σ_ZZ^{1/2} f − W_ZY Σ_YY^{1/2} g‖² + ‖Σ_YY^{1/2} g‖² − ‖W_ZY Σ_YY^{1/2} g‖²
    = ‖Σ_ZZ^{1/2} f − W_ZY Σ_YY^{1/2} g‖² + ⟨g, Σ_YY^{1/2}( I − W_YZ W_ZY )Σ_YY^{1/2} g⟩.
  The first term can be made arbitrarily small by choosing f, because Range(W_ZY) ⊂ closure(Range(Σ_ZZ)); the second term is ⟨g, Σ_YY|Z g⟩.
Conditional Independence with Kernels

Theorem (FBJ 2004, 2008+)
Assume k_Z and the product kernel k_X k_Y k_Z are characteristic. Then
  X ⊥ Y | Z ⇔ Σ_ŸX|Z = O,   where Ÿ = (Y, Z).
Assume k_Z, k_Y, and the product kernel k_X k_Z are characteristic. Then
  X ⊥ Y | Z ⇔ Σ_YY|[X,Z] = Σ_YY|Z.

– c.f. Gaussian variables:
    X ⊥ Y | Z ⇔ V_XY|Z = O,
    X ⊥ Y | Z ⇔ V_YY|[X,Z] = V_YY|Z.
– Intuition of the condition Σ_YY|[X,Z] = Σ_YY|Z:
  if we already know Z, the mean square error in predicting Y does not decrease even if the information of X is added:
    E[ Var[g(Y) | X, Z] ] = E[ Var[g(Y) | Z] ].
  In general, Σ_YY|[X,Z] ≤ Σ_YY|Z.
Empirical Estimator of the Conditional Covariance Operator

(X_1, Y_1, Z_1), …, (X_N, Y_N, Z_N): i.i.d. sample.

– Empirical conditional covariance operator:
    Σ̂_YX|Z^(N) := Σ̂_YX^(N) − Σ̂_YZ^(N) ( Σ̂_ZZ^(N) + ε_N I )^{−1} Σ̂_ZX^(N),
  where Σ_YZ → Σ̂_YZ^(N) etc. are the finite-rank empirical operators, and
    Σ_ZZ^{−1} → ( Σ̂_ZZ^(N) + ε_N I )^{−1}   (regularization for inversion).
– Estimator of the Hilbert-Schmidt norm:
    ‖Σ̂_YX|Z^(N)‖²_HS = Tr[ G_X S_Z G_Y S_Z ],
  where G_X = Q_N K_X Q_N (centered Gram matrix), Q_N = I_N − (1/N) 1_N 1_N^T, and
    S_Z = I_N − G_Z ( G_Z + Nε_N I_N )^{−1} = Nε_N ( G_Z + Nε_N I_N )^{−1}.
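A minimal sketch of the HS-norm estimator Tr[G_X S_Z G_Y S_Z], reusing the hypothetical gram_gauss helper from the HSIC sketch; eps is ε_N, and a shared bandwidth is an illustrative simplification.

```python
import numpy as np

def cond_cov_hs(X, Y, Z, eps=1e-3, sigma=1.0):
    """||Sigma_hat_{YX|Z}||_HS^2 = Tr[G_X S_Z G_Y S_Z]."""
    N = X.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N
    G_X = Q @ gram_gauss(X, sigma) @ Q
    G_Y = Q @ gram_gauss(Y, sigma) @ Q
    G_Z = Q @ gram_gauss(Z, sigma) @ Q
    S_Z = N * eps * np.linalg.inv(G_Z + N * eps * np.eye(N))
    return np.trace(G_X @ S_Z @ G_Y @ S_Z)
```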
Consistency

• Consistency of the conditional covariance operator

Theorem (FBJ 08, Sun et al. 07)
Assume ε_N → 0 and N ε_N → ∞. Then
  ‖Σ̂_YX|Z^(N) − Σ_YX|Z‖_HS → 0   in probability (N → ∞).
In particular,
  ‖Σ̂_YX|Z^(N)‖_HS → ‖Σ_YX|Z‖_HS   in probability (N → ∞).
Applications of Conditional Independence

• Conditional independence test (Fukumizu et al. 2008)
  – Estimation of graphical models from data.
• Causality
  – Causal relations among variables can be formulated in terms of conditional independence or Markov networks (Sun et al. 2007).
  – Granger causality for time series: (X_t) is not a cause of (Y_t) if
      p(Y_t | Y_{t−1}, …, Y_{t−p}, X_{t−1}, …, X_{t−p}) = p(Y_t | Y_{t−1}, …, Y_{t−p}),
    i.e.  Y_t ⊥ (X_{t−1}, …, X_{t−p}) | (Y_{t−1}, …, Y_{t−p}).
• And more.
1. Covariance operators on RKHS
2. Independence and dependence with kernels
3. Conditional independence with kernels
4. Kernel dimension reduction
Dimension Reduction for Regression

– Regression: Y: response variable, X = (X_1, …, X_m): m-dim. explanatory variable.
– Goal of dimension reduction for regression: find an effective direction for regression (EDR space):
    p(Y | X) = p̃(Y | b_1^T X, …, b_d^T X) = p̃(Y | B^T X),
  where B = (b_1, …, b_d): m × d matrix, d fixed. Equivalently,
    X ⊥ Y | B^T X,   with U = B^T X.
Kernel Dimension Reduction
(Fukumizu, Bach, Jordan JMLR 2004, AS 2009)

Use characteristic kernels for B^T X and Y. In general,
    Σ_YY|B^T X ≥ Σ_YY|X,
and
    Σ_YY|B^T X = Σ_YY|X ⇔ X ⊥ Y | B^T X   (EDR space).

– KDR objective function:
    min_{B: B^T B = I_d} Tr[ Σ_YY|B^T X ].
  Equivalently, with an ONB (ψ_i)_{i=1}^∞ of H_Y,
    min_{B: B^T B = I_d} Σ_{i=1}^∞ inf_{f_i} E[ ( (ψ_i(Y) − E[ψ_i(Y)]) − (f_i(B^T X) − E[f_i(B^T X)]) )² ].
– KDR empirical objective function:
    min_{B: B^T B = I_d} Tr[ G_Y ( G_{B^T X} + Nε_N I_N )^{−1} ].
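A sketch of the empirical KDR objective, to be minimized over B with B^T B = I_d (gram_gauss as in the HSIC sketch; the gradient/annealing optimization described on the next slide is not shown):

```python
import numpy as np

def kdr_objective(B, X, Y, eps=1e-3, sigma_u=1.0, sigma_y=1.0):
    """Tr[ G_Y (G_{B^T X} + N eps I_N)^{-1} ] for a projection matrix B (m x d)."""
    N = X.shape[0]
    Q = np.eye(N) - np.ones((N, N)) / N
    U = X @ B                                  # projected data (B^T X_i as rows)
    G_U = Q @ gram_gauss(U, sigma_u) @ Q
    G_Y = Q @ gram_gauss(Y, sigma_y) @ Q
    return np.trace(G_Y @ np.linalg.inv(G_U + N * eps * np.eye(N)))
```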
KDR Method

• Wide applicability of KDR
  – The most general approach to dimension reduction:
    • no strong model is assumed for p(Y|X) or p(X);
    • no strong assumptions on the distribution of X, Y or on the dimensionality/type of Y.
  – Most conventional methods have some restrictions, such as the elliptic assumption on p(X) for SIR.
• Computational issues
  – Non-convex objective function, possibly with local minima. → Gradient method with an annealing technique, starting from a large σ in the Gaussian RBF kernel.
  – Computational cost with matrices of sample size. → Low-rank approximation.
Consistency of KDR

Theorem (FBJ 2009)
Suppose k_d is bounded and continuous, and
  ε_N → 0,   N^{1/2} ε_N → ∞   (N → ∞).
Let S_0 be the set of the optimal parameters:
  S_0 = { B | B^T B = I_d,  Tr[Σ_YY|B^T X] = min_{B'} Tr[Σ_YY|B'^T X] }.
Estimator:
  B̂^(N) = arg min_{B: B^T B = I_d} Tr[ G_Y ( G_{B^T X} + Nε_N I_N )^{−1} ].
Then, under some conditions, for any open set U ⊃ S_0,
  Pr( B̂^(N) ∈ U ) → 1   (N → ∞).
Numerical Results with KDR

• Synthetic data
    Y = 0.5 (X_1 + X_2 + 1.5)² + (X_2 + 1)² + W,
  X ~ N(0, I_4) (dim 4), W ~ N(0, τ²), τ = 0.1, 0.4, 0.8. Sample size N = 100.

Frobenius norms between the estimator and the true projection over 100 samples (means ± standard deviations):

  τ     KDR           SIR           SAVE          pHd
  0.1   0.11 ± 0.07   0.55 ± 0.28   0.77 ± 0.35   1.04 ± 0.34
  0.4   0.17 ± 0.09   0.60 ± 0.27   0.82 ± 0.34   1.03 ± 0.33
  0.8   0.34 ± 0.22   0.69 ± 0.25   0.94 ± 0.35   1.06 ± 0.33
• Wine data
  13 dim., 178 data points. Y = 3-class label; 2-dim. projections.

[Figure: 2-dim projections of the wine data by KDR, CCA, Partial Least Squares, and Sliced Inverse Regression.]

  Gaussian kernel: k(z_1, z_2) = exp( −‖z_1 − z_2‖² / σ² ).
Summary

• Dependence analysis with RKHS
– Covariance and conditional covariance on RKHS can capture the (in)dependence and conditional (in)dependence of random variables.
– Easy estimators can be obtained for the Hilbert-Schmidt norm of the operators.
– If the normalized covariance is used, the Hilbert-Schmidt norm is independent of the kernel (it equals the χ²-divergence), provided the kernel is characteristic.
– Statistical tests of independence and conditional independence are possible with kernel measures.
– Applications: dimension reduction for regression (FBJ04, FBJ09), causal inference (Sun et al. 2007).
References
Fukumizu, K., F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics 37(4):1871-1905 (2009).
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf: Kernel Measures of Conditional Dependence. Advances in Neural Information Processing Systems 21, 489-496, MIT Press (2008).
Fukumizu, K., Bach, F.R., and Jordan, M.I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research. 5(Jan):73-99, 2004.
Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and Alex Smola. A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems 20. 585-592. MIT Press 2008.
Gretton, A., K. M. Borgwardt, M. Rasch, B. Schölkopf and A. Smola: A Kernel Method for the Two-Sample-Problem. Advances in Neural Information Processing Systems 19, 513-520. 2007.
Gretton, A., O. Bousquet, A. Smola and B. Schölkopf. Measuring Statistical Dependence with Hilbert-Schmidt Norms. Proc. Algorithmic Learning Theory (ALT2005), 63-78. 2005.
Shen, H., S. Jegelka and A. Gretton: Fast Kernel ICA using an Approximate Newton Method. AISTATS 2007.
Serfling, R. J. Approximation Theorems of Mathematical Statistics. Wiley-Interscience 1980.
Sun, X., Janzing, D. Schölkopf, B., and Fukumizu, K.: A kernel-based causal learning algorithm. Proc. 24th Annual International Conference on Machine Learning (ICML2007), 855-862 (2007)