
Page 1: Kernel Method for Bayesian Inference

Kernel Method for Bayesian Inference

Kenji Fukumizu

The Institute of Statistical Mathematics, Tokyo.

Joint work with Le Song (CMU) and Arthur Gretton (Univ. College London, Max Planck Institute)

Nov. 10, 2010. ACML2010, Tokyo (Slides revised Nov. 13)

Page 2: Kernel Method for Bayesian Inference

Outline

Introduction

Brief Review of Kernel Method

Kernel Bayesian Inference

Experimental Results

Conclusions


Page 3: Kernel Method for Bayesian Inference

Introduction

Brief Review of Kernel Method

Kernel Bayesian Inference

Experimental Results

Conclusions


Page 4: Kernel Method for Bayesian Inference

Kernel Method in a Big Picture

• Kernel method = a systematic way of mapping data into a high-dimensional reproducing kernel Hilbert space (RKHS) to extract higher-order moments or nonlinearity.

[Figure: feature map from the space of original data (points xi) to the feature space (RKHS) H]

Feature map:  X ⟹ Φ(X) (random vector on H).

• Linear statistical methods are applied on the RKHS: SVM, kernel PCA, etc.


Page 5: Kernel Method for Bayesian Inference

Overview: Inference with Kernel Mean

Basic statistics on RKHS are already useful.

• Kernel mean: E[Φ(X)] can characterize the probability of X.

• Applied to nonparametric statistical inference:
  • homogeneity test (Gretton et al. 2007),
  • independence test (Gretton et al. 2008),
  • conditional independence test (Fukumizu et al. 2008),
  • dimension reduction (F., Bach, Jordan, 2004, 2010), etc.


Page 6: Kernel Method for Bayesian Inference

Overview: Kernel Bayesian Inference

• Bayes’ rule:

q(x|y) = p(y|x)π(x) / q_Y(y),   q_Y(y) = ∫ p(y|x)π(x) dx.

• Of course, there are many ways of computing / approximating Bayes' rule, e.g. MCMC, importance sampling, sequential MC, variational methods, EP, etc. Yet its computation is challenging.

• This talk: kernel way of computing Bayes’ rule.

Express the kernel mean of the posterior by those of the prior and the likelihood.


Page 7: Kernel Method for Bayesian Inference

Introduction

Brief Review of Kernel Method

Kernel Bayesian Inference

Experimental Results

Conclusions


Page 8: Kernel Method for Bayesian Inference

Positive Definite Kernel

Def. Let X be a set. k : X × X → R is a positive definite kernel if k(x, y) = k(y, x) and for any x1, . . . , xn ∈ X the symmetric matrix

(k(x_i, x_j))_{i,j=1}^n =

  [ k(x1, x1) ⋯ k(x1, xn) ]
  [     ⋮      ⋱     ⋮    ]   (Gram matrix)
  [ k(xn, x1) ⋯ k(xn, xn) ]

is positive semidefinite.

Examples (on R^m):

• Gaussian kernel: exp(−‖x − y‖² / (2σ²)).

• Polynomial kernel: (xᵀy + c)^d (c ≥ 0, d ∈ N). H_k = {polynomials of degree ≤ d}.
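To make the definition concrete, here is a minimal numpy sketch (ours, not from the slides) that builds the Gaussian-kernel Gram matrix and numerically confirms positive semidefiniteness; the function name gauss_gram and the test data are our own choices:

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """Gram matrix of the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))
    between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # x_1, ..., x_n in R^3
G = gauss_gram(X, X, sigma=1.0)              # (k(x_i, x_j))_{i,j}
print(np.linalg.eigvalsh(G).min() >= -1e-9)  # PSD up to round-off: True
```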


Page 9: Kernel Method for Bayesian Inference

Reproducing Kernel Hilbert Space

Theorem (Moore-Aronszajn (1950))

Let k : X × X → C (or R) be a positive definite kernel on a set X. Then there uniquely exists a Hilbert space Hk consisting of functions on X such that

1. k(·, x) ∈ Hk for every x ∈ X,

2. Span{k(·, x) | x ∈ X} is dense in Hk,

3. k is the reproducing kernel on Hk, i.e.

⟨f, k(·, x)⟩_Hk = f(x) (∀x ∈ X, ∀f ∈ Hk).   (reproducing property)

The RKHS is used as a feature space, which may be infinite dimensional.


Page 10: Kernel Method for Bayesian Inference

Data Analysis with Positive Definite Kernels

• Feature map: mapping random variables or data:

X ↦ Φ(X) = k(·, X), a random variable on Hk,

X ∋ X1, . . . , Xn ↦ Φ(X1), . . . , Φ(Xn) ∈ Hk.

• Kernel trick: inner product is easily computable.

〈Φ(Xi),Φ(Xj)〉 = k(Xi, Xj) (Gram matrix)

• Linear methods are extendable to the RKHS with efficient computation.

• Typically, problems reduce to computations with Gram matrices of size n (the sample size).


Page 11: Kernel Method for Bayesian Inference

Mean and Covariance on RKHS I

X ∼ P: random variable on X. kX: pos. def. kernel on X.

• Def. mP: kernel mean of X on Hk:

mP := E[Φ(X)] = E[k(·, X)] = ∫ k(·, x) dP(x) ∈ Hk.

• Fact: 〈f,mP 〉 = E[f(X)]. (reproducing property)

• mP expresses higher-order moments of X. E.g., suppose k(u, x) = c0 + c1(ux) + c2(ux)² + ⋯ (ci > 0). Then

mP(u) = c0 + c1 E[X] u + c2 E[X²] u² + ⋯


Page 12: Kernel Method for Bayesian Inference

Mean and Covariance on RKHS II

(X, Y): random vector on X × Y, ∼ P. kX, kY: pos. def. kernels on X, Y (resp.).

• Def. (uncentered) cross-covariance operator

C^P_YX : HX → HY,   ⟨g, C^P_YX f⟩ = E[g(Y) f(X)].

• Covariance operator

C^P_XX : HX → HX,   ⟨h, C^P_XX f⟩ = E[h(X) f(X)].

• (Cross-)covariance operator = mean in the product space.

C^P_YX ⟺ mP = E[kY(·, Y) ⊗ kX(·, X)] ∈ HY ⊗ HX.

∵ ⟨g, C^P_YX f⟩ = ⟨g ⊗ f, mP⟩.


Page 13: Kernel Method for Bayesian Inference

Mean and Covariance on RKHS III

Given (X1, Y1), . . . , (Xn, Yn) ∼ P , i.i.d.,

• Empirical Estimation:

m̂_X = (1/n) Σ_{i=1}^n kX(·, Xi),

Ĉ_YX = (1/n) Σ_{i=1}^n kY(·, Yi) ⊗ kX(·, Xi).

• Typically, Gram matrix expression is obtained.

• Op(n^{−1/2})-consistency in RKHS norm is guaranteed (Gretton et al. 2005, etc.).
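As a quick illustration (ours; reusing gauss_gram and numpy from the sketch above), the fact ⟨f, m̂_X⟩ = (1/n) Σ_i f(Xi) means the empirical kernel mean turns RKHS inner products into sample averages:

```python
# f = sum_j c_j k(., Z_j) is an RKHS function; by the reproducing
# property, f(X_i) = sum_j c_j k(Z_j, X_i), so <f, m_hat_X> is a sample mean.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))        # X_1, ..., X_n ~ P
Z = rng.normal(size=(10, 2))         # expansion points of f (our choice)
c = rng.normal(size=10)              # coefficients of f
fX = gauss_gram(X, Z, 1.0) @ c       # f(X_i) for each sample point
print(fX.mean())                     # <f, m_hat_X>, an estimate of E[f(X)]
```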


Page 14: Kernel Method for Bayesian Inference

Characteristic Kernel: Representing Class

P: the set of all probabilities on a measurable space (X, B).

Def. (F., Bach, Jordan 2004, 2009) k is called characteristic if

P → H, P ↦ mP

is injective, i.e., E_{X∼P}[k(·, X)] = E_{X∼Q}[k(·, X)] ⟺ P = Q.

• Example: Gaussian kernel, Laplacian kernel (Sriperumbudur et al. 2010).

• With characteristic kernels,

Inference on P =⇒ Inference on mP

- two sample test ⇒ mP = mQ?

- independence test ⇒ mXY = mX ⊗mY ?

• Hereafter, all kernels are assumed to be characteristic.
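For example, the two-sample statistic is the RKHS distance ‖m̂_P − m̂_Q‖², the (biased) MMD estimate of Gretton et al.; a short sketch, reusing gauss_gram from the first code block:

```python
def mmd2(X, Y, sigma):
    """Biased estimate of ||m_P - m_Q||^2 (squared MMD): the squared
    RKHS norm expands into three Gram-matrix averages."""
    Kxx = gauss_gram(X, X, sigma)
    Kyy = gauss_gram(Y, Y, sigma)
    Kxy = gauss_gram(X, Y, sigma)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()
```

With a characteristic kernel, the population quantity vanishes exactly when P = Q.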

Page 15: Kernel Method for Bayesian Inference

Introduction

Brief Review of Kernel Method

Kernel Bayesian Inference

Experimental Results

Conclusions


Page 16: Kernel Method for Bayesian Inference

Bayes’ Rule

Bayes’ Rule:

q(x|y) = q(x, y) / q_Y(y) = p(y|x)π(x) / q_Y(y),   q_Y(y) = ∫ q(x, y) dx.

Π: prior (p.d.f. π(x)). P: joint distribution giving the likelihood p(y|x).

Kernel realization:

• Given mΠ (kernel mean of Π) and C^P_YX, C^P_XX (covariance operators of p(x, y)), express the kernel mean of the posterior

m_{Q_X|y} := ∫ kX(·, x) q(x|y) dx.


Page 17: Kernel Method for Bayesian Inference

Conditional Probabilities with Kernels I

• Basic Proposition (F., Bach, Jordan 2004)

If E[g(Y)|X = ·] ∈ HX for g ∈ HY, then

C^P_XX E[g(Y)|X = ·] = C^P_XY g.

∵) ⟨f, C^P_XX E[g(Y)|X = ·]⟩ = E[f(X) E[g(Y)|X]] = E[f(X) g(Y)] = ⟨f, C^P_XY g⟩.

• Expression of kernel mean of conditional probability p(y|x):

E[kY(·, Y) | X = x] = C^P_YX (C^P_XX)^{−1} kX(·, x).

(A bit naive, but can be justified.)

∵) E[g(Y)|X = ·] = (C_XX)^{−1} C_XY g ⟹

⟨g, E[kY(·, Y)|X = x]⟩ = ⟨(C_XX)^{−1} C_XY g, kX(·, x)⟩ = ⟨g, C_YX (C_XX)^{−1} kX(·, x)⟩.
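Empirically, this conditional kernel mean has a regularized Gram-matrix form, m̂_{Y|x} = Σ_i β_i(x) kY(·, Yi) with β(x) = (G_X + nεI_n)^{−1} k_X(x), as in reference [2] (Song et al. 2009). A sketch, with gauss_gram as before; ε is a regularization constant of our choosing:

```python
def cond_embed_weights(X, x, sx, eps):
    """beta(x) = (G_X + n*eps*I)^{-1} k_X(x): regularized empirical
    weights for the conditional kernel mean at X = x."""
    n = X.shape[0]
    GX = gauss_gram(X, X, sx)
    kx = gauss_gram(X, x[None, :], sx).ravel()   # (k_X(X_i, x))_i
    return np.linalg.solve(GX + n * eps * np.eye(n), kx)

# E[g(Y) | X = x] is then estimated by sum_i beta_i(x) g(Y_i).
```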


Page 18: Kernel Method for Bayesian Inference

Kernel Mean of Posterior

Q: joint probability with p.d.f. q(x, y) = p(y|x)π(x).

Kernel mean of posterior

m_{Q_X|y} := E_Q[kX(·, X) | Y = y] = C^Q_XY (C^Q_YY)^{−1} kY(·, y).

• Ingredients:

m_{Q_Y} = C^P_YX (C^P_XX)^{−1} mΠ  ⟷  q_Y(y) = ∫ p(y|x)π(x) dx.

Recall: covariance = mean in the product space.

C^Q_XY ⟺ m_Q = C^P_(XY)X (C^P_XX)^{−1} mΠ ∈ HX ⊗ HY,

C^Q_YY ⟺ m_{Q_{Y×Y}} = C^P_(YY)X (C^P_XX)^{−1} mΠ ∈ HY ⊗ HY,

where

C^P_(XY)X = E_P[(kX(·, X) ⊗ kY(·, Y)) ⊗ kX(·, X)] : HX → HX ⊗ HY,

C^P_(YY)X = E_P[(kY(·, Y) ⊗ kY(·, Y)) ⊗ kX(·, X)] : HX → HY ⊗ HY.


Page 19: Kernel Method for Bayesian Inference

Kernel Bayes' Rule

(X1, Y1), . . . , (Xn, Yn) ∼ P, i.i.d.; m̂Π = Σ_{j=1}^ℓ γj kX(·, Uj). The Gram matrix expression of the kernel mean of the posterior is

m̂_{Q_X|y} = Σ_{i=1}^n w_i(y) kX(·, Xi),   w(y) = L_Y (L_Y² + δ_n I_n)^{−1} Λ k_Y(y),

for any y ∈ Y, where

L_Y = Λ G_Y,   Λ = Diag((G_X + nε_n I_n)^{−1} G_XU γ),

k_Y = (kY(·, Y1), . . . , kY(·, Yn))ᵀ,

G_X = (kX(Xi, Xj)), G_Y = (kY(Yi, Yj)), G_XU = (kX(Xi, Uj)),

and ε_n, δ_n are regularization constants.

• The posterior is given by a weighted sample (Xi, wi(y)), while the weights may not be positive.
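The weight formula transcribes almost line for line into numpy. A sketch (ours; gauss_gram as in the first code block, with kernel bandwidths and the regularizers ε_n, δ_n left as user inputs):

```python
def kbr_weights(X, Y, U, gamma, y, sx, sy, eps, delta):
    """Kernel Bayes' rule weights: m_hat_{Q_X|y} = sum_i w_i(y) k_X(., X_i).
    (X, Y): i.i.d. sample from P; (U, gamma): prior mean expansion."""
    n = X.shape[0]
    GX = gauss_gram(X, X, sx)                      # G_X
    GY = gauss_gram(Y, Y, sy)                      # G_Y
    GXU = gauss_gram(X, U, sx)                     # G_XU
    mu = np.linalg.solve(GX + n * eps * np.eye(n), GXU @ gamma)
    LY = mu[:, None] * GY                          # L_Y = Lambda G_Y
    ky = gauss_gram(Y, y[None, :], sy).ravel()     # k_Y evaluated at y
    # w(y) = L_Y (L_Y^2 + delta_n I_n)^{-1} Lambda k_Y(y)
    return LY @ np.linalg.solve(LY @ LY + delta * np.eye(n), mu * ky)
```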


Page 20: Kernel Method for Bayesian Inference

Consistency

Theorem

Assumptions:

• π/p_X ∈ R(A_X (C^P_XX)^{1/2}).

• ‖m̂Π − mΠ‖_HX = Op(n^{−α}) (n → ∞) for some 0 < α ≤ 1/2.

• A_Y : HY → L²(P_Y), f ↦ f, is injective.

• E[f(X)|Y = ·] ∈ HY for any f ∈ HX, and S : HX → HY, f ↦ E[f(X)|Y = ·], makes (C^Q_YY)^{−ν} S bounded for some ν > 0.

With ε_n = n^{−(2/3)α} and δ_n = n^{−max{(4/15)α, 4α/(3(ν+3))}}, for any y ∈ Y,

‖m̂_{Q_X|y} − m_{Q_X|y}‖_HX = Op(n^{−min{(4/15)α, 2να/(3(ν+3))}}) (n → ∞).

Note: the rate does not depend on the dimensionality; c.f. kernel density estimation.


Page 21: Kernel Method for Bayesian Inference

Kernel Bayesian Inference I

Kernel Bayesian Inference (KBI) = 'nonparametric' Bayesian inference using the kernel Bayes' rule.

• Likelihood p(y|x) and the prior π(x) are given by samples.

Case I: The explicit form of the likelihood p(y|x) is unavailable, but sampling from p(y|x) is easy. C.f. Approximate Bayesian Computation (ABC).

Case II: The likelihood p(y|x) is unknown, but a sample from p(x, y) is given in the training phase (discussed later).

• If both p(y|x) and π(x) are known, there are many good numerical / approximation methods, such as MCMC, SMC, variational Bayes, etc.


Page 22: Kernel Method for Bayesian Inference

Kernel Bayesian Inference II

• Kernel Bayesian inference estimates the kernel mean m_{Q_X|y} = ∫ kX(·, x) q(x|y) dx, not the posterior q(x|y) itself.

• How to use it for inference?

  • Expectation: for f = Σ_{i=1}^n f_i kX(·, Xi),

    ∫ f(x) q(x|y) dx ⟵ Σ_{i=1}^n f_i w_i(y).

• Approximate MAP solution: max_x m̂_{Q_X|y}(x) may be solved iteratively.

• Time complexity: matrix inversion costs O(n³), but with a low-rank (rank r) approximation, KBI costs O(nr²).
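Continuing the kbr_weights sketch above (ours; the sample, prior expansion, and parameters are as in that block): posterior expectations come from the weighted sample, and a crude stand-in for the iterative MAP search is to score m̂(x) on candidate points, here the training points themselves:

```python
w = kbr_weights(X, Y, U, gamma, y, sx, sy, eps, delta)

post_mean = w @ X                     # weighted-sample estimate of E[X | y]

scores = gauss_gram(X, X, sx) @ w     # m_hat(X_i) = sum_j w_j k_X(X_i, X_j)
x_map = X[np.argmax(scores)]          # crude approximate MAP; the slides'
                                      # iterative maximization would refine it
```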


Page 23: Kernel Method for Bayesian Inference

KBI for Nonparametric Hidden Markov Model

Model: p(X, Y) = π(X1) ∏_{t=1}^T p(Yt|Xt) ∏_{t=1}^{T−1} q(Xt+1|Xt).

[Figure: HMM graphical model; hidden chain X0 → X1 → ⋯ → XT, each Xt emitting observation Yt]

• Assume:
  • p(y|x) and/or q(x|x′) is not known.
  • But a sample (Xt, Yt)_{t=1}^T is available in the training phase.

• Testing phase:
  • Given y1, . . . , yt, compute max_{xs} p(xs|y1, . . . , yt).
    ⟹ kernel Bayesian inference: max_{xs} m̂_{xs|y1,...,yt}.

• E.g. when measurement of hidden states is expensive, or when hidden states are measured with time delay in predicting future states.


Page 24: Kernel Method for Bayesian Inference

• Sequential filtering:

m̂_{xt|y1,...,yt} = Σ_{i=1}^T α_i^{(t)} kX(·, Xi),   α^{(t)} = α^{(t)}(y1, . . . , yt).

• Update rule:

μ^{(t+1)} = (G_X + Tε_T I_T)^{−1} G_{X,X+1} (G_X + Tε_T I_T)^{−1} G_X α^{(t)},

α^{(t+1)} = L_Y^{(t+1)} ((L_Y^{(t+1)})² + δ_T I_T)^{−1} Λ^{(t+1)} k_Y(y_{t+1}),

where G_{X,X+1} is the "transfer" matrix, (G_{X,X+1})_{ij} = kX(Xi, Xj+1),

Λ^{(t+1)} = diag(μ_1^{(t+1)}, . . . , μ_T^{(t+1)}), and L_Y^{(t+1)} = Λ^{(t+1)} G_Y.

• Prediction and smoothing are similar.

• The computational cost of each update is O(Tr²), once a low-rank (rank r) approximation is used for the training sample.
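One filtering step, transcribed into numpy (a sketch; gauss_gram as in the first code block, and the transfer matrix GXX1 is assumed precomputed from the training sample with the one-step index shift):

```python
def kbf_update(alpha, y_next, GX, GY, GXX1, Y, sy, eps, delta):
    """One kernel Bayes filter update: alpha^(t) -> alpha^(t+1),
    following the two equations above."""
    T = GX.shape[0]
    R = GX + T * eps * np.eye(T)                   # G_X + T*eps_T*I_T
    # mu^(t+1) = R^{-1} G_{X,X+1} R^{-1} G_X alpha^(t)
    mu = np.linalg.solve(R, GXX1 @ np.linalg.solve(R, GX @ alpha))
    LY = mu[:, None] * GY                          # Lambda^(t+1) G_Y
    ky = gauss_gram(Y, y_next[None, :], sy).ravel()
    # alpha^(t+1) = L_Y ((L_Y)^2 + delta_T I_T)^{-1} Lambda^(t+1) k_Y(y_{t+1})
    return LY @ np.linalg.solve(LY @ LY + delta * np.eye(T), mu * ky)
```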


Page 25: Kernel Method for Bayesian Inference

Introduction

Brief Review of Kernel Method

Kernel Bayesian Inference

Experimental Results

Conclusions


Page 26: Kernel Method for Bayesian Inference

Comparison with Approx. Bayesian Computation

Assume: p(y|x) is not explicitly known, but sampling is possible.

Approximate Bayesian Computation (ABC): an existing method of sampling from the posterior.

• Procedure (sampling and rejection):
  1. y given.
  2. Sample Xi ∼ Π, Yi ∼ p(Y |Xi).
  3. If d(y, Yi) < ε, accept Xi.
  4. Repeat 2 and 3.

The accepted sample X1, . . . , XN ∼ q(X|y) approximately.

• Exact as ε → 0, but the acceptance rate is small, particularly when the dimension of X is large.

• Proposed and used mainly in population genetics.
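For comparison, the ABC procedure above in a few lines of numpy (ours); sample_prior and sample_likelihood are hypothetical user-supplied samplers, and Euclidean distance stands in for d:

```python
import numpy as np

def abc_rejection(y_obs, sample_prior, sample_likelihood, eps, n_accept):
    """Rejection ABC: keep X_i whose simulated Y_i falls within eps of y_obs."""
    accepted = []
    while len(accepted) < n_accept:
        x = sample_prior()                       # X_i ~ Pi
        y = sample_likelihood(x)                 # Y_i ~ p(Y | X_i)
        if np.linalg.norm(y - y_obs) < eps:      # d(y, Y_i) < eps
            accepted.append(x)
    return np.array(accepted)                    # approximately ~ q(X | y_obs)
```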


Page 27: Kernel Method for Bayesian Inference

Experimental results

• Task: E[X|Y = y], evaluated at 10 different points of y; 10 random runs.

• Gaussian prior and likelihood, so that the truth can be calculated.

• Gaussian kernels are used for KBI.

• Incomplete Cholesky is used for the low-rank approximation in KBI (εn = tolerance ∝ 1/n, δn = 2εn).

[Figure: two log-log plots, "CPU time vs Error" for the 2 dim. and 6 dim. cases; x-axis CPU time (sec), y-axis average mean square errors, comparing KBI and ABC with sample sizes annotated along each curve]

Page 28: Kernel Method for Bayesian Inference

Experiments on Nonparametric Filtering

(a) Noisy rotation:

(u_{t+1}, v_{t+1})ᵀ = (cos θ_{t+1}, sin θ_{t+1})ᵀ + Z_t,   θ_{t+1} = arctan(v_t/u_t) + 0.3,

Y_t = (u_t, v_t)ᵀ + W_t,

Z_t ∼ N(0, 0.2²I₂), W_t ∼ N(0, 0.2²I₂).
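A data-generation sketch for model (a) (ours; np.arctan2 replaces arctan(v/u) to handle all quadrants, and the initial state is an arbitrary point on the unit circle):

```python
import numpy as np

def simulate_noisy_rotation(T, rng):
    """Sample hidden states X_t = (u_t, v_t) and observations Y_t."""
    X = np.zeros((T, 2))
    Y = np.zeros((T, 2))
    u, v = 1.0, 0.0                                  # initial state (our choice)
    for t in range(T):
        theta = np.arctan2(v, u) + 0.3               # theta_{t+1}
        u = np.cos(theta) + 0.2 * rng.standard_normal()   # + Z_t
        v = np.sin(theta) + 0.2 * rng.standard_normal()
        X[t] = (u, v)
        Y[t] = X[t] + 0.2 * rng.standard_normal(2)   # Y_t = X_t + W_t
    return X, Y

Xtr, Ytr = simulate_noisy_rotation(1000, np.random.default_rng(0))
```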

Approximate MAP solutions are computed by KBI.

[Figure: mean square errors vs. training sample size (200-1000) for KBF, EKF, and UKF, with a sample trajectory of the noisy rotation]

Note: KBI does not know the dynamics, while EKF and UKF use the exact knowledge.

Page 29: Kernel Method for Bayesian Inference

(b) Noisy oscillation:

(u_{t+1}, v_{t+1})ᵀ = (1 + 0.4 sin(8θ_{t+1})) (cos θ_{t+1}, sin θ_{t+1})ᵀ + Z_t,   θ_{t+1} = arctan(v_t/u_t) + 0.4,

Y_t = (u_t, v_t)ᵀ + W_t,

Z_t ∼ N(0, 0.2²I₂), W_t ∼ N(0, 0.2²I₂).

[Figure: mean square errors vs. training data size (200-800) for KBF, EKF, and UKF, with a sample trajectory of the noisy oscillation]

Page 30: Kernel Method for Bayesian Inference

Estimation of Camera Angle

• Hidden Xt: angle of a camera.

• Observed Yt: movie frame of a room + additive Gaussian noise.

• Data: synthesized by POV-Ray (http://www.povray.org). Yt: 3600 downsampled frames of 20 × 20 RGB pixels (1200 dim.). The first 1800 frames are used for training, and the second half is used for test.

[Figure: example observed frame (20 × 20 noisy rendered image of the room)]


Page 31: Kernel Method for Bayesian Inference

Results

Average MSE of estimating camera angles (10 runs):

              R9                           SO(3)
              KBI (Gauss)    Kalman (R9)   KBI (Tr)        Kalman (Q*)
σ² = 10⁻⁴     0.21 ± 0.02    1.98 ± 0.08   0.15 ± <0.01    0.56 ± 0.02
σ² = 10⁻³     0.22 ± 0.01    1.94 ± 0.06   0.21 ± 0.01     0.54 ± 0.02

• For the Kalman filter, the dynamics is estimated with a linear Gaussian model.

• In the R9 model, a Gaussian kernel is used for KBI.

• In the SO(3) model, Tr[AB] is used for KBI, and the quaternion expression for the Kalman filter.


Page 32: Kernel Method for Bayesian Inference

Introduction

Brief Review of Kernel Method

Kernel Bayesian Inference

Experimental Results

Conclusions


Page 33: Kernel Method for Bayesian Inference

Conclusions

• Kernel Bayes' rule: a kernel way of realizing Bayes' rule nonparametrically.

The kernel mean of the posterior can be computed given samples from the prior and likelihood.

• No explicit form of the likelihood is needed.

• Consistency is guaranteed, and the rate does not depend on the dimensionality.

• Computational cost is linear w.r.t. sample size if a low-rank approximation is used.

• Future / on-going works:
  • Kernel (parameter) choice?
  • Applications to various Bayesian inference problems.
  • Combination of parametric and non-parametric HMMs (parametric for transition, nonparametric for observation).


Page 34: Kernel Method for Bayesian Inference

Thank you!


Page 35: Kernel Method for Bayesian Inference

References

[1] Fukumizu, K., L. Song and A. Gretton. (2010) Kernel Bayes' rule. arXiv:1009.5736v2 [stat.ML].

[2] Song, L., J. Huang, A. Smola and K. Fukumizu. (2009) Hilbert space embeddings of conditional distributions with applications to dynamical systems. Proc. ICML 2009, 961–968.

[3] Fukumizu, K., F.R. Bach and M.I. Jordan. (2009) Kernel dimension reduction in regression. Ann. Stat. 37(4), pp. 1871–1905.

[4] Fukumizu, K., F.R. Bach and M.I. Jordan. (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR 5, pp. 73–99.
