Kernel nonparametric tests of homogeneity, independence, and multi-variable interaction
Imperial, 2015
Arthur Gretton
Gatsby Unit, CSML, UCL
Motivating example: detect differences in AM signals
[Figure: samples from P (left) and samples from Q (right)]
Case of discrete domains
• How do you compare distributions. . .
• . . .in a discrete domain? [Read and Cressie, 1988]
X1: Now disturbing reports out of Newfoundland show that the fragile snow crab industry is in serious decline. First the west coast salmon, the east coast salmon and the cod, and now the snow crabs off Newfoundland.
Y1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.
X2: To my pleasant surprise he responded that he had personally visited those wharves and that he had already announced money to fix them. What wharves did the minister visit in my riding and how much additional funding is he going to provide for Delaps Cove, Hampton, Port Lorne,
· · ·

$P_X \overset{?}{=} P_Y$
Y2: On the grain transportation system we have had the Estey report and the Kroeger report. We could go on and on. Recently programs have been announced over and over by the government such as money for the disaster in agriculture on the prairies and across Canada.
· · ·
Are the pink extracts from the same distribution as the gray ones?
Detecting statistical dependence
• How do you detect dependence. . .
• . . .in a discrete domain? [Read and Cressie, 1988]
Detecting statistical dependence, discrete domain
• How do you detect dependence. . .
• . . .in a discrete domain? [Read and Cressie, 1988]
X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.
Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.
X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.
· · ·

$P_{XY} \overset{?}{=} P_X P_Y$
Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.
· · ·
Are the French text extracts translations of the English ones?
Detecting statistical dependence, continuous domain
• How do you detect dependence. . .
• . . .in a continuous domain?
[Figure: scatter plot of a sample from P_XY in the (X, Y) plane. Is it drawn from the dependent P_XY, or from the independent P_X P_Y?]
Detecting statistical dependence, continuous domain
• How do you detect dependence. . .
• . . . in a continuous domain?
[Figure: a sample from P_XY, shown next to the discretized empirical P_XY and the discretized empirical P_X P_Y]
Detecting statistical dependence, continuous domain
• How do you detect dependence. . .
• . . .in a continuous domain?
• Problem: this fails even in “low” dimensions! [NIPS07a, ALT08]
– X and Y in R⁴, statistic = power divergence, sample size = 1024: dependence detected in 0 of 500 cases
• Too few points per bin
Outline
• An RKHS metric on the space of probability measures
– Distance between means in the space of features (RKHS)
– Nonparametric two-sample test. . .
– . . . for (almost!) any data type: strings, images, graphs, groups (rotation matrices), semigroups (histograms), . . .
– Two-sample testing for big data, and how to choose the kernel [NIPS12]
• Dependence detection
– Distance between the joint P_XY and the product of marginals P_X P_Y
– Testing three-way interactions [NIPS13]
• Nonparametric Bayesian inference
Feature mean difference
• Simple example: 2 Gaussians with different means
• Answer: t-test
[Figure: densities of two Gaussians P_X and Q_X with different means]
Feature mean difference
• Two Gaussians with same means, different variance
• Idea: look at the difference in means of features of the RVs
• In the Gaussian case: second order features of the form ϕ(x) = x²
[Figure: densities of two Gaussians P_X and Q_X with different variances]
[Figure: densities of the feature X² under P_X and Q_X]
Feature mean difference
• Gaussian and Laplace distributions
• Same mean and same variance
• Difference in means using higher order features
[Figure: Gaussian and Laplace densities P_X and Q_X]
. . . so let’s explore feature representations!
Kernels: similarity between features
Kernel:
• We have two objects x and x′ from a set X (documents, images, . . .).
How similar are they?
• Define features of objects:
– ϕx are features of x,
– ϕx′ are features of x′
• A kernel is the dot product between these features:
k(x, x′) := 〈ϕx, ϕx′〉F .
Good features make problems easy: XOR example
[Figure: XOR data; red and blue classes in the (x1, x2) plane]
• No linear classifier separates red from blue
• Map points to a higher dimensional feature space:
$$\varphi_x = \begin{bmatrix} x_1 & x_2 & x_1 x_2 \end{bmatrix}^\top \in \mathbb{R}^3$$
Good features make problems easy: XOR example
Example: a three-dimensional space of features of points in R²:
$$\varphi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \varphi_x = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix},$$
with kernel
$$k(x, y) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}^\top \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix}$$
(the standard inner product in R³ between features). Denote this feature space by F.
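Illustration (not from the original slides): a minimal NumPy sketch of this feature map, checking that the product feature x1·x2 makes the XOR problem linearly separable. The names phi, k, and the data are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Feature map phi_x = [x1, x2, x1*x2] for a point x in R^2."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """Kernel: the standard inner product between features in R^3."""
    return phi(x) @ phi(y)

# XOR-style data: class = sign(x1 * x2), not linearly separable in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = np.sign(X[:, 0] * X[:, 1])

# In feature space, the linear function f = [0, 0, 1] separates perfectly:
f = np.array([0.0, 0.0, 1.0])
preds = np.sign(np.array([f @ phi(x) for x in X]))
print((preds == labels).mean())  # 1.0
```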
Functions of features
Define a linear function of the inputs x1, x2, and their product x1x2,
f(x) = f1x1 + f2x2 + f3x1x2.
f is in a space of functions mapping from X = R² to R. Equivalent representation for f:
$$f = \begin{bmatrix} f_1 & f_2 & f_3 \end{bmatrix}^\top.$$
f refers to the function as an object (here, a vector in R³);
f(x) ∈ R is the function evaluated at a point (a real number).
$$f(x) = f^\top \varphi_x = \langle f, \varphi_x \rangle_F$$
Evaluation of f at x is an inner product in the feature space (here the standard inner product in R³).
F is a space of functions mapping R² to R.
The reproducing property
This example illustrates the two defining features of an RKHS:
• The reproducing property: ∀x ∈ X , ∀f ∈ F , 〈f, ϕx〉F = f(x)
• In particular, for any x, y ∈ X ,
k(x, y) = 〈ϕx, ϕy〉F .
Note: the feature map of every point is in the feature space: ∀x ∈ X , ϕx ∈ F
Infinite dimensional feature space
Reproducing property for a function with the Gaussian kernel:
$$f(x) := \sum_{i=1}^m \alpha_i k(x_i, x) = \Bigl\langle \sum_{i=1}^m \alpha_i \varphi_{x_i},\; \varphi_x \Bigr\rangle_F.$$
[Figure: f(x), a sum of Gaussian bumps centred at the points x_i]
• What do the features ϕx look like (there are infinitely many of them!)
• What do these features have to do with smoothness?
Infinite dimensional feature space
Gaussian kernel, $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$:
$$\lambda_k \propto b^k \;\; (b < 1), \qquad e_k(x) \propto \exp(-(c - a)x^2)\, H_k(x\sqrt{2c}),$$
where a, b, c are functions of σ, and H_k is the kth order Hermite polynomial.
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i e_i(x) e_i(x') = \sum_{i=1}^{\infty} \left(\sqrt{\lambda_i}\, e_i(x)\right) \left(\sqrt{\lambda_i}\, e_i(x')\right) = \sum_{i=1}^{\infty} \varphi_x^{(i)} \varphi_{x'}^{(i)}$$
Infinite dimensional feature space
Example RKHS function, Gaussian kernel:
$$f(x) := \sum_{i=1}^m \alpha_i k(x_i, x) = \sum_{i=1}^m \alpha_i \sum_{j=1}^{\infty} \lambda_j e_j(x_i) e_j(x) = \sum_{j=1}^{\infty} f_j \left[\sqrt{\lambda_j}\, e_j(x)\right]$$
where $f_j = \sum_{i=1}^m \alpha_i \sqrt{\lambda_j}\, e_j(x_i)$.
[Figure: an RKHS function f(x) built from Gaussian bumps]
NOTE that this enforces smoothing: the λ_j decay as the e_j become rougher, so the f_j must decay, since ‖f‖²_F = ∑_j f_j² < ∞.
Probabilities in feature space: the mean trick
The kernel trick
• Given x ∈ X for some set X, define the feature map ϕ_x ∈ F,
$$\varphi_x = \begin{bmatrix} \ldots & \sqrt{\lambda_i}\, e_i(x) & \ldots \end{bmatrix} \in \ell_2$$
• For positive definite k(x, x′),
k(x, x′) = ⟨ϕ_x, ϕ_{x′}⟩_F
• The kernel trick: ∀f ∈ F,
f(x) = ⟨f, ϕ_x⟩_F

The mean trick
• Given P a Borel probability measure on X, define the feature map µ_P ∈ F,
$$\mu_P = \begin{bmatrix} \ldots & \sqrt{\lambda_i}\, E_P[e_i(X)] & \ldots \end{bmatrix} \in \ell_2$$
• For positive definite k(x, x′),
E_{P,Q} k(X, Y) = ⟨µ_P, µ_Q⟩_F
for X ∼ P and Y ∼ Q.
• The mean trick: (we call µ_P a mean/distribution embedding)
E_P(f(X)) =: ⟨µ_P, f⟩_F
Feature embeddings of probabilities
The kernel trick:
f(x) = ⟨f, ϕ_x⟩_F
The mean trick:
E_P(f(X)) = ⟨f, µ_P⟩_F
Empirical mean embedding:
$$\hat\mu_P = \frac{1}{m} \sum_{i=1}^m \varphi_{x_i}, \qquad x_i \overset{\text{i.i.d.}}{\sim} P$$
µ_P gives you expectations of all RKHS functions.
When k is characteristic, µ_P is unique to P; e.g. Gaussian, Laplace, . . .
The maximum mean discrepancy
The maximum mean discrepancy is the squared distance between feature means:
$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_F^2 = \langle \mu_P, \mu_P \rangle_F + \langle \mu_Q, \mu_Q \rangle_F - 2\langle \mu_P, \mu_Q \rangle_F = E_P k(x, x') + E_Q k(y, y') - 2 E_{P,Q} k(x, y)$$
An unbiased empirical estimate: for $\{x_i\}_{i=1}^m \sim P$ and $\{y_i\}_{i=1}^m \sim Q$,
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(y_i, y_j) - \frac{2}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(y_i, x_j)$$
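Illustration (not part of the original slides): a NumPy sketch of this unbiased quadratic-time estimate. The Gaussian kernel and the helper names are assumptions for the example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased quadratic-time estimate of MMD^2 for samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Within-sample sums run over i != j only, which removes the bias.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

# Example: P = N(0,1), Q = N(1,1) in 1-D; the estimate is far from zero.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (500, 1)), rng.normal(1, 1, (500, 1))))
```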
Maximum mean discrepancy: main idea
$$K_{PP} = \begin{bmatrix} k(X_1, X_1) & \cdots & k(X_1, X_n) \\ \vdots & \ddots & \vdots \\ k(X_n, X_1) & \cdots & k(X_n, X_n) \end{bmatrix} \qquad K_{QQ} = \begin{bmatrix} k(Y_1, Y_1) & \cdots & k(Y_1, Y_n) \\ \vdots & \ddots & \vdots \\ k(Y_n, Y_1) & \cdots & k(Y_n, Y_n) \end{bmatrix} \qquad K_{PQ} = \begin{bmatrix} k(X_1, Y_1) & \cdots & k(X_1, Y_n) \\ \vdots & \ddots & \vdots \\ k(X_n, Y_1) & \cdots & k(X_n, Y_n) \end{bmatrix}$$
[Each matrix shown as a heatmap on the slide]
Maximum mean discrepancy: main idea
$$P \neq Q \;\Rightarrow\; \begin{bmatrix} K_{PP} & K_{PQ} \\ K_{QP} & K_{QQ} \end{bmatrix} = \text{[heatmap]} \qquad\qquad P = Q \;\Rightarrow\; \begin{bmatrix} K_{PP} & K_{PQ} \\ K_{QP} & K_{QQ} \end{bmatrix} = \text{[heatmap]}$$
Statistical hypothesis testing
Statistical test using MMD
• Two hypotheses:
– H0: null hypothesis (P = Q)
– H1: alternative hypothesis (P 6= Q)
• Observe samples x := {x1, . . . , xm} from P and y from Q
• If the empirical MMD² is
– “far from zero”: reject H0
– “close to zero”: accept H0
Behavior of empirical MMD
$$V_n := \widehat{\mathrm{MMD}}^2 = \overline{K_{PP}} + \overline{K_{QQ}} - 2\,\overline{K_{PQ}}$$
(averages of the entries of the corresponding kernel matrices)
Statistical test using MMD
• When P = Q, the U-statistic is degenerate [Gretton et al., 2007, 2012a]
• Its asymptotic distribution is
$$m\,\mathrm{MMD}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[z_l^2 - 2\right]$$
• where
– $z_l \sim \mathcal{N}(0, 2)$ i.i.d.
– $\displaystyle\int_{\mathcal{X}} \underbrace{\tilde k(x, x')}_{\text{centred}} \psi_l(x)\, dP(x) = \lambda_l \psi_l(x')$
[Figure: empirical density of the MMD under H0]
Approximate distribution of MMD via permutation
With a vector of signs
$$w = (-1, 1, 1, \cdots, -1, -1, -1)^\top$$
(one entry per pooled sample; the signs encode a random reassignment of the pooled samples between P and Q),
$$V_{n,p} := \widehat{\mathrm{MMD}}^2_p = \overline{\begin{bmatrix} K_{PP} & K_{PQ} \\ K_{QP} & K_{QQ} \end{bmatrix} \odot \left[w w^\top\right]}$$
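Illustration (not part of the original slides): a sketch of the permutation null. Permuting the pooled sample permutes the rows and columns of the joint kernel matrix, so the matrix is computed once; the gaussian_kernel helper from the earlier sketch is an assumed example.

```python
import numpy as np

def mmd2_biased_from_K(K, m):
    """Biased MMD^2 from the pooled kernel matrix; first m points are from P."""
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def mmd_permutation_test(X, Y, kernel, n_perms=1000, alpha=0.05, seed=0):
    """Reject H0: P = Q when the statistic exceeds the permutation quantile."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    K = kernel(Z, Z)
    m = len(X)
    stat = mmd2_biased_from_K(K, m)
    null = np.empty(n_perms)
    for t in range(n_perms):
        idx = rng.permutation(len(Z))
        null[t] = mmd2_biased_from_K(K[np.ix_(idx, idx)], m)
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold
```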
Experiment: amplitude modulated signals
[Figure: samples from P (left) and samples from Q (right)]
Results: AM signals
[Figure: Type II error vs added noise, for the methods max ratio, opt, med, l2, maxmmd]
m = 10,000 and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.
Linear-cost test and kernel choice approach from Gretton et al. [2012b]
Kernel two-sample tests for big data, optimal kernel choice
Quadratic time estimate of MMD
$$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P k(x, x') + E_Q k(y, y') - 2 E_{P,Q} k(x, y)$$
Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:
The earlier estimate (quadratic time):
$$\widehat{E_P k(x, x')} = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x_i, x_j)$$
New, linear time estimate:
$$\widehat{E_P k(x, x')} = \frac{2}{m} \left[k(x_1, x_2) + k(x_3, x_4) + \ldots\right] = \frac{2}{m} \sum_{i=1}^{m/2} k(x_{2i-1}, x_{2i})$$
Linear time MMD
Shorter expression with explicit k dependence:
$$\mathrm{MMD}^2 =: \eta_k(p, q) = E_{xx'yy'}\, h_k(x, x', y, y') =: E_v h_k(v),$$
where
$$h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y),$$
and v := (x, x', y, y').
The linear time estimate again:
$$\hat\eta_k = \frac{2}{m} \sum_{i=1}^{m/2} h_k(v_i),$$
where $v_i := (x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i})$ and
$$h_k(v_i) := k(x_{2i-1}, x_{2i}) + k(y_{2i-1}, y_{2i}) - k(x_{2i-1}, y_{2i}) - k(x_{2i}, y_{2i-1})$$
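Illustration (not part of the original slides): a sketch of the linear-time statistic, together with the Gaussian threshold derived on the next slides. k_rowwise is an assumed row-wise kernel helper.

```python
import numpy as np
from scipy.stats import norm

def k_rowwise(A, B, sigma=1.0):
    """Gaussian kernel evaluated row-wise on two (n, d) arrays."""
    return np.exp(-np.sum((A - B) ** 2, axis=1) / (2 * sigma**2))

def mmd2_linear(X, Y, k=k_rowwise):
    """Linear-time estimate: average h_k over non-overlapping 4-tuples.
    X and Y are (m, d) arrays with m even."""
    h = (k(X[0::2], X[1::2]) + k(Y[0::2], Y[1::2])
         - k(X[0::2], Y[1::2]) - k(X[1::2], Y[0::2]))
    return h.mean(), h.var(ddof=1)

def linear_time_test(X, Y, alpha=0.05):
    """Since m^(1/2)(eta_hat - eta) -> N(0, 2 sigma_k^2), the level-alpha
    threshold is t = Phi^{-1}(1 - alpha) * sqrt(2 var(h) / m)."""
    eta_hat, var_h = mmd2_linear(X, Y)
    t = norm.ppf(1 - alpha) * np.sqrt(2 * var_h / len(X))
    return eta_hat > t
```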
Linear time vs quadratic time MMD
Disadvantages of linear time MMD vs quadratic time MMD
• Much higher variance for a given m, hence. . .
• . . .a much less powerful test for a given m
Advantages of the linear time MMD vs quadratic time MMD
• Very simple asymptotic null distribution (a Gaussian, vs an infinite
weighted sum of χ2)
• Both test statistic and threshold computable in O(m), with storage O(1).
• Given unlimited data, a given Type II error can be attained with less
computation
Asymptotics of linear time MMD
By the central limit theorem,
$$m^{1/2}\left(\hat\eta_k - \eta_k(p, q)\right) \xrightarrow{D} \mathcal{N}(0, 2\sigma_k^2)$$
• assuming $0 < E(h_k^2) < \infty$ (true for bounded k)
• $\sigma_k^2 = E_v h_k^2(v) - \left[E_v h_k(v)\right]^2$.
Hypothesis test
Hypothesis test of asymptotic level α:
$$t_{k,\alpha} = m^{-1/2} \sigma_k \sqrt{2}\, \Phi^{-1}(1 - \alpha),$$
where Φ⁻¹ is the inverse CDF of N(0, 1).
[Figure: null distribution of the linear-time MMD, P(η̂_k); threshold t_{k,α} at the (1 − α) quantile, with the Type I error to its right]
[Figure: null vs alternative distribution of η̂_k; the Type II error is the mass of the alternative, centred at η_k(p, q), falling below t_{k,α}]
The best kernel: minimizes Type II error
Type II error: η̂_k falls below the threshold t_{k,α} while η_k(p, q) > 0.
Prob. of a Type II error:
$$P\left(\hat\eta_k < t_{k,\alpha}\right) = \Phi\left(\Phi^{-1}(1 - \alpha) - \frac{\eta_k(p, q)\sqrt{m}}{\sigma_k \sqrt{2}}\right)$$
where Φ is the Normal CDF.
Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
$$k_* = \arg\max_{k \in \mathcal{K}} \eta_k(p, q)\, \sigma_k^{-1},$$
where K is the family of kernels under consideration.
Learning the best kernel in a family
Define the family of kernels as follows:
$$\mathcal{K} := \left\{ k : k = \sum_{u=1}^d \beta_u k_u,\; \|\beta\|_1 = D,\; \beta_u \geq 0,\; \forall u \in \{1, \ldots, d\} \right\}.$$
Properties, if at least one β_u > 0:
• all k ∈ K are valid kernels,
• if all k_u are characteristic, then k is characteristic.
Test statistic
The squared MMD becomes
$$\eta_k(p, q) = \|\mu_k(p) - \mu_k(q)\|^2_{\mathcal{F}_k} = \sum_{u=1}^d \beta_u \eta_u(p, q),$$
where $\eta_u(p, q) := E_v h_u(v)$.
Denote:
• $\beta = (\beta_1, \beta_2, \ldots, \beta_d)^\top \in \mathbb{R}^d$,
• $h = (h_1, h_2, \ldots, h_d)^\top \in \mathbb{R}^d$, with
$h_u(x, x', y, y') = k_u(x, x') + k_u(y, y') - k_u(x, y') - k_u(x', y)$
• $\eta = E_v(h) = (\eta_1, \eta_2, \ldots, \eta_d)^\top \in \mathbb{R}^d$.
Quantities for the test:
$$\eta_k(p, q) = E(\beta^\top h) = \beta^\top \eta, \qquad \sigma_k^2 := \beta^\top \mathrm{cov}(h)\, \beta.$$
Optimization of the ratio η_k(p, q)σ_k⁻¹
Empirical test parameters:
$$\hat\eta_k = \beta^\top \hat\eta, \qquad \hat\sigma_{k,\lambda} = \sqrt{\beta^\top \left(\hat Q + \lambda_m I\right) \beta},$$
where Q̂ is the empirical estimate of cov(h).
Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̂_k, σ̂_k on the data to be tested (why?)
Objective:
$$\beta^* = \arg\max_{\beta \succeq 0} \hat\eta_k \hat\sigma_{k,\lambda}^{-1} = \arg\max_{\beta \succeq 0} \left(\beta^\top \hat\eta\right) \left(\beta^\top \left(\hat Q + \lambda_m I\right) \beta\right)^{-1/2} =: \alpha(\beta; \hat\eta, \hat Q)$$
Optimization of the ratio η_k(p, q)σ_k⁻¹
Assume: η̂ has at least one positive entry.
Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0, and thus α(β*; η̂, Q̂) > 0.
Solve the easier problem: $\beta^* = \arg\max_{\beta \succeq 0} \alpha^2(\beta; \hat\eta, \hat Q)$.
Quadratic program:
$$\min\left\{ \beta^\top \left(\hat Q + \lambda_m I\right) \beta \;:\; \beta^\top \hat\eta = 1,\; \beta \succeq 0 \right\}$$
What if η̂ has no positive entries?
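Illustration (not part of the original slides): one way to solve this small QP, here with SciPy's general-purpose SLSQP solver rather than a dedicated QP solver. eta_hat and Q_hat are the training-set estimates defined above, and D = 1 is assumed for the final rescaling.

```python
import numpy as np
from scipy.optimize import minimize

def select_kernel_weights(eta_hat, Q_hat, lam):
    """Solve min beta' (Q_hat + lam I) beta  s.t.  beta' eta_hat = 1, beta >= 0.
    Requires eta_hat to have at least one positive entry (else infeasible)."""
    d = len(eta_hat)
    A = Q_hat + lam * np.eye(d)
    res = minimize(
        lambda b: b @ A @ b,            # quadratic objective
        x0=np.full(d, 1.0 / d),
        jac=lambda b: 2 * A @ b,
        bounds=[(0.0, None)] * d,       # beta >= 0
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
        method="SLSQP",
    )
    beta = res.x
    return beta / beta.sum()            # rescale onto ||beta||_1 = D = 1
```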
Test procedure
1. Split the data into training and test sets.
2. On the training data:
(a) Compute η̂_u for all k_u ∈ K
(b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K
3. On the test data:
(a) Compute η̂_{k*} using $k_* = \sum_{u=1}^d \beta^*_u k_u$
(b) Compute the test threshold t_{α,k*} using σ̂_{k*}
4. Reject the null if η̂_{k*} > t_{α,k*}
Convergence bounds
Assume a bounded kernel, with σ_k bounded away from 0.
If $\lambda_m = \Theta(m^{-1/3})$ then
$$\left| \sup_{k \in \mathcal{K}} \hat\eta_k \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k \sigma_k^{-1} \right| = O_P\left(m^{-1/3}\right).$$
Idea:
$$\left| \sup_{k \in \mathcal{K}} \hat\eta_k \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k \sigma_k^{-1} \right| \leq \sup_{k \in \mathcal{K}} \left| \hat\eta_k \hat\sigma_{k,\lambda}^{-1} - \eta_k \sigma_{k,\lambda}^{-1} \right| + \sup_{k \in \mathcal{K}} \left| \eta_k \sigma_{k,\lambda}^{-1} - \eta_k \sigma_k^{-1} \right| \leq \frac{\sqrt{d}}{D\sqrt{\lambda_m}} \left( C_1 \sup_{k \in \mathcal{K}} |\hat\eta_k - \eta_k| + C_2 \sup_{k \in \mathcal{K}} |\hat\sigma_{k,\lambda} - \sigma_{k,\lambda}| \right) + C_3 D^2 \lambda_m
$$
Experiments
Competing approaches:
• Median heuristic
• Max. MMD: choose the k_u ∈ K with the largest η̂_u
– same as maximizing β^⊤η̂ subject to ‖β‖₁ ≤ 1
• ℓ₂ statistic: maximize β^⊤η̂ subject to ‖β‖₂ ≤ 1
• Cross validation on the training set
Also compare with:
• the single kernel that maximizes the ratio η_k(p, q)σ_k⁻¹
Blobs: data
Difficult problems: lengthscale of the difference in distributions not the same
as that of the distributions.
We distinguish a field of Gaussian blobs with different covariances.
[Figure: blob data, samples from p and from q in the (x1, x2) plane]
Ratio ε = 3.2 of largest to smallest eigenvalues of the blobs in q.
Blobs: results
[Figure: Type II error vs ε ratio, for the methods max ratio, opt, l2, maxmmd, xval, xvalc, med]
Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of the blobs in q. Results are averages over 617 trials.
Blobs: results
[The same figure repeated, highlighting in turn: optimizing the ratio η_k(p, q)σ_k⁻¹; maximizing η_k(p, q) with a constraint on β; and the median heuristic]
Feature selection: data
Idea: no single best kernel.
Each of the ku are univariate (along a single coordinate)
[Figure: selection data, samples from p and q in the (x1, x2) plane]
Feature selection: results
[Figure: Type II error vs dimension for feature selection, comparing max ratio, opt, l2, maxmmd; a single best kernel vs a linear combination]
m = 10,000, average over 5000 trials
Amplitude modulated signals
Given an audio signal s(t), an amplitude modulated signal can be defined as
$$u(t) = \sin(\omega_c t)\left[a\, s(t) + l\right]$$
• ω_c: carrier frequency
• a = 0.2 is the signal scaling, l = 2 is the offset
Two amplitude modulated signals from the same artist (in this case, the Magnetic Fields).
• Music sampled at 8 kHz (very low)
• Carrier frequency is 24 kHz
• AM signal observed at 120 kHz
• Samples are extracts of length N = 1000, approx. 0.01 sec (very short)
• Total dataset size is 30,000 samples from each of p, q.
Amplitude modulated signals
[Figure: samples from P (left) and samples from Q (right)]
Results: AM signals
[Figure: Type II error vs added noise, for the methods max ratio, opt, med, l2, maxmmd]
m = 10,000 (for training and test) and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.
Observations on kernel choice
• It is possible to choose the best kernel for a kernel two-sample test.
• Kernel choice matters for “difficult” problems, where the distributions differ on a lengthscale different to that of the data.
• Ongoing work:
– quadratic time statistic
– avoiding the training/test split
Testing Independence
MMD for independence
• Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
Related to [Feuerverger, 1993] and [Szekely and Rizzo, 2009, Szekely et al., 2007]
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) := \left\| \mu_{P_{XY}} - \mu_{P_X P_Y} \right\|^2$$
HSIC using expectations of kernels:
Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) = E_{XY} E_{X'Y'} k(x, x') l(y, y') + E_X E_{X'} k(x, x')\, E_Y E_{Y'} l(y, y') - 2 E_{X'Y'} \left[ E_X k(x, x')\, E_Y l(y, y') \right].$$
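Illustration (not part of the original slides): the biased V-statistic estimate of HSIC used later in the talk, (1/n²)(HKH ∘ HLH)₊₊, where ₊₊ denotes the sum of all entries. A permutation of the Y sample (shuffling the rows and columns of L together) gives a null distribution for an independence test.

```python
import numpy as np

def hsic_biased(K, L):
    """Biased V-statistic estimate of HSIC from kernel matrices K (on the
    X sample) and L (on the Y sample): (1/n^2) * sum((HKH) * (HLH))."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return ((H @ K @ H) * (H @ L @ H)).sum() / n**2
```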
Experiment: dependence testing for translation
• Translation example: [NIPS07b] Canadian Hansard (agriculture)
• 5-line extracts, k-spectrum kernel with k = 10, 300 repetitions, sample size 10
• Empirical MMD(P_XY, P_X P_Y):
$$\frac{1}{n^2}\left(HKH \circ HLH\right)_{++}$$
(K computed on one language's extracts, L on the other's)
• k-spectrum kernel: average Type II error 0 (α = 0.05)
• Bag of words kernel: average Type II error 0.18
Lancaster (3-way) Interactions
Detecting a higher order interaction
• How to detect V-structures with pairwise weak individual dependence?
• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
[Figure: pairwise scatter plots X vs Y, Y vs Z, X vs Z, and XY vs Z; V-structure X → Z ← Y]
• X, Y i.i.d. ∼ N(0, 1)
• Z | X, Y ∼ sign(XY) Exp(1/√2)
Faithfulness is violated here.
V-structure Discovery
[V-structure: X → Z ← Y]
Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:
• A consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or
• A factorisation test: H0 : (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests)
– compute p-values for each of the marginal tests (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, and (X, Y) ⊥⊥ Z
– apply the Holm-Bonferroni (HB) sequentially rejective correction (Holm 1979)
V-structure Discovery (2)
• How to detect V-structures with pairwise weak (or nonexistent) dependence?
• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
[Figure: pairwise scatter plots X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1·Y1 vs Z1; V-structure X → Z ← Y]
• X1, Y1 i.i.d. ∼ N(0, 1)
• Z1 | X1, Y1 ∼ sign(X1 Y1) Exp(1/√2)
• X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ∼ N(0, I_{p−1})
Faithfulness is violated here.
V-structure Discovery (3)
[Figure: null acceptance rate (Type II error) vs dimension, for the CI test X ⊥⊥ Y | Z and a two-variable factorisation test; Dataset A]
Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al (2011), and a factorisation test with a HB correction, n = 500
Lancaster Interaction Measure
[Bahadur (1961); Lancaster (1969)] The interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.
• D = 2: ∆_L P = P_{XY} − P_X P_Y
• D = 3: ∆_L P = P_{XYZ} − P_X P_{YZ} − P_Y P_{XZ} − P_Z P_{XY} + 2 P_X P_Y P_Z
[Diagram: the five terms of ∆_L P drawn as graphs over (X, Y, Z); in the case X ⊥⊥ (Y, Z), ∆_L P = 0]
(X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆_L P = 0.
. . . so what might be missed?
∆_L P = 0 ⇏ (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X
Example:
P(0, 0, 0) = 0.2  P(0, 0, 1) = 0.1  P(1, 0, 0) = 0.1  P(1, 0, 1) = 0.1
P(0, 1, 0) = 0.1  P(0, 1, 1) = 0.1  P(1, 1, 0) = 0.1  P(1, 1, 1) = 0.2
A Test using the Lancaster Measure
• The test statistic is the empirical estimate of $\|\mu_\kappa(\Delta_L P)\|^2_{\mathcal{H}_\kappa}$, where κ = k ⊗ l ⊗ m:
$$\|\mu_\kappa(P_{XYZ} - P_{XY} P_Z - \cdots)\|^2_{\mathcal{H}_\kappa} = \langle \mu_\kappa P_{XYZ}, \mu_\kappa P_{XYZ} \rangle_{\mathcal{H}_\kappa} - 2 \langle \mu_\kappa P_{XYZ}, \mu_\kappa P_{XY} P_Z \rangle_{\mathcal{H}_\kappa} + \cdots$$
Inner Product Estimators

Table 1: V-statistic estimators of ⟨µ_κ ν, µ_κ ν′⟩_{H_κ}

ν \ ν′       | P_XYZ        | P_XY P_Z      | P_XZ P_Y      | P_YZ P_X      | P_X P_Y P_Z
P_XYZ        | (K∘L∘M)₊₊    | ((K∘L)M)₊₊    | ((K∘M)L)₊₊    | ((M∘L)K)₊₊    | tr(K₊ ∘ L₊ ∘ M₊)
P_XY P_Z     |              | (K∘L)₊₊ M₊₊   | (MKL)₊₊       | (KLM)₊₊       | (KL)₊₊ M₊₊
P_XZ P_Y     |              |               | (K∘M)₊₊ L₊₊   | (KML)₊₊       | (KM)₊₊ L₊₊
P_YZ P_X     |              |               |               | (L∘M)₊₊ K₊₊   | (LM)₊₊ K₊₊
P_X P_Y P_Z  |              |               |               |               | K₊₊ L₊₊ M₊₊

$$\|\mu_\kappa(\Delta_L P)\|^2_{\mathcal{H}_\kappa} = \frac{1}{n^2}\left(HKH \circ HLH \circ HMH\right)_{++}.$$
An empirical joint central moment in the feature space.
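Illustration (not part of the original slides): a direct sketch of this statistic, where K, L, M are the kernel matrices on the X, Y, Z samples.

```python
import numpy as np

def lancaster_statistic(K, L, M):
    """Empirical ||mu_kappa(Delta_L P)||^2 = (1/n^2) (HKH ∘ HLH ∘ HMH)_{++}."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H
    return (Kc * Lc * Mc).sum() / n**2
```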
Example A: factorisation tests
[Figure: null acceptance rate (Type II error) vs dimension, for the CI test, the Lancaster factorisation test (∆_L), and the two-variable factorisation test; Dataset A]
Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al (2011), n = 500
Example B: joint dependence can be easier to detect
• X1, Y1 i.i.d. ∼ N(0, 1)
• Z1 is a mixture:
$$Z_1 = \begin{cases} X_1^2 + \epsilon, & \text{w.p. } 1/3, \\ Y_1^2 + \epsilon, & \text{w.p. } 1/3, \\ X_1 Y_1 + \epsilon, & \text{w.p. } 1/3, \end{cases}$$
where ε ∼ N(0, 0.1²).
• X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ∼ N(0, I_{p−1})
• the dependence of Z on the pair (X, Y) is stronger than on X and Y individually
• faithfulness is satisfied here
Example B: factorisation tests
[Figure: null acceptance rate (Type II error) vs dimension, for the CI test, the Lancaster factorisation test (∆_L), and the two-variable factorisation test; Dataset B]
Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al (2011), n = 500
Interaction for D ≥ 4
• Interaction measure valid for all D (Streitberg, 1990):
$$\Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1} (|\pi| - 1)!\, J_\pi P$$
– For a partition π, J_π associates to the joint the corresponding factorisation, e.g., $J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}$.
[Figure: the number of partitions of {1, . . . , D} (the Bell numbers) grows extremely fast with D]
joint central moments (Lancaster interaction) vs. joint cumulants (Streitberg interaction)
Total independence test
• Total independence test:
H0 : P_{XYZ} = P_X P_Y P_Z vs. H1 : P_{XYZ} ≠ P_X P_Y P_Z
• For (X1, . . . , XD) ∼ P_X and $\kappa = \bigotimes_{i=1}^D k^{(i)}$:
$$\Bigl\| \mu_\kappa \underbrace{\Bigl( P_{\mathbf{X}} - \prod_{i=1}^D P_{X_i} \Bigr)}_{\Delta_{tot} P} \Bigr\|^2_{\mathcal{H}_\kappa} = \frac{1}{n^2} \sum_{a=1}^n \sum_{b=1}^n \prod_{i=1}^D K^{(i)}_{ab} - \frac{2}{n^{D+1}} \sum_{a=1}^n \prod_{i=1}^D \sum_{b=1}^n K^{(i)}_{ab} + \frac{1}{n^{2D}} \prod_{i=1}^D \sum_{a=1}^n \sum_{b=1}^n K^{(i)}_{ab}.$$
• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: a similar relationship to that between dCov and HSIC (DS et al, 2013)
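Illustration (not part of the original slides): the three terms of the total independence statistic, computed from per-variable kernel matrices.

```python
import numpy as np

def total_independence_statistic(Ks):
    """Empirical ||mu_kappa(Delta_tot P)||^2 for D variables, given a list
    Ks = [K1, ..., KD] of n x n kernel matrices, one per variable."""
    n = Ks[0].shape[0]
    D = len(Ks)
    prod_ab = np.ones((n, n))   # elementwise prod_i K_i[a, b]
    row = np.ones(n)            # prod_i sum_b K_i[a, b]
    tot = 1.0                   # prod_i sum_{a,b} K_i[a, b]
    for K in Ks:
        prod_ab *= K
        row *= K.sum(axis=1)
        tot *= K.sum()
    return (prod_ab.sum() / n**2
            - 2 * row.sum() / n**(D + 1)
            + tot / n**(2 * D))
```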
Example B: total independence tests
[Figure: null acceptance rate (Type II error) vs dimension, for the total independence tests based on ∆_tot P and on ∆_L P; Dataset B]
Figure 4: Total independence: ∆_tot P vs. ∆_L P, n = 500
Nonparametric Bayesian inference using distribution embeddings
Motivating Example: Bayesian inference without a model
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200)
• 1800 training frames, the remainder for test
• Gaussian noise added to Y_t
Challenges:
• No parametric model of the camera dynamics (only samples)
• No parametric model of the map from camera angle to image (only samples)
• Want to do filtering: Bayesian inference
ABC: an approach to Bayesian inference without a model
Bayes rule:
$$P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$$
• P(x|y) is the likelihood
• π(y) is the prior
One approach: Approximate Bayesian Computation (ABC)
ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure: ABC demonstration. Samples (y, x) with y drawn from the prior π(Y) and x from the likelihood P(X|y); given an observation x*, the accepted y values approximate P(Y|x*)]
Needed: a distance measure D and a tolerance parameter τ.
ABC: an approach to Bayesian inference without a model
Bayes rule:
$$P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$$
• P(x|y) is the likelihood
• π(y) is the prior
ABC generates a sample from P(Y|x*) as follows:
1. generate a sample y_t from the prior π,
2. generate a sample x_t from P(X|y_t),
3. if D(x*, x_t) < τ, accept y = y_t; otherwise reject,
4. go to (1).
In step (3), D is a distance measure, and τ is a tolerance parameter.
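Illustration (not part of the original slides): a minimal rejection-ABC sketch; the samplers, the distance, and the conjugate Gaussian toy example are assumptions for the demonstration.

```python
import numpy as np

def abc_rejection(x_star, sample_prior, sample_likelihood, dist, tau,
                  n_samples, seed=0):
    """Rejection ABC: draw y ~ prior, x ~ P(X|y); keep y when D(x*, x) < tau."""
    rng = np.random.default_rng(seed)
    accepted = []
    while len(accepted) < n_samples:
        y = sample_prior(rng)
        x = sample_likelihood(y, rng)
        if dist(x_star, x) < tau:
            accepted.append(y)
    return np.array(accepted)

# Toy example: y ~ N(0, 1), x | y ~ N(y, 1), observation x* = 1.
post = abc_rejection(
    x_star=1.0,
    sample_prior=lambda rng: rng.normal(0.0, 1.0),
    sample_likelihood=lambda y, rng: rng.normal(y, 1.0),
    dist=lambda a, b: abs(a - b),
    tau=0.1,
    n_samples=500,
)
print(post.mean())  # close to 0.5, the exact posterior mean in this toy case
```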
Motivating example 2: simple Gaussian case
• p(x, y) is N((0, 1_d^⊤)^⊤, V), with V a randomly generated covariance
Posterior mean on x: ABC vs the kernel approach
[Figure: CPU time vs mean squared error (6 dim.) for KBI, COND, and ABC; points annotated with sample sizes]
Bayes again
Bayes rule:
$$P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$$
• P(x|y) is the likelihood
• π is the prior
How would this look with kernel embeddings?
Define an RKHS G on Y with feature map ψ_y and kernel l(y, ·).
We need a conditional mean embedding: for all g ∈ G,
$$E_{Y|x^*}\, g(Y) = \langle g, \mu_{P(y|x^*)} \rangle_{\mathcal{G}}$$
This will be obtained by RKHS-valued ridge regression.
Ridge regression and the conditional feature mean
Ridge regression from X := R^d to a finite vector output Y := R^{d′} (these could be d′ nonlinear features of y):
Define training data
$$X = \begin{bmatrix} x_1 & \ldots & x_m \end{bmatrix} \in \mathbb{R}^{d \times m}, \qquad Y = \begin{bmatrix} y_1 & \ldots & y_m \end{bmatrix} \in \mathbb{R}^{d' \times m}$$
Solve
$$A = \arg\min_{A \in \mathbb{R}^{d' \times d}} \left( \|Y - AX\|^2 + \lambda \|A\|^2_{HS} \right),$$
where
$$\|A\|^2_{HS} = \mathrm{tr}(A^\top A) = \sum_{i=1}^{\min\{d, d'\}} \gamma_{A,i}^2$$
(the γ_{A,i} are the singular values of A).
Solution: $A = C_{YX} (C_{XX} + m\lambda I)^{-1}$
Ridge regression and the conditional feature mean
Prediction at a new point x:
$$y^* = Ax = C_{YX} (C_{XX} + m\lambda I)^{-1} x = \sum_{i=1}^m \beta_i(x)\, y_i$$
where
$$\beta(x) = (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1, x) & \ldots & k(x_m, x) \end{bmatrix}^\top$$
and K := X^⊤X, k(x₁, x) = x₁^⊤x.
What if we do everything in kernel space?
Ridge regression and the conditional feature mean
Recall our setup:
• Given training pairs: (x_i, y_i) ∼ P_XY
• F on X with feature map ϕ_x and kernel k(x, ·)
• G on Y with feature map ψ_y and kernel l(y, ·)
We define the covariance between feature maps:
$$C_{XX} = E_X (\varphi_X \otimes \varphi_X), \qquad C_{XY} = E_{XY} (\varphi_X \otimes \psi_Y)$$
and matrices of feature-mapped training data
$$X = \begin{bmatrix} \varphi_{x_1} & \ldots & \varphi_{x_m} \end{bmatrix}, \qquad Y := \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix}$$
Ridge regression and the conditional feature mean
Objective: [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]
$$A = \arg\min_{A \in HS(\mathcal{F}, \mathcal{G})} \left( E_{XY} \|Y - AX\|^2_{\mathcal{G}} + \lambda \|A\|^2_{HS} \right), \qquad \|A\|^2_{HS} = \sum_{i=1}^{\infty} \gamma_{A,i}^2$$
Solution, same as the vector case:
$$A = C_{YX} (C_{XX} + m\lambda I)^{-1},$$
Prediction at a new x using kernels:
$$A \varphi_x = \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix} (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1, x) & \ldots & k(x_m, x) \end{bmatrix}^\top = \sum_{i=1}^m \beta_i(x)\, \psi_{y_i}$$
where K_{ij} = k(x_i, x_j)
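Illustration (not part of the original slides): the weights β(x) computed from data; a Gram-matrix kernel helper like the earlier gaussian_kernel sketch is assumed. For any g ∈ G, E[g(Y)|X = x] is then approximated by Σᵢ βᵢ(x) g(yᵢ).

```python
import numpy as np

def conditional_mean_weights(X_train, x, kernel, lam):
    """beta(x) = (K + lam*m*I)^{-1} [k(x_1, x), ..., k(x_m, x)]^T, so that
    the conditional mean embedding is mu_{Y|x} = sum_i beta_i(x) psi_{y_i}."""
    m = len(X_train)
    K = kernel(X_train, X_train)
    kx = kernel(X_train, x[None, :]).ravel()
    return np.linalg.solve(K + lam * m * np.eye(m), kx)

# E.g. estimating E[Y | X = x] itself (taking g to be each coordinate of y):
# y_hat = conditional_mean_weights(X_tr, x, gaussian_kernel, 1e-3) @ Y_tr
```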
Ridge regression and the conditional feature mean
How is the loss ‖Y − AX‖²_G relevant to a conditional expectation E_{Y|x} g(Y)? Define [Song et al. (2009), Grunewalder et al. (2013)]:
$$\mu_{Y|x} := A \varphi_x$$
We need A to have the property
$$E_{Y|x}\, g(Y) \approx \langle g, \mu_{Y|x} \rangle_{\mathcal{G}} = \langle g, A\varphi_x \rangle_{\mathcal{G}} = \langle A^* g, \varphi_x \rangle_{\mathcal{F}} = (A^* g)(x)$$
Natural risk function for the conditional mean:
$$L(A, P_{XY}) := \sup_{\|g\| \leq 1} E_X \Bigl[ \underbrace{\bigl(E_{Y|X}\, g(Y)\bigr)}_{\text{Target}}(X) - \underbrace{(A^* g)}_{\text{Estimator}}(X) \Bigr]^2$$
Ridge regression and the conditional feature mean
The squared loss risk provides an upper bound on the natural risk:
$$L(A, P_{XY}) \leq E_{XY} \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$$
Proof: Jensen and Cauchy-Schwarz.
$$\begin{aligned} L(A, P_{XY}) &:= \sup_{\|g\| \leq 1} E_X \left[ \left(E_{Y|X}\, g(Y)\right)(X) - (A^* g)(X) \right]^2 \\ &\leq E_{XY} \sup_{\|g\| \leq 1} \left[ g(Y) - (A^* g)(X) \right]^2 \\ &= E_{XY} \sup_{\|g\| \leq 1} \left[ \langle g, \psi_Y \rangle_{\mathcal{G}} - \langle g, A\varphi_X \rangle_{\mathcal{G}} \right]^2 \\ &= E_{XY} \sup_{\|g\| \leq 1} \langle g, \psi_Y - A\varphi_X \rangle^2_{\mathcal{G}} \\ &\leq E_{XY} \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}} \end{aligned}$$
If we assume E_Y[g(Y)|X = x] ∈ F, then the upper bound is tight (next slide).
Conditions for ridge regression = conditional mean
The conditional mean is obtained by ridge regression when E_Y[g(Y)|X = x] ∈ F:
Given a function g ∈ G, assume E_{Y|X}[g(Y)|X = ·] ∈ F. Then
$$C_{XX}\, E_{Y|X}[g(Y)|X = \cdot] = C_{XY}\, g.$$
Why this is useful:
$$E_{Y|X}[g(Y)|X = x] = \langle E_{Y|X}[g(Y)|X = \cdot], \varphi_x \rangle_{\mathcal{F}} = \langle C_{XX}^{-1} C_{XY} g, \varphi_x \rangle_{\mathcal{F}} = \langle g, \underbrace{C_{YX} C_{XX}^{-1}}_{\text{regression}} \varphi_x \rangle_{\mathcal{G}}$$
Proof [Fukumizu et al., 2004]: for all f ∈ F, by the definition of C_XX,
$$\left\langle f, C_{XX}\, E_{Y|X}[g(Y)|X = \cdot] \right\rangle_{\mathcal{F}} = \mathrm{cov}\left( f,\, E_{Y|X}[g(Y)|X = \cdot] \right) = E_X\left( f(X)\, E_{Y|X}[g(Y)|X] \right) = E_{XY}\left( f(X)\, g(Y) \right) = \langle f, C_{XY}\, g \rangle,$$
by the definition of C_XY.
Kernel Bayes’ law
• Prior: Y ∼ π(y)
• Likelihood: (X|y) ∼ P(x|y), with some joint P(x, y)
• Joint distribution: Q(x, y) = P(x|y) π(y)
Warning: Q ≠ P; this is a change of measure from P(y) to π(y)
• Marginal for x:
$$Q(x) := \int P(x|y)\, \pi(y)\, dy.$$
• Bayes' law:
$$Q(y|x) = \frac{P(x|y)\, \pi(y)}{Q(x)}$$
Kernel Bayes’ law
• Posterior embedding via the usual conditional update:
$$\mu_{Q(y|x)} = C_{Q(y,x)}\, C^{-1}_{Q(x,x)}\, \varphi_x.$$
• Given the mean embedding of the prior, µ_{π(y)}:
• Define the marginal covariance:
$$C_{Q(x,x)} = \int (\varphi_x \otimes \varphi_x)\, P(x|y)\, \pi(y)\, dx\, dy = C_{(xx)y}\, C_{yy}^{-1}\, \mu_{\pi(y)}$$
• Define the cross-covariance:
$$C_{Q(y,x)} = \int (\psi_y \otimes \varphi_x)\, P(x|y)\, \pi(y)\, dx\, dy = C_{(yx)y}\, C_{yy}^{-1}\, \mu_{\pi(y)}.$$
Kernel Bayes’ law: consistency result
• How to compute the posterior expectation from data?
• Given samples: {(x_i, y_i)}_{i=1}^n from P_xy, and {u_j}_{j=1}^n from the prior π.
• Want to compute E[g(Y)|X = x] for g in G
• For any x ∈ X,
$$\left| \mathbf{g}_y^\top R_{Y|X}\, \mathbf{k}_X(x) - E[g(Y)|X = x] \right| = O_p\left(n^{-4/27}\right), \quad (n \to \infty),$$
where
– $\mathbf{g}_y = (g(y_1), \ldots, g(y_n))^\top \in \mathbb{R}^n$,
– $\mathbf{k}_X(x) = (k(x_1, x), \ldots, k(x_n, x))^\top \in \mathbb{R}^n$,
– $R_{Y|X}$ is learned from the samples, and contains the u_j.
Smoothness assumptions:
• $\pi / p_Y \in \mathcal{R}(C_{YY}^{1/2})$, where p_Y is the p.d.f. of P_Y,
• $E[g(Y)|X = \cdot] \in \mathcal{R}(C^2_{Q(xx)})$.
Experiment: Kernel Bayes’ law vs EKF
• Compare with the extended Kalman filter (EKF) on a camera orientation task
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200)
• 1800 training frames, the remainder for test
• Gaussian noise added to Y_t
Average MSE and standard errors (10 runs):

            | KBR (Gauss)    | KBR (Tr)       | Kalman (9 dim.) | Kalman (Quat.)
σ² = 10⁻⁴   | 0.210 ± 0.015  | 0.146 ± 0.003  | 1.980 ± 0.083   | 0.557 ± 0.023
σ² = 10⁻³   | 0.222 ± 0.009  | 0.210 ± 0.008  | 1.935 ± 0.064   | 0.541 ± 0.022
Co-authors
• From UCL:
– Kacper Chwialkowski
– Heiko Strathmann
• External:
– Wicher Bergsma, LSE
– Karsten Borgwardt, ETH
Zurich
– Kenji Fukumizu, ISM
– Bernhard Schoelkopf, MPI
– Alexander Smola,
CMU/Google
– Le Song, Georgia Tech
– Dino Sejdinovic, Oxford
– Bharath Sriperumbudur,
Penn State
Questions?
Energy Distance and the MMD
Energy distance and MMD
Distance between probability distributions:
Energy distance [Baringhaus and Franz, 2004, Szekely and Rizzo, 2004, 2005]:
$$D_E(P, Q) = 2 E_{P,Q}\|X - Y\| - E_P\|X - X'\| - E_Q\|Y - Y'\|$$
Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012a]:
$$\mathrm{MMD}^2(P, Q; F) = E_P k(X, X') + E_Q k(Y, Y') - 2 E_{P,Q} k(X, Y)$$
Energy distance is MMD with a particular kernel!
Distance covariance and HSIC
Distance covariance [Szekely and Rizzo, 2009, Szekely et al., 2007]:
$$V^2(X, Y) = E_{XY} E_{X'Y'}\left[ \|X - X'\| \|Y - Y'\| \right] + E_X E_{X'}\|X - X'\|\; E_Y E_{Y'}\|Y - Y'\| - 2 E_{XY}\left[ E_{X'}\|X - X'\|\; E_{Y'}\|Y - Y'\| \right]$$
Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]:
Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) = E_{XY} E_{X'Y'}\, k(X, X')\, l(Y, Y') + E_X E_{X'}\, k(X, X')\; E_Y E_{Y'}\, l(Y, Y') - 2 E_{X'Y'}\left[ E_X k(X, X')\; E_Y l(Y, Y') \right].$$
Distance covariance is HSIC with particular kernels!
Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : X × X → R be a semimetric on X. Let z₀ ∈ X, and denote
$$k_\rho(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z').$$
Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type.
Call k_ρ a distance-induced kernel.
Special case: Z ⊆ R^d and ρ_q(z, z') = ‖z − z'‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.
Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
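Illustration (not part of the original slides): constructing a distance-induced kernel. Plugging it (with q = 1) into the earlier MMD estimator recovers the energy distance, independently of the choice of z₀.

```python
import numpy as np

def distance_induced_kernel(Z1, Z2, z0, q=1.0):
    """k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z'), with
    rho_q(z, z') = ||z - z'||^q (negative type for 0 < q <= 2)."""
    d1 = np.linalg.norm(Z1 - z0, axis=1) ** q            # rho(z, z0)
    d2 = np.linalg.norm(Z2 - z0, axis=1) ** q            # rho(z', z0)
    d12 = np.linalg.norm(Z1[:, None, :] - Z2[None, :, :], axis=2) ** q
    return d1[:, None] + d2[None, :] - d12
```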
Two-sample testing benchmark
Two-sample testing example in 1-D:
[Figure: a density P(X) compared against three alternative densities Q(X)]
Two-sample test, MMD with distance kernel
More powerful tests are obtained on this problem when q ≠ 1 (the exponent of the distance).
Key:
• Gaussian kernel
• q = 1
• Best: q = 1/3
• Worst: q = 2
Selected references
Characteristic kernels and mean embeddings:
• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space
embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.
Two-sample, independence, conditional independence tests:
• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of
independence. NIPS
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample
test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR
Energy distance, relation to kernel distances
• Sejdinovic, D., Sriperumbudur, B., Gretton, A.,, Fukumizu, K., (2013). Equivalence of distance-based and
rkhs-based statistics in hypothesis testing. Annals of Statistics.
Three way interaction
• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
Selected references (continued)
Conditional mean embedding, RKHS-valued regression:
• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V., (2003). Kernel Dependency
Estimation, NIPS.
• Micchelli, C., and Pontil, M., (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm.
Foundations of Computational Mathematics.
• Song, L., and Huang, J., and Smola, A., Fukumizu, K., (2009). Hilbert Space Embeddings of Conditional
Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean
embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.
Kernel Bayes rule:
• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified
kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite
kernels, JMLR
References

L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.

A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

A. Gretton, O. Bousquet, A. J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.

A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Scholkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012a.

A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012b.

T. Read and N. Cressie. Goodness-Of-Fit Statistics for Discrete Multivariate Analysis. Springer-Verlag, New York, 1988.

A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.

G. Szekely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

G. Szekely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

G. Szekely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.

G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

K. Zhang, J. Peters, D. Janzing, and B. Scholkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.