Kernel nonparametric tests of homogeneity, independence, and multi-variable interaction
Imperial, 2015
Arthur Gretton, Gatsby Unit, CSML, UCL


Page 1:

Kernel nonparametric tests of homogeneity, independence, and multi-variable interaction

Imperial, 2015

Arthur Gretton

Gatsby Unit, CSML, UCL

Page 2:

Motivating example: detect differences in AM signals

[Plots: samples from P and samples from Q.]

Page 3:

Case of discrete domains

• How do you compare distributions. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

Page 4:

Case of discrete domains

• How do you compare distributions. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

X1: Now disturbing reports out of Newfoundland show that the fragile snow crab industry is in serious decline. First the west coast salmon, the east coast salmon and the cod, and now the snow crabs off Newfoundland.

Y1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

X2: To my pleasant surprise he responded that he had personally visited those wharves and that he had already announced money to fix them. What wharves did the minister visit in my riding and how much additional funding is he going to provide for Delaps Cove, Hampton, Port Lorne, · · ·

Y2: On the grain transportation system we have had the Estey report and the Kroeger report. We could go on and on. Recently programs have been announced over and over by the government such as money for the disaster in agriculture on the prairies and across Canada. · · ·

Question: is PX = PY?

Are the pink extracts from the same distribution as the gray ones?

Page 5:

Detecting statistical dependence

• How do you detect dependence. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

Page 6:

Detecting statistical dependence

• How do you detect dependence. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

Page 7:

Detecting statistical dependence

• How do you detect dependence. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

[2×2 contingency table of empirical probabilities shown here; the row/column labels and values did not survive text extraction.]

Page 8:

Detecting statistical dependence

• How do you detect dependence. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

[2×2 contingency table of empirical probabilities shown here; the row/column labels and values did not survive text extraction.]

Page 9:

Detecting statistical dependence, discrete domain

• How do you detect dependence. . .

• . . .in a discrete domain? [Read and Cressie, 1988]

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development. · · ·

Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants. · · ·

Question: is PXY = PX PY?

Are the French text extracts translations of the English ones?

Page 10:

Detecting statistical dependence, continuous domain

• How do you detect dependence. . .

• . . .in a continuous domain?

[Scatter plot: a sample from PXY on axes X and Y. Which is it: the dependent PXY, or the independent PXY = PX PY? Both candidate densities are shown alongside.]

Page 11:

Detecting statistical dependence, continuous domain

• How do you detect dependence. . .

• . . .in a continuous domain?

[Scatter plot: a sample from PXY, shown next to the discretized empirical PXY and the discretized empirical PX PY.]

Page 12:

Detecting statistical dependence, continuous domain

• How do you detect dependence. . .

• . . .in a continuous domain?

[Scatter plot: a sample from PXY, shown next to the discretized empirical PXY and the discretized empirical PX PY.]

Page 13:

Detecting statistical dependence, continuous domain

• How do you detect dependence. . .

• . . .in a continuous domain?

• Problem: this fails even in “low” dimensions! [NIPS07a, ALT08]

– X and Y in R⁴, statistic = power divergence, 1024 samples: dependence detected in 0 of 500 cases

• Too few points per bin

Page 14:

Outline

• An RKHS metric on the space of probability measures

– Distance between means in space of features (RKHS)

– Nonparametric two-sample test...

– ...for (almost!) any data type: strings, images, graphs, groups

(rotation matrices), semigroups (histograms),...

– Two-sample testing for big data, how to choose the kernel [NIPS12]

• Dependence detection

– Distance between joint Pxy and product of marginals PxPy

– Testing three-way interactions [NIPS13]

• Nonparametric Bayesian inference

Page 15:

Feature mean difference

• Simple example: 2 Gaussians with different means

• Answer: t-test

[Plot: probability densities of two Gaussians PX and QX with different means.]

Page 16:

Feature mean difference

• Two Gaussians with same means, different variance

• Idea: look at difference in means of features of the RVs

• In Gaussian case: second order features of form ϕ(x) = x2

[Plot: probability densities of two Gaussians PX and QX with the same mean but different variances.]

Page 17:

Feature mean difference

• Two Gaussians with same means, different variance

• Idea: look at difference in means of features of the RVs

• In Gaussian case: second order features of form ϕx = x2

[Plots: densities of the two Gaussians PX and QX (same mean, different variances), and the densities of the feature X² under each, which now differ in mean.]

Page 18:

Feature mean difference

• Gaussian and Laplace distributions

• Same mean and same variance

• Difference in means using higher order features

[Plot: Gaussian and Laplace densities PX and QX with the same mean and the same variance.]

. . . so let’s explore feature representations!

Page 19:

Kernels: similarity between features

Kernel:

• We have two objects x and x′ from a set X (documents, images, . . .).

How similar are they?

Page 20:

Kernels: similarity between features

Kernel:

• We have two objects x and x′ from a set X (documents, images, . . .).

How similar are they?

• Define features of objects:

– ϕx are features of x,

– ϕx′ are features of x′

• A kernel is the dot product between these features:

k(x, x′) := 〈ϕx, ϕx′〉F .

Page 21:

Good features make problems easy: XOR example

[Scatter plot of the XOR data in coordinates (x1, x2): the red and blue classes occupy alternating quadrants.]

• No linear classifier separates red from blue

• Map points to higher dimensional feature space:

ϕx = [x1 x2 x1x2]⊤ ∈ R³

Page 22:

Good features make problems easy: XOR example

Example: a three-dimensional space of features of points in R²:

ϕ : R² → R³,   x = [x1 x2]⊤ ↦ ϕx = [x1 x2 x1x2]⊤,

with kernel

k(x, y) = [x1 x2 x1x2] [y1 y2 y1y2]⊤

(the standard inner product in R³ between features). Denote this feature space by F.
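As a quick sanity check, this explicit feature-space kernel can be computed two ways: via the feature map, or via the expanded inner product. A minimal sketch in Python (the function names are illustrative, not from the slides):

```python
import numpy as np

def phi(x):
    """Feature map for the XOR example: phi_x = (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """Kernel defined as the standard inner product between features."""
    return float(phi(x) @ phi(y))

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# agrees with the expanded form x1*y1 + x2*y2 + (x1*x2)*(y1*y2)
assert k(x, y) == x[0]*y[0] + x[1]*y[1] + (x[0]*x[1]) * (y[0]*y[1])
```

In this feature space a linear classifier on ϕx can separate the XOR classes, since the third coordinate x1x2 has opposite sign on the two classes.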

Page 23:

Functions of features

Define a linear function of the inputs x1, x2, and their product x1x2:

f(x) = f1 x1 + f2 x2 + f3 x1x2.

f is in a space of functions mapping from X = R² to R. Equivalent representation for f:

f = [f1 f2 f3]⊤.

f refers to the function as an object (here, a vector in R³); f(x) ∈ R is the function evaluated at a point (a real number).

Page 24:

Functions of features

Define a linear function of the inputs x1, x2, and their product x1x2:

f(x) = f1 x1 + f2 x2 + f3 x1x2.

f is in a space of functions mapping from X = R² to R. Equivalent representation for f:

f = [f1 f2 f3]⊤.

f refers to the function as an object (here, a vector in R³); f(x) ∈ R is the function evaluated at a point (a real number).

f(x) = f⊤ϕx = 〈f, ϕx〉F

Evaluation of f at x is an inner product in feature space (here, the standard inner product in R³).

F is a space of functions mapping R² to R.

Page 25:

The reproducing property

This example illustrates the two defining features of an RKHS:

• The reproducing property: ∀x ∈ X , ∀f ∈ F , 〈f, ϕx〉F = f(x)

• In particular, for any x, y ∈ X ,

k(x, y) = 〈ϕx, ϕy〉F .

Note: the feature map of every point is in the feature space: ∀x ∈ X , ϕx ∈ F

Page 26:

Infinite dimensional feature space

Reproducing property for function with Gaussian kernel:

f(x) := Σ_{i=1}^m αi k(xi, x) = 〈 Σ_{i=1}^m αi ϕxi , ϕx 〉F .

[Plot: an RKHS function f(x) built from Gaussian kernels centred at the sample points.]

Page 27:

Infinite dimensional feature space

Reproducing property for function with Gaussian kernel:

f(x) := Σ_{i=1}^m αi k(xi, x) = 〈 Σ_{i=1}^m αi ϕxi , ϕx 〉F .

[Plot: an RKHS function f(x) built from Gaussian kernels centred at the sample points.]

• What do the features ϕx look like? (There are infinitely many of them!)

• What do these features have to do with smoothness?

Page 28:

Infinite dimensional feature space

The Gaussian kernel, k(x, x′) = exp( −‖x − x′‖² / (2σ²) ), has the eigenexpansion

λk ∝ b^k (b < 1),   ek(x) ∝ exp(−(c − a)x²) Hk(x√(2c)),

where a, b, c are functions of σ, and Hk is the kth-order Hermite polynomial. Then

k(x, x′) = Σ_{i=1}^∞ λi ei(x) ei(x′)
         = Σ_{i=1}^∞ ( √λi ei(x) ) ( √λi ei(x′) )
         = Σ_{i=1}^∞ ϕx⁽ⁱ⁾ ϕx′⁽ⁱ⁾

with features ϕx⁽ⁱ⁾ := √λi ei(x).

Page 29:

Infinite dimensional feature space

Example RKHS function, Gaussian kernel:

f(x) := Σ_{i=1}^m αi k(xi, x) = Σ_{i=1}^m αi Σ_{j=1}^∞ λj ej(xi) ej(x) = Σ_{j=1}^∞ fj [ √λj ej(x) ],

where fj = Σ_{i=1}^m αi √λj ej(xi).

[Plot: the RKHS function f(x) built from Gaussian kernels centred at the sample points.]

NOTE that this enforces smoothing: the λj decay as the ej become rougher, so the fj decay, since ‖f‖²F = Σ_j fj² < ∞.

Page 30:

Probabilities in feature space: the mean trick

The kernel trick

• Given x ∈ X for some set X, define the feature map ϕx ∈ F,

  ϕx = [ . . . √λi ei(x) . . . ] ∈ ℓ²

• For positive definite k(x, x′),

  k(x, x′) = 〈ϕx, ϕx′〉F

• The kernel trick: ∀f ∈ F,

  f(x) = 〈f, ϕx〉F

Page 31:

Probabilities in feature space: the mean trick

The kernel trick

• Given x ∈ X for some set X, define the feature map ϕx ∈ F,

  ϕx = [ . . . √λi ei(x) . . . ] ∈ ℓ²

• For positive definite k(x, x′),

  k(x, x′) = 〈ϕx, ϕx′〉F

• The kernel trick: ∀f ∈ F,

  f(x) = 〈f, ϕx〉F

The mean trick

• Given P a Borel probability measure on X, define the feature map µP ∈ F,

  µP = [ . . . √λi EP[ei(X)] . . . ] ∈ ℓ²

• For positive definite k(x, x′),

  EP,Q k(X, Y) = 〈µP, µQ〉F  for X ∼ P and Y ∼ Q.

• The mean trick (we call µP a mean/distribution embedding):

  EP(f(X)) =: 〈µP, f〉F

Page 32:

Feature embeddings of probabilities

The kernel trick:  f(x) = 〈f, ϕx〉F

The mean trick:  EP(f(X)) = 〈f, µP〉F

Empirical mean embedding:

  µ̂P = m⁻¹ Σ_{i=1}^m ϕxi ,   xi i.i.d. ∼ P

µP gives you expectations of all RKHS functions.

When k is characteristic, the embedding µP uniquely determines P (holds e.g. for the Gaussian and Laplace kernels, . . .).
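To make the mean trick concrete with a finite-dimensional feature map (a toy choice for illustration, not the infinite Gaussian-kernel features): with ϕx = (x, x²), the empirical mean embedding stores the first two empirical moments, and ⟨f, µ̂P⟩ estimates EP f(X) for any f in the span of those features.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Toy finite-dimensional feature map (illustrative): phi_x = (x, x^2)."""
    return np.stack([x, x**2])

# samples from P = N(1, 2^2)
xs = rng.normal(loc=1.0, scale=2.0, size=200_000)

# empirical mean embedding: the average of the feature vectors
mu_hat = phi(xs).mean(axis=1)

# f(x) = 3x - x^2 lives in this feature space as f = (3, -1);
# <f, mu_hat> estimates E_P f(X) = 3*E[X] - E[X^2] = 3*1 - (1 + 4) = -2
f = np.array([3.0, -1.0])
estimate = f @ mu_hat
```

With a characteristic kernel the same idea works for all RKHS functions at once, which is what the slides use below.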

Page 33:

The maximum mean discrepancy

The maximum mean discrepancy is the squared distance between feature means:

MMD(P,Q) = ‖µP − µQ‖²F = 〈µP, µP〉F + 〈µQ, µQ〉F − 2〈µP, µQ〉F
         = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Page 34:

The maximum mean discrepancy

The maximum mean discrepancy is the squared distance between feature means:

MMD(P,Q) = ‖µP − µQ‖²F = 〈µP, µP〉F + 〈µQ, µQ〉F − 2〈µP, µQ〉F
         = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

An unbiased empirical estimate: for {xi}_{i=1}^m ∼ P and {yi}_{i=1}^m ∼ Q,

MMD̂² = [1/(m(m−1))] Σ_{i=1}^m Σ_{j≠i} k(xi, xj) + [1/(m(m−1))] Σ_{i=1}^m Σ_{j≠i} k(yi, yj) − (2/m²) Σ_{i=1}^m Σ_{j=1}^m k(xi, yj)
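The unbiased estimate above is straightforward to vectorize. A sketch with a Gaussian kernel (the kernel choice and bandwidth are assumptions for the example, not fixed by the slides):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel k(a, b) = exp(-||a-b||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 for equal-size samples X ~ P, Y ~ Q."""
    m = len(X)
    Kxx, Kyy = gaussian_gram(X, X, sigma), gaussian_gram(Y, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)  # drop the i = j terms in the within-sample sums
    np.fill_diagonal(Kyy, 0.0)
    Kxy = gaussian_gram(X, Y, sigma)
    return (Kxx.sum() + Kyy.sum()) / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.0, 1.0, size=(500, 2))  # same distribution: estimate near 0
Z = rng.normal(2.0, 1.0, size=(500, 2))  # shifted mean: estimate clearly > 0
```

Because the estimator is unbiased, it fluctuates around 0 (and can go slightly negative) when P = Q, which is exactly the regime the null distribution on the later slides describes.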

Page 35:

Maximum mean discrepancy: main idea

KP,P = [ k(Xi, Xj) ]_{ij}, the n×n matrix with entries k(Xi, Xj)

KQ,Q = [ k(Yi, Yj) ]_{ij}

KP,Q = [ k(Xi, Yj) ]_{ij}

[Each matrix is displayed as a heat map on the slide.]

Page 36:

Maximum mean discrepancy: main idea

P ≠ Q ⇒ the pooled kernel matrix [ KP,P  KP,Q ; KQ,P  KQ,Q ] shows a clear block structure

P = Q ⇒ no block structure

[Heat maps of the pooled kernel matrix are shown for each case.]

Page 37:

Statistical hypothesis testing

Page 38:

Statistical test using MMD

• Two hypotheses:

– H0: null hypothesis (P = Q)

– H1: alternative hypothesis (P 6= Q)

Page 39:

Statistical test using MMD

• Two hypotheses:

– H0: null hypothesis (P = Q)

– H1: alternative hypothesis (P 6= Q)

• Observe samples x := {x1, . . . , xm} from P and y from Q

• If the empirical MMD² is

  – “far from zero”: reject H0
  – “close to zero”: accept H0

Page 40:

Behavior of empirical MMD

Vn := MMD̂² = K̄P,P + K̄Q,Q − 2K̄P,Q , where each K̄ denotes the average of the entries of the corresponding kernel matrix block.

Page 41:

Statistical test using MMD

• When P = Q, the U-statistic is degenerate [Gretton et al., 2007, 2012a]

• Its distribution is

  m MMD̂² ∼ Σ_{l=1}^∞ λl [ zl² − 2 ]

• where

  – zl ∼ N(0, 2) i.i.d.
  – ∫_X k̃(x, x′) ψi(x) dP(x) = λi ψi(x′), with k̃ the centred kernel

[Plot: empirical density of the MMD statistic under H0.]

Page 42:

Approximate distribution of MMD via permutation

w = (−1, 1, 1, · · · , −1, −1, −1)⊤

[ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ [ww⊤]

[Heat maps: the sign pattern ww⊤ is applied entrywise (⊙) to the pooled kernel matrix.]

Page 43:

Approximate distribution of MMD via permutation

Vn,p := MMD̂²p, the statistic recomputed under the permutation encoded by w: the entrywise product of the pooled kernel matrix [ KP,P  KP,Q ; KQ,P  KQ,Q ] with ww⊤, summed and scaled as before.
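The permutation idea is generic: pool the two samples, reshuffle the labels (the role of the sign vector w), and recompute the statistic to build the null distribution. A minimal sketch, using a simple placeholder statistic; the MMD estimate from the earlier slides would slot in the same way:

```python
import numpy as np

def permutation_pvalue(X, Y, statistic, n_perm=200, seed=0):
    """Permutation null: relabel the pooled sample and recompute the statistic."""
    rng = np.random.default_rng(seed)
    m = len(X)
    observed = statistic(X, Y)
    pooled = np.concatenate([X, Y])
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if statistic(pooled[idx[:m]], pooled[idx[m:]]) >= observed:
            count += 1
    # add-one correction keeps the p-value valid and strictly positive
    return (count + 1) / (n_perm + 1)

# placeholder statistic (an assumption of this sketch, not the slides' MMD):
# squared difference of sample means
stat = lambda A, B: (A.mean() - B.mean()) ** 2

rng = np.random.default_rng(1)
p = permutation_pvalue(rng.normal(0, 1, 200), rng.normal(1, 1, 200), stat)
```

Reject H0 at level α when p < α; the test is exact for exchangeable data regardless of the statistic used.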

Page 44:

Experiment: amplitude modulated signals

[Plots: samples from P and samples from Q.]

Page 45:

Results: AM signals

[Plot: Type II error vs added noise for the kernel selection strategies max ratio, opt, med, l2, and maxmmd.]

m = 10,000 and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.

Linear-cost test and kernel choice approach from Gretton et al. [2012b]

Page 46:

Kernel two-sample tests for big data, optimal kernel choice

Page 47:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Page 48:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:

The earlier estimate (quadratic time):

  ÊP k(x, x′) = [1/(m(m−1))] Σ_{i=1}^m Σ_{j≠i} k(xi, xj)

Page 49:

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:

The earlier estimate (quadratic time):

  ÊP k(x, x′) = [1/(m(m−1))] Σ_{i=1}^m Σ_{j≠i} k(xi, xj)

New, linear time estimate:

  ÊP k(x, x′) = (2/m) [ k(x1, x2) + k(x3, x4) + . . . ] = (2/m) Σ_{i=1}^{m/2} k(x2i−1, x2i)

Page 50:

Linear time MMD

Shorter expression with explicit k dependence:

MMD² =: ηk(p, q) = E_{x,x′,y,y′} hk(x, x′, y, y′) =: Ev hk(v),

where

hk(x, x′, y, y′) = k(x, x′) + k(y, y′)− k(x, y′)− k(x′, y),

and v := [x, x′, y, y′].

Page 51:

Linear time MMD

Shorter expression with explicit k dependence:

MMD² =: ηk(p, q) = E_{x,x′,y,y′} hk(x, x′, y, y′) =: Ev hk(v),

where

hk(x, x′, y, y′) = k(x, x′) + k(y, y′)− k(x, y′)− k(x′, y),

and v := [x, x′, y, y′].

The linear time estimate again:

  η̂k = (2/m) Σ_{i=1}^{m/2} hk(vi),

where vi := [x2i−1, x2i, y2i−1, y2i] and

  hk(vi) := k(x2i−1, x2i) + k(y2i−1, y2i) − k(x2i−1, y2i) − k(x2i, y2i−1)
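The linear-time estimate averages hk over disjoint sample blocks, so it needs a single pass and O(1) memory. A sketch (the Gaussian kernel and bandwidth are assumptions for the example):

```python
import numpy as np

def k_gauss(a, b, sigma=1.0):
    """Gaussian kernel evaluated pairwise on aligned rows of a and b."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma**2))

def mmd2_linear(X, Y, sigma=1.0):
    """Linear-time MMD^2: average of h_k over disjoint (x, x', y, y') blocks."""
    m = (min(len(X), len(Y)) // 2) * 2
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = (k_gauss(x1, x2, sigma) + k_gauss(y1, y2, sigma)
         - k_gauss(x1, y2, sigma) - k_gauss(x2, y1, sigma))
    return h.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(4000, 1))
Y = rng.normal(0.0, 1.0, size=(4000, 1))  # same distribution
Z = rng.normal(2.0, 1.0, size=(4000, 1))  # shifted mean
```

Each block contributes one i.i.d. evaluation of hk, which is what makes the CLT argument on the later slides go through.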

Page 52:

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

• Much higher variance for a given m, hence. . .

• . . .a much less powerful test for a given m

Page 53:

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

• Much higher variance for a given m, hence. . .

• . . .a much less powerful test for a given m

Advantages of the linear time MMD vs quadratic time MMD

• Very simple asymptotic null distribution (a Gaussian, vs an infinite weighted sum of χ²)

• Both test statistic and threshold computable in O(m), with storage O(1).

• Given unlimited data, a given Type II error can be attained with less computation

Page 54:

Asymptotics of linear time MMD

By the central limit theorem,

  m^{1/2} ( η̂k − ηk(p, q) ) →_D N(0, 2σk²)

• assuming 0 < E(hk²) < ∞ (true for bounded k)

• σk² = Ev hk²(v) − [ Ev hk(v) ]².

Page 55:

Hypothesis test

Hypothesis test of asymptotic level α: reject H0 when η̂k exceeds

  tk,α = m^{−1/2} σk √2 Φ^{−1}(1 − α),

where Φ^{−1} is the inverse CDF of N(0, 1).

[Plot: null distribution of the linear-time statistic η̂k; the threshold tk,α is the (1 − α) quantile, and the mass beyond it is the Type I error.]
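Given the hk values from the m/2 sample blocks, the asymptotic threshold is one line. A sketch using the standard-library normal quantile; estimating σk by the plug-in standard deviation of the hk values is an assumption of this sketch:

```python
import numpy as np
from statistics import NormalDist

def linear_mmd_threshold(h_values, alpha=0.05):
    """Asymptotic level-alpha threshold for the linear-time MMD statistic:
    t_{k,alpha} = m^{-1/2} * sigma_k * sqrt(2) * Phi^{-1}(1 - alpha),
    with sigma_k estimated from the h_k values over the m/2 sample blocks."""
    m = 2 * len(h_values)              # m samples per side give m/2 values of h_k
    sigma_k = float(np.std(h_values))  # plug-in estimate of sigma_k
    return m ** -0.5 * sigma_k * np.sqrt(2.0) * NormalDist().inv_cdf(1 - alpha)
```

Reject H0 when the linear-time estimate η̂k exceeds this threshold; both quantities cost O(m) time and O(1) storage, as the previous slide notes.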

Page 56:

Type II error

[Plot: null vs alternative distribution of η̂k; the Type II error is the mass of the alternative distribution falling below the test threshold, with the alternative centred near ηk(p, q).]

Page 57:

The best kernel: minimizes Type II error

A Type II error occurs when η̂k falls below the threshold tk,α even though ηk(p, q) > 0.

Prob. of a Type II error:

  P( η̂k < tk,α ) = Φ( Φ^{−1}(1 − α) − ηk(p, q) √m / (σk √2) )

where Φ is the Normal CDF.

Page 58:

The best kernel: minimizes Type II error

A Type II error occurs when η̂k falls below the threshold tk,α even though ηk(p, q) > 0.

Prob. of a Type II error:

  P( η̂k < tk,α ) = Φ( Φ^{−1}(1 − α) − ηk(p, q) √m / (σk √2) )

where Φ is the Normal CDF.

Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is

  k∗ = argmax_{k∈K} ηk(p, q) σk^{−1},

where K is the family of kernels under consideration.

Page 59:

Learning the best kernel in a family

Define the family of kernels as follows:

  K := { k : k = Σ_{u=1}^d βu ku,  ‖β‖₁ = D,  βu ≥ 0, ∀u ∈ {1, . . . , d} }.

Properties (given at least one βu > 0):

• all k ∈ K are valid kernels,

• if all ku are characteristic, then k is characteristic.
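The first property can be checked numerically: a non-negative combination of Gram matrices is again positive semidefinite. A small sketch (the base kernels and weights are arbitrary choices for the check):

```python
import numpy as np

def gaussian_gram(Z, sigma):
    """Gram matrix of a Gaussian kernel on the rows of Z."""
    sq = (np.sum(Z**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * Z @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))
betas, sigmas = [0.5, 1.5], [0.1, 1.0]  # beta_u >= 0, ||beta||_1 = D = 2

K = sum(b * gaussian_gram(Z, s) for b, s in zip(betas, sigmas))
min_eig = np.linalg.eigvalsh(K).min()   # stays (numerically) non-negative
```

Since each Gram matrix is PSD and the weights are non-negative, the smallest eigenvalue of the combined matrix cannot be negative beyond floating-point error.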

Page 60:

Test statistic

The squared MMD becomes

  ηk(p, q) = ‖µk(p) − µk(q)‖²_{Fk} = Σ_{u=1}^d βu ηu(p, q),

where ηu(p, q) := Ev hu(v).

Page 61:

Test statistic

The squared MMD becomes

  ηk(p, q) = ‖µk(p) − µk(q)‖²_{Fk} = Σ_{u=1}^d βu ηu(p, q),

where ηu(p, q) := Ev hu(v).

Denote:

• β = (β1, β2, . . . , βd)⊤ ∈ R^d,

• h = (h1, h2, . . . , hd)⊤ ∈ R^d, where
  hu(x, x′, y, y′) = ku(x, x′) + ku(y, y′) − ku(x, y′) − ku(x′, y)

• η = Ev(h) = (η1, η2, . . . , ηd)⊤ ∈ R^d.

Quantities for the test:

  ηk(p, q) = E(β⊤h) = β⊤η ,   σk² := β⊤ cov(h) β.

Page 62:

Optimization of the ratio ηk(p, q) σk^{−1}

Empirical test parameters:

  η̂k = β⊤η̂ ,   σ̂k,λ = √( β⊤(Q̂ + λm I) β ),

where Q̂ is the empirical estimate of cov(h).

Note: η̂k, σ̂k,λ are computed on training data, vs η̂k, σ̂k on the data to be tested (why?)

Page 63:

Optimization of the ratio ηk(p, q) σk^{−1}

Empirical test parameters:

  η̂k = β⊤η̂ ,   σ̂k,λ = √( β⊤(Q̂ + λm I) β ),

where Q̂ is the empirical estimate of cov(h).

Note: η̂k, σ̂k,λ are computed on training data, vs η̂k, σ̂k on the data to be tested (why?)

Objective:

  β∗ = argmax_{β⪰0} η̂k(p, q) σ̂k,λ^{−1} = argmax_{β⪰0} ( β⊤η̂ ) ( β⊤(Q̂ + λm I) β )^{−1/2} =: α(β; η̂, Q̂)

Page 64:

Optimization of the ratio ηk(p, q) σk^{−1}

Assume: η̂ has at least one positive entry.

Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0.

Thus: α(β∗; η̂, Q̂) > 0

Page 65:

Optimization of the ratio ηk(p, q) σk^{−1}

Assume: η̂ has at least one positive entry.

Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0.

Thus: α(β∗; η̂, Q̂) > 0

Solve the easier problem: β∗ = argmax_{β⪰0} α²(β; η̂, Q̂).

Equivalent quadratic program:

  min { β⊤(Q̂ + λm I) β : β⊤η̂ = 1, β ⪰ 0 }

Page 66:

Optimization of the ratio ηk(p, q) σk^{−1}

Assume: η̂ has at least one positive entry.

Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0.

Thus: α(β∗; η̂, Q̂) > 0

Solve the easier problem: β∗ = argmax_{β⪰0} α²(β; η̂, Q̂).

Equivalent quadratic program:

  min { β⊤(Q̂ + λm I) β : β⊤η̂ = 1, β ⪰ 0 }

What if η̂ has no positive entries?

Page 67:

Test procedure

1. Split the data into testing and training.

2. On the training data:

(a) Compute ηu for all ku ∈ K(b) If at least one ηu > 0, solve the QP to get β∗, else choose random

kernel from K3. On the test data:

(a) Compute ηk∗ using k∗ =∑d

u=1 β∗ku

(b) Compute test threshold tα,k∗ using σk∗

4. Reject null if ηk∗ > tα,k∗
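For concreteness, a sketch of the statistic being thresholded, assuming the linear-time MMD estimate η̂_k from earlier in the lecture, with h((x, x′), (y, y′)) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y) averaged over disjoint sample pairs. The Gaussian kernel and all values are illustrative.

```python
import numpy as np

def linear_mmd(x, y, kernel):
    """Linear-time MMD: average of h over disjoint pairs,
    h = k(x,x') + k(y,y') - k(x,y') - k(x',y)."""
    m = min(len(x), len(y)) // 2
    x1, x2 = x[:m], x[m:2 * m]
    y1, y2 = y[:m], y[m:2 * m]
    h = kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)
    return h.mean()

gauss = lambda a, b, s=1.0: np.exp(-(a - b) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)
same = linear_mmd(rng.standard_normal(2000), rng.standard_normal(2000), gauss)
diff = linear_mmd(rng.standard_normal(2000), rng.standard_normal(2000) + 3.0, gauss)
```

With p = q the statistic fluctuates around zero; with a mean shift it is clearly positive.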


Page 69

Convergence bounds

Assume a bounded kernel, and σ_k bounded away from 0.

If λ_m = Θ(m^{−1/3}) then

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3}).

Idea:

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |

  ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |

  ≤ (√d / (D √λ_m)) ( C₁ sup_{k∈K} |η̂_k − η_k| + C₂ sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| ) + C₃ D² λ_m

Page 70

Experiments

Page 71

Competing approaches

• Median heuristic

• Max. MMD: choose k_u ∈ K with the largest η̂_u

  – same as maximizing β⊤η̂ subject to ‖β‖₁ ≤ 1

• ℓ₂ statistic: maximize β⊤η̂ subject to ‖β‖₂ ≤ 1

• Cross validation on the training set

Also compare with:

• Single kernel that maximizes the ratio η_k(p, q) σ_k^{−1}


Page 73

Blobs: data

Difficult problems: lengthscale of the difference in distributions not the same

as that of the distributions.

We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data — samples from p (axes x1, x2) and from q (axes y1, y2)]

Ratio ε = 3.2 of largest to smallest eigenvalues of blobs in q.

Page 74

Blobs: results

[Figure: Type II error vs. ε ratio, for methods max ratio, opt, l2, maxmmd, xval, xvalc, med]

Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of blobs in q. Results are averaged over 617 trials.

Page 75

Blobs: results

[Same figure, highlighting the curve for: optimize the ratio η_k(p, q) σ_k^{−1}]

Page 76

Blobs: results

[Same figure, highlighting the curve for: maximize η_k(p, q) with β constraint]

Page 77

Blobs: results

[Same figure, highlighting the curve for: the median heuristic]


Page 79

Feature selection: data

Idea: no single best kernel.

Each of the k_u is univariate (along a single coordinate).

[Figure: selection data — samples from p and q, plotted in coordinates x1, x2]

Page 80

Feature selection: results

[Figure: feature selection — Type II error vs. dimension, for max ratio, opt, l2, maxmmd; single best kernel vs. linear combination]

m = 10,000, average over 5000 trials


Page 82

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

• ω_c: carrier frequency

• a = 0.2 is the signal scaling, l = 2 is the offset

Two amplitude modulated signals from same artist (in this case, Magnetic

Fields).

• Music sampled at 8 kHz (very low)

• Carrier frequency is 24 kHz

• AM signal observed at 120 kHz

• Samples are extracts of length N = 1000, approx. 0.01 s (very short).

• Total dataset size is 30,000 samples from each of p, q.
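The modulation above can be sketched as follows (ω_c = 2π f_c; the sampling grid, the 440 Hz toy signal, and the crude nearest-neighbour upsampling are illustrative assumptions, not the lecture's pipeline):

```python
import numpy as np

def am_modulate(s, fc, fs, a=0.2, l=2.0):
    """u(t) = sin(2*pi*fc*t) * (a*s(t) + l), sampled at rate fs."""
    t = np.arange(len(s)) / fs
    return np.sin(2 * np.pi * fc * t) * (a * s + l)

# toy "audio" at 8 kHz, upsampled 15x and modulated onto a 24 kHz carrier at 120 kHz
s = np.sin(2 * np.pi * 440 * np.arange(1000) / 8000.0)
u = am_modulate(np.repeat(s, 15), fc=24e3, fs=120e3)
```

Note the envelope bound |u(t)| ≤ a·max|s| + l, which is why the offset l keeps the envelope positive.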

Page 83

Amplitude modulated signals

[Figure: samples from P (left) and samples from Q (right)]

Page 84

Results: AM signals

[Figure: Type II error vs. added noise, for max ratio, opt, med, l2, maxmmd]

m = 10,000 (for training and test) and scaling a = 0.5. Averaged over 4124 trials. Gaussian noise added.

Page 85

Observations on kernel choice

• It is possible to choose the best kernel for a kernel two-sample test

• Kernel choice matters for “difficult” problems, where the distributions differ on a lengthscale different to that of the data.

• Ongoing work:

  – quadratic time statistic

  – avoid training/test split

Page 86

Testing Independence


Page 88

MMD for independence

• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05,

NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993] and [Szekely and Rizzo, 2009; Szekely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Diagram: the kernel on pairs is the product κ((x, y), (x′, y′)) = k(x, x′) · l(y, y′)]

Page 89

HSIC using expectations of kernels:

Define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then

HSIC(P_XY, P_X P_Y) = E_{XY} E_{X′Y′} k(x, x′) l(y, y′) + E_X E_{X′} k(x, x′) · E_Y E_{Y′} l(y, y′) − 2 E_{X′Y′}[ E_X k(x, x′) · E_Y l(y, y′) ].

Page 90

Experiment: dependence testing for translation

• Translation example: [NIPS07b]

Canadian Hansard

(agriculture)

• 5-line extracts, k-spectrum kernel (k = 10), 300 repetitions, sample size 10

• Empirical MMD(P_XY, P_X P_Y):  (1/n²) (HKH ◦ HLH)_{++}, computed from the kernel matrices K and L

• k-spectrum kernel: average Type II error 0 (α = 0.05)

• Bag of words kernel: average Type II error 0.18
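The empirical statistic (1/n²)(HKH ◦ HLH)_{++} above can be sketched directly; the Gaussian kernel and toy data here are illustrative:

```python
import numpy as np

def hsic(K, L):
    """(1/n^2) * sum_ij (HKH)_ij (HLH)_ij, with H = I - (1/n) 11^T
    (biased V-statistic estimate of HSIC)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return ((H @ K @ H) * (H @ L @ H)).sum() / n ** 2

gram = lambda v, s=1.0: np.exp(-(v[:, None] - v[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
dep = hsic(gram(x), gram(x + 0.1 * rng.standard_normal(200)))  # y depends on x
ind = hsic(gram(x), gram(rng.standard_normal(200)))            # y independent of x
```

For positive definite kernels the statistic is nonnegative, and it is close to zero under independence.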

Page 91

Lancaster (3-way) Interactions

Page 92

Detecting a higher order interaction

• How to detect V-structures with pairwise weak individual dependence?

[V-structure: X → Z ← Y]


Page 94

Detecting a higher order interaction

• How to detect V-structures with pairwise weak individual dependence?

• X ⊥⊥ Y , Y ⊥⊥ Z, X ⊥⊥ Z

[Figure: pairwise scatter plots — X vs Y, Y vs Z, X vs Z, and XY vs Z; V-structure X → Z ← Y]

• X, Y i.i.d. ∼ N(0, 1)

• Z | X, Y ∼ sign(XY) Exp(1/√2)

Faithfulness violated here
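A sketch generating this dataset (whether 1/√2 is the scale or the rate of the exponential is ambiguous in the slide; scale is assumed here):

```python
import numpy as np

def v_structure_sample(n, rng):
    """X, Y ~ N(0,1); Z | X, Y ~ sign(XY) * Exp(1/sqrt(2)) (scale assumed)."""
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=n)
    return x, y, z

rng = np.random.default_rng(0)
x, y, z = v_structure_sample(5000, rng)
corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
```

Pairwise correlations with Z are near zero, yet the product XY is strongly correlated with Z — exactly the pairwise-weak, jointly-strong dependence the slide describes.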


Page 96

V-structure Discovery

[V-structure: X → Z ← Y]

Assume X ⊥⊥ Y has been established. V-structure can then be detected by:

• Consistent CI test: H0 : X ⊥⊥ Y |Z [Fukumizu et al., 2008, Zhang et al., 2011], or

• Factorisation test: H₀ : (X,Y) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y,Z) ⊥⊥ X (multiple standard two-variable tests)

  – compute p-values for each of the marginal tests (Y,Z) ⊥⊥ X, (X,Z) ⊥⊥ Y, and (X,Y) ⊥⊥ Z

  – apply the Holm–Bonferroni (HB) sequentially rejective correction (Holm, 1979)
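The Holm–Bonferroni step can be sketched as follows (a standard construction, not specific to these slides):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Sequentially rejective correction: sort p-values ascending and
    reject while p_(i) <= alpha / (m - i), stopping at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

For the factorisation test, the three marginal p-values are fed in and the null is rejected only if all three marginal hypotheses are rejected.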

Page 97

V-structure Discovery (2)

• How to detect V-structures with pairwise weak (or nonexistent) dependence?

• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z

[Figure: pairwise scatter plots — X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1·Y1 vs Z1; V-structure X → Z ← Y]

• X1, Y1 i.i.d. ∼ N(0, 1)

• Z1 | X1, Y1 ∼ sign(X1 Y1) Exp(1/√2)

• X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, I_{p−1})

Faithfulness violated here

Page 98

V-structure Discovery (3)

[Figure: null acceptance rate (Type II error) vs. dimension, Dataset A — CI test X ⊥⊥ Y | Z and two-variable factorisation test]

Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test with a HB correction, n = 500


Page 101

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] The interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

• D = 2 : ∆LP = PXY − PXPY

• D = 3 : ∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

[Diagrams over (X, Y, Z) of the corresponding factorisations]

∆_L P = P_XYZ − P_X P_YZ − P_Y P_XZ − P_Z P_XY + 2 P_X P_Y P_Z

Page 102

Lancaster Interaction Measure

Case of P_X ⊥⊥ P_YZ: ∆_L P = 0

Page 103

Lancaster Interaction Measure

(X,Y) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y,Z) ⊥⊥ X  ⇒  ∆_L P = 0.

...so what might be missed?

Page 104

Lancaster Interaction Measure

The converse does not hold:

∆_L P = 0  ⇏  (X,Y) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y,Z) ⊥⊥ X

Example:

P (0, 0, 0) = 0.2 P (0, 0, 1) = 0.1 P (1, 0, 0) = 0.1 P (1, 0, 1) = 0.1

P (0, 1, 0) = 0.1 P (0, 1, 1) = 0.1 P (1, 1, 0) = 0.1 P (1, 1, 1) = 0.2

Page 105

A Test using Lancaster Measure

• Test statistic is the empirical estimate of ‖µ_κ(∆_L P)‖²_{H_κ}, where κ = k ⊗ l ⊗ m:

‖µ_κ(P_XYZ − P_XY P_Z − · · · )‖²_{H_κ} = ⟨µ_κ P_XYZ, µ_κ P_XYZ⟩_{H_κ} − 2 ⟨µ_κ P_XYZ, µ_κ P_XY P_Z⟩_{H_κ} + · · ·


Page 107

Inner Product Estimators

ν \ ν′      | P_XYZ          | P_XY P_Z          | P_XZ P_Y          | P_YZ P_X          | P_X P_Y P_Z
P_XYZ       | (K◦L◦M)_{++}   | ((K◦L)M)_{++}     | ((K◦M)L)_{++}     | ((M◦L)K)_{++}     | tr(K_+ ◦ L_+ ◦ M_+)
P_XY P_Z    |                | (K◦L)_{++} M_{++} | (MKL)_{++}        | (KLM)_{++}        | (KL)_{++} M_{++}
P_XZ P_Y    |                |                   | (K◦M)_{++} L_{++} | (KML)_{++}        | (KM)_{++} L_{++}
P_YZ P_X    |                |                   |                   | (L◦M)_{++} K_{++} | (LM)_{++} K_{++}
P_X P_Y P_Z |                |                   |                   |                   | K_{++} L_{++} M_{++}

Table 1: V-statistic estimators of ⟨µ_κ ν, µ_κ ν′⟩_{H_κ}

‖µ_κ(∆_L P)‖²_{H_κ} = (1/n²) (HKH ◦ HLH ◦ HMH)_{++}

Empirical joint central moment in the feature space
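A direct numpy sketch of this statistic:

```python
import numpy as np

def lancaster_stat(K, L, M):
    """(1/n^2) * (HKH o HLH o HMH)_{++}: the empirical ||mu_kappa(Delta_L P)||^2,
    with H = I - (1/n) 11^T and 'o' the entrywise (Hadamard) product."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return ((H @ K @ H) * (H @ L @ H) * (H @ M @ H)).sum() / n ** 2
```

Two sanity checks: centring kills a constant kernel matrix (HJH = 0 for the all-ones J), and the statistic is linear in each kernel matrix separately.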

Page 108

Example A: factorisation tests

[Figure: null acceptance rate (Type II error) vs. dimension, Dataset A — CI test X ⊥⊥ Y | Z, ∆_L factorisation test, and two-variable factorisation test]

Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500

Page 109

Example B: Joint dependence can be easier to detect

• X1, Y1 i.i.d. ∼ N(0, 1)

• Z1 = X1² + ε w.p. 1/3;  Y1² + ε w.p. 1/3;  X1·Y1 + ε w.p. 1/3, where ε ∼ N(0, 0.1²)

• X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, I_{p−1})

• dependence of Z on pair (X,Y ) is stronger than on X and Y individually

• Satisfies faithfulness

Page 110

Example B: factorisation tests

[Figure: null acceptance rate (Type II error) vs. dimension, Dataset B — CI test X ⊥⊥ Y | Z, ∆_L factorisation test, and two-variable factorisation test]

Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500

Page 111

Interaction for D ≥ 4

• Interaction measure valid for all D (Streitberg, 1990):

∆_S P = Σ_π (−1)^{|π|−1} (|π| − 1)! J_π P

  – For a partition π, J_π associates to the joint distribution the corresponding factorisation, e.g., J_{13|2|4} P = P_{X1X3} P_{X2} P_{X4}.


Page 113

Interaction for D ≥ 4

[Figure: the number of partitions of {1, . . . , D} grows as the Bell numbers — super-exponentially in D]

Bell numbers growth: joint central moments (Lancaster interaction) vs. joint cumulants (Streitberg interaction)
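The partition counts underlying this growth are the Bell numbers, computable by the standard recurrence B_n = Σ_k C(n−1, k) B_k:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bell(n):
    """Number of partitions of a set of size n (Bell number B_n)."""
    if n == 0:
        return 1
    return sum(comb(n - 1, k) * bell(k) for k in range(n))

# e.g. bell(3) = 5: the five ways to partition {1, 2, 3}
```

This is why the full Streitberg sum over partitions π quickly becomes expensive as D grows.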


Page 115

Total independence test

• Total independence test:

H0 : PXY Z = PXPY PZ vs. H1 : PXY Z 6= PXPY PZ

• For (X1, . . . , XD) ∼ P_X, and κ = ⊗_{i=1}^D k^{(i)}, write ∆_tot P := P_X − ∏_{i=1}^D P_{Xi}. Then

‖µ_κ(∆_tot P)‖² = (1/n²) Σ_{a=1}^n Σ_{b=1}^n ∏_{i=1}^D K^{(i)}_{ab} − (2/n^{D+1}) Σ_{a=1}^n ∏_{i=1}^D Σ_{b=1}^n K^{(i)}_{ab} + (1/n^{2D}) ∏_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^{(i)}_{ab}.

• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: a similar relationship to that between dCov and HSIC (DS et al, 2013)
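A sketch of this statistic; for D = 2 it coincides with the biased HSIC estimate (1/n²)(HKH ◦ HLH)_{++}, which gives a cheap correctness check:

```python
import numpy as np

def total_indep_stat(Ks):
    """||mu_kappa(Delta_tot P)||^2 from kernel matrices K^(1), ..., K^(D)."""
    n = Ks[0].shape[0]
    D = len(Ks)
    prod_all = np.ones((n, n))   # entrywise product prod_i K^(i)_ab
    row_sums = np.ones(n)        # prod_i sum_b K^(i)_ab
    total = 1.0                  # prod_i K^(i)_++
    for K in Ks:
        prod_all = prod_all * K
        row_sums = row_sums * K.sum(axis=1)
        total *= K.sum()
    return (prod_all.sum() / n ** 2
            - 2 * row_sums.sum() / n ** (D + 1)
            + total / n ** (2 * D))
```

The three terms mirror the three sums in the displayed formula, one pass per kernel matrix.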

Page 116

Example B: total independence tests

[Figure: null acceptance rate (Type II error) vs. dimension, Dataset B — total independence tests based on ∆_tot and ∆_L]

Figure 4: Total independence: ∆_tot P vs. ∆_L P, n = 500

Page 117

Nonparametric Bayesian inference using distribution embeddings

Page 118

Motivating Example: Bayesian inference without a model

• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^{1200})

• 1800 training frames, the remainder for test.

• Gaussian noise added to Y_t.

Challenges:

• No parametric model of camera dynamics (only samples)

• No parametric model of map from camera angle to image (only samples)

• Want to do filtering: Bayesian inference

Page 119

ABC: an approach to Bayesian inference without a model

Bayes rule:

P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

• P(x|y) is the likelihood

• π(y) is the prior

One approach: Approximate Bayesian Computation (ABC)


Page 125

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

[Figure: ABC demonstration — prior samples y ∼ π(Y) vs. likelihood samples x ∼ P(X|y); given an observation x∗, the accepted y values approximate P(Y|x∗)]

Needed: distance measure D, tolerance parameter τ .

Page 126

ABC: an approach to Bayesian inference without a model

Bayes rule:

P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

• P(x|y) is the likelihood

• π(y) is the prior

ABC generates a sample from P(Y|x∗) as follows:

1. generate a sample y_t from the prior π,

2. generate a sample x_t from P(X|y_t),

3. if D(x∗, x_t) < τ, accept y = y_t; otherwise reject,

4. go to (1).

In step (3), D is a distance measure, and τ is a tolerance parameter.
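The four steps above as a sketch, on a toy Gaussian model (prior N(0,1), likelihood x|y ∼ N(y,1); the model, distance, and tolerance are illustrative assumptions):

```python
import numpy as np

def abc_rejection(x_star, prior, simulate, dist, tau, n_accept, rng):
    """Steps 1-4: draw y from the prior, simulate x given y, and keep y
    whenever the simulated x lands within tau of the observation x_star."""
    accepted = []
    while len(accepted) < n_accept:
        y = prior(rng)               # 1. y_t ~ pi
        x = simulate(y, rng)         # 2. x_t ~ P(X | y_t)
        if dist(x_star, x) < tau:    # 3. accept / reject
            accepted.append(y)       # 4. repeat
    return np.array(accepted)

rng = np.random.default_rng(0)
post = abc_rejection(
    x_star=1.0,
    prior=lambda r: r.normal(0.0, 1.0),
    simulate=lambda y, r: y + r.normal(0.0, 1.0),
    dist=lambda a, b: abs(a - b),
    tau=0.3, n_accept=300, rng=rng,
)
# exact posterior here is N(0.5, 1/2), so the ABC sample mean should be near 0.5
```

Shrinking τ makes the accepted sample closer to the true posterior, at the price of a lower acceptance rate.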

Page 127

Motivating example 2: simple Gaussian case

• p(x, y) is N((0, 1_d⊤)⊤, V) with V a randomly generated covariance

Posterior mean on x: ABC vs kernel approach

[Figure: CPU time (sec) vs. mean square error of the posterior mean on x (6 dim.), for KBI, COND, and ABC; sample sizes annotated along each curve]


Page 129

Bayes again

Bayes rule:

P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

• P(x|y) is the likelihood

• π is the prior

How would this look with kernel embeddings?

Define RKHS G on Y with feature map ψy and kernel l(y, ·)

We need a conditional mean embedding: for all g ∈ G,

E_{Y|x∗} g(Y) = ⟨g, µ_{P(y|x∗)}⟩_G

This will be obtained by RKHS-valued ridge regression


Page 131

Ridge regression and the conditional feature mean

Ridge regression from X := R^d to a finite vector output Y := R^{d′} (these could be d′ nonlinear features of y):

Define training data

X = [x1 . . . xm] ∈ R^{d×m},  Y = [y1 . . . ym] ∈ R^{d′×m}

Solve

A = argmin_{A ∈ R^{d′×d}} ( ‖Y − AX‖² + λ‖A‖²_HS ),

where

‖A‖²_HS = tr(A⊤A) = Σ_{i=1}^{min{d,d′}} γ²_{A,i}


Solution: A = C_YX (C_XX + mλI)^{−1}

Page 133

Ridge regression and the conditional feature mean

Prediction at a new point x:

y∗ = Ax = C_YX (C_XX + mλI)^{−1} x = Σ_{i=1}^m β_i(x) y_i

where

β(x) = (K + λmI)^{−1} [k(x1, x) . . . k(xm, x)]⊤

and

K := X⊤X,  k(x1, x) = x1⊤x
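The primal form C_YX(C_XX + mλI)^{−1}x and the dual weights β(x) give the same prediction; a sketch checking this numerically (assuming the unnormalised covariances C_YX = YX⊤ and C_XX = XX⊤, which match the λm regulariser in the dual):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, lam = 3, 10, 0.1
X = rng.standard_normal((d, m))   # training inputs, columns x_i
Y = rng.standard_normal((2, m))   # training outputs, columns y_i
x = rng.standard_normal(d)        # new test point

# primal: y* = Y X^T (X X^T + lam*m*I)^{-1} x
primal = Y @ X.T @ np.linalg.solve(X @ X.T + lam * m * np.eye(d), x)

# dual: y* = sum_i beta_i(x) y_i with beta(x) = (K + lam*m*I)^{-1} k_x, K = X^T X
beta = np.linalg.solve(X.T @ X + lam * m * np.eye(m), X.T @ x)
dual = Y @ beta
```

The agreement follows from the matrix identity X⊤(XX⊤ + cI)^{−1} = (X⊤X + cI)^{−1}X⊤, which is exactly what lets everything be done "in kernel space".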


What if we do everything in kernel space?

Page 135

Ridge regression and the conditional feature mean

Recall our setup:

• Given training pairs:

(xi, yi) ∼ PXY

• F on X with feature map ϕx and kernel k(x, ·)• G on Y with feature map ψy and kernel l(y, ·)

We define the covariance between feature maps:

CXX = EX (ϕX ⊗ ϕX) CXY = EXY (ϕX ⊗ ψY )

and matrices of feature mapped training data

X =[ϕx1 . . . ϕxm

]Y :=

[ψy1 . . . ψym

]

Ridge regression and the conditional feature mean

Objective: [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]

A = argmin_{A ∈ HS(F,G)} ( E_{XY} ‖ψ_Y − Aφ_X‖²_G + λ‖A‖²_HS ),   ‖A‖²_HS = ∑_{i=1}^{∞} γ²_{A,i}

Solution same as the vector case:

A = C_{YX} (C_{XX} + mλI)^{−1}

Prediction at a new x using kernels:

Aφ_x = [ψ_{y_1} … ψ_{y_m}] (K + λmI)^{−1} [k(x_1, x) … k(x_m, x)]⊤ = ∑_{i=1}^{m} β_i(x) ψ_{y_i}

where K_{ij} = k(x_i, x_j)
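The kernelized prediction ∑_i β_i(x) ψ_{y_i} can be sketched for scalar outputs (identity feature map on y) with an assumed Gaussian kernel; the bandwidth, regularization, and target function below are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam = 100, 1e-4

x = np.sort(rng.uniform(-3, 3, size=m))     # training inputs
y = np.sin(x)                               # training outputs (identity feature map on y)

def k(a, b, sigma=0.5):
    """Gaussian kernel with an assumed bandwidth sigma."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

K = k(x, x)                                 # Gram matrix K_ij = k(x_i, x_j)

def predict(x_new):
    # beta(x) = (K + lam*m*I)^{-1} [k(x_1,x) ... k(x_m,x)]^T,
    # prediction y* = sum_i beta_i(x) y_i.
    beta = np.linalg.solve(K + lam * m * np.eye(m), k(x, x_new))
    return y @ beta

x_test = np.array([0.0, 1.0])
assert np.max(np.abs(predict(x_test) - np.sin(x_test))) < 0.1
```

With dense, noise-free training data the kernel ridge prediction tracks the target closely; the ψ_y here is just y, so the weighted sum of features is the predicted output itself.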

Ridge regression and the conditional feature mean

How is the loss E_{XY}‖ψ_Y − Aφ_X‖²_G relevant to a conditional expectation E_{Y|x} g(Y)? Define [Song et al. (2009), Grunewalder et al. (2013)]:

μ_{Y|x} := Aφ_x

We need A to have the property

E_{Y|x} g(Y) ≈ ⟨g, μ_{Y|x}⟩_G = ⟨g, Aφ_x⟩_G = ⟨A*g, φ_x⟩_F = (A*g)(x)

Natural risk function for the conditional mean:

L(A, P_{XY}) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A*g)(X) ]²,

where the first term in the bracket is the target and the second is the estimator.

Ridge regression and the conditional feature mean

The squared-loss risk provides an upper bound on the natural risk:

L(A, P_{XY}) ≤ E_{XY} ‖ψ_Y − Aφ_X‖²_G

Proof (Jensen and Cauchy-Schwarz):

L(A, P_{XY}) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A*g)(X) ]²
≤ E_{XY} sup_{‖g‖≤1} [ g(Y) − (A*g)(X) ]²        (Jensen)
= E_{XY} sup_{‖g‖≤1} [ ⟨g, ψ_Y⟩_G − ⟨A*g, φ_X⟩_F ]²        (reproducing property)
= E_{XY} sup_{‖g‖≤1} [ ⟨g, ψ_Y⟩_G − ⟨g, Aφ_X⟩_G ]²        (definition of the adjoint)
= E_{XY} sup_{‖g‖≤1} ⟨g, ψ_Y − Aφ_X⟩²_G
≤ E_{XY} ‖ψ_Y − Aφ_X‖²_G        (Cauchy-Schwarz)

If we assume E_Y[g(Y)|X = x] ∈ F then the upper bound is tight (next slide).

Conditions for ridge regression = conditional mean

The conditional mean is obtained by ridge regression when E_Y[g(Y)|X = x] ∈ F:
Given a function g ∈ G, assume E_{Y|X}[g(Y)|X = ·] ∈ F. Then

C_{XX} E_{Y|X}[g(Y)|X = ·] = C_{XY} g.

Why this is useful:

E_{Y|X}[g(Y)|X = x] = ⟨E_{Y|X}[g(Y)|X = ·], φ_x⟩_F
= ⟨C_{XX}^{−1} C_{XY} g, φ_x⟩_F
= ⟨g, C_{YX} C_{XX}^{−1} φ_x⟩_G,

where C_{YX} C_{XX}^{−1} is the regression operator.

Conditions for ridge regression = conditional mean

Claim: given a function g ∈ G with E_{Y|X}[g(Y)|X = ·] ∈ F,

C_{XX} E_{Y|X}[g(Y)|X = ·] = C_{XY} g.

Proof [Fukumizu et al., 2004]: for all f ∈ F, by definition of C_{XX},

⟨f, C_{XX} E_{Y|X}[g(Y)|X = ·]⟩_F = cov( f, E_{Y|X}[g(Y)|X = ·] )
= E_X ( f(X) E_{Y|X}[g(Y)|X] )
= E_{XY} ( f(X) g(Y) )
= ⟨f, C_{XY} g⟩_F,

by definition of C_{XY}.
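The identity can be checked directly in a finite discrete setting, where one-hot feature maps make C_{XX} a diagonal matrix of marginals and C_{XY} the joint probability table (an illustrative construction, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny = 4, 5

P = rng.uniform(size=(nx, ny))
P /= P.sum()                         # joint distribution P(x, y)
px = P.sum(axis=1)                   # marginal P(x)

g = rng.normal(size=ny)              # arbitrary function g(y)

# With one-hot features, C_XX = diag(px) and C_XY is the joint table P.
C_XX = np.diag(px)
C_XY = P

h = (P @ g) / px                     # h(x) = E[g(Y) | X = x]

# The slide's identity: C_XX h = C_XY g.
assert np.allclose(C_XX @ h, C_XY @ g)
```

Row x of C_{XY} g is ∑_y P(x, y) g(y) = P(x) E[g(Y)|X = x], which is exactly row x of C_{XX} h.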

Kernel Bayes' law

• Prior: Y ∼ π(y)
• Likelihood: (X|y) ∼ P(x|y), with some joint P(x, y)
• Joint distribution: Q(x, y) = P(x|y)π(y)
  Warning: Q ≠ P, a change of measure from P(y) to π(y)
• Marginal for x: Q(x) := ∫ P(x|y)π(y) dy
• Bayes' law: Q(y|x) = P(x|y)π(y) / Q(x)
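In the discrete case the change of measure and Bayes' law read directly as array operations (an illustrative sketch with random tables):

```python
import numpy as np

rng = np.random.default_rng(3)
nx, ny = 3, 4

lik = rng.uniform(size=(nx, ny))
lik /= lik.sum(axis=0)               # likelihood P(x|y): each column sums to 1
prior = rng.uniform(size=ny)
prior /= prior.sum()                 # prior pi(y)

Q_joint = lik * prior[None, :]       # Q(x, y) = P(x|y) pi(y): the change of measure
Q_x = Q_joint.sum(axis=1)            # marginal Q(x) = sum_y P(x|y) pi(y)
posterior = Q_joint / Q_x[:, None]   # Bayes' law: Q(y|x) = P(x|y) pi(y) / Q(x)

assert np.allclose(posterior.sum(axis=1), 1.0)   # each posterior row is a distribution
```

The point of the "warning" bullet shows up here: Q_joint is built from the prior π, not from the P(y) under which the training pairs were drawn.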

Kernel Bayes' law

• Posterior embedding via the usual conditional update:

μ_{Q(y|x)} = C_{Q(y,x)} C_{Q(x,x)}^{−1} φ_x

• Given the mean embedding of the prior: μ_{π(y)}
• Define the marginal covariance:

C_{Q(x,x)} = ∫ (φ_x ⊗ φ_x) P(x|y)π(y) dx dy = C_{(xx)y} C_{yy}^{−1} μ_{π(y)}

• Define the cross-covariance:

C_{Q(y,x)} = ∫ (ψ_y ⊗ φ_x) P(x|y)π(y) dx dy = C_{(yx)y} C_{yy}^{−1} μ_{π(y)}.

Kernel Bayes' law: consistency result

• How do we compute the posterior expectation from data?
• Given samples: {(x_i, y_i)}_{i=1}^n from P_{xy}, and {u_j}_{j=1}^n from the prior π.
• We want to compute E[g(Y)|X = x] for g ∈ G.
• For any x ∈ X,

| g_y⊤ R_{Y|X} k_X(x) − E[g(Y)|X = x] | = O_p(n^{−4/27}),   (n → ∞),

where
– g_y = (g(y_1), …, g(y_n))⊤ ∈ R^n
– k_X(x) = (k(x_1, x), …, k(x_n, x))⊤ ∈ R^n
– R_{Y|X} is learned from the samples, and contains the u_j

Smoothness assumptions:
• π/p_Y ∈ R(C_{YY}^{1/2}), where p_Y is the p.d.f. of P_Y
• E[g(Y)|X = ·] ∈ R(C_{Q(xx)}²)

Experiment: Kernel Bayes' law vs EKF

• Compare with the extended Kalman filter (EKF) on a camera orientation task
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^{1200})
• 1800 training frames, the remainder for test
• Gaussian noise added to Y_t

Average MSE and standard errors (10 runs):

             KBR (Gauss)      KBR (Tr)         Kalman (9 dim.)   Kalman (Quat.)
σ² = 10⁻⁴    0.210 ± 0.015    0.146 ± 0.003    1.980 ± 0.083     0.557 ± 0.023
σ² = 10⁻³    0.222 ± 0.009    0.210 ± 0.008    1.935 ± 0.064     0.541 ± 0.022

Co-authors

• From UCL:
  – Kacper Chwialkowski
  – Heiko Strathmann
• External:
  – Wicher Bergsma, LSE
  – Karsten Borgwardt, ETH Zurich
  – Kenji Fukumizu, ISM
  – Bernhard Schoelkopf, MPI
  – Alexander Smola, CMU/Google
  – Le Song, Georgia Tech
  – Dino Sejdinovic, Oxford
  – Bharath Sriperumbudur, Penn State


Questions?


Energy Distance and the MMD

Energy distance and MMD

Distance between probability distributions:

Energy distance [Baringhaus and Franz, 2004, Szekely and Rizzo, 2004, 2005]:

D_E(P, Q) = 2 E_{P,Q}‖X − Y‖ − E_P‖X − X′‖ − E_Q‖Y − Y′‖

Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012a]:

MMD²(P, Q; F) = E_P k(X, X′) + E_Q k(Y, Y′) − 2 E_{P,Q} k(X, Y)

Energy distance is MMD with a particular kernel!
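The equivalence can be checked numerically. Writing energy distance in its standard form D_E = 2E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, and taking the distance-induced kernel k(z, z′) = ‖z − z₀‖ + ‖z′ − z₀‖ − ‖z − z′‖, the V-statistic estimates agree exactly for any reference point z₀ (a sketch with assumed Gaussian samples):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))            # samples from P
Y = rng.normal(loc=1.0, size=(25, 2))   # samples from Q (shifted mean)

def dist(A, B):
    """Pairwise Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

# Empirical energy distance (V-statistic).
DE = 2 * dist(X, Y).mean() - dist(X, X).mean() - dist(Y, Y).mean()

# Distance-induced kernel k(z, z') = ||z - z0|| + ||z' - z0|| - ||z - z'||.
z0 = rng.normal(size=(1, 2))            # any reference point gives the same result
def k(A, B):
    return dist(A, z0) + dist(B, z0).T - dist(A, B)

MMD2 = k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

assert np.isclose(MMD2, DE)             # energy distance = MMD^2 with this kernel
```

The z₀ terms cancel algebraically across the three expectations, which is why the identity holds sample-by-sample, not just in the population limit.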

Distance covariance and HSIC

Distance covariance [Szekely and Rizzo, 2009, Szekely et al., 2007]:

V²(X, Y) = E_{XY} E_{X′Y′} [ ‖X − X′‖ ‖Y − Y′‖ ]
+ E_X E_{X′}‖X − X′‖ E_Y E_{Y′}‖Y − Y′‖
− 2 E_{XY} [ E_{X′}‖X − X′‖ E_{Y′}‖Y − Y′‖ ]

Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then

HSIC(P_{XY}, P_X P_Y) = E_{XY} E_{X′Y′} k(X, X′) l(Y, Y′) + E_X E_{X′} k(X, X′) E_Y E_{Y′} l(Y, Y′)
− 2 E_{X′Y′} [ E_X k(X, X′) E_Y l(Y, Y′) ].

Distance covariance is HSIC with particular kernels!
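The sample versions coincide as well, up to the normalization conventions in the literature (both statistics below use the same 1/n² factor): double-centering a distance-induced Gram matrix annihilates its rank-one reference-point terms, so the biased HSIC statistic tr(KHLH)/n² reduces to the sample distance covariance. An illustrative check:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
X = rng.normal(size=(n, 2))
Y = X ** 2 + 0.1 * rng.normal(size=(n, 2))   # Y depends on X

def dist(A):
    """Pairwise Euclidean distance matrix for the rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

Dx, Dy = dist(X), dist(Y)
H = np.eye(n) - np.ones((n, n)) / n          # centering matrix

# Sample distance covariance: mean product of double-centered distance matrices.
V2 = ((H @ Dx @ H) * (H @ Dy @ H)).sum() / n ** 2

# HSIC V-statistic with distance-induced kernels (reference point z0 = 0).
dx0 = np.linalg.norm(X, axis=1)
dy0 = np.linalg.norm(Y, axis=1)
K = dx0[:, None] + dx0[None, :] - Dx
L = dy0[:, None] + dy0[None, :] - Dy
HSIC = np.trace(K @ H @ L @ H) / n ** 2

assert np.isclose(HSIC, V2)
```

Since H1 = 0, centering maps K to −HDxH and L to −HDyH, and the two sign flips cancel in the trace.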

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let ρ : X × X → R be a semimetric on X. Let z_0 ∈ X, and denote

k_ρ(z, z′) = ρ(z, z_0) + ρ(z′, z_0) − ρ(z, z′).

Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type.

Call k_ρ a distance-induced kernel.

Special case: Z ⊆ R^d and ρ_q(z, z′) = ‖z − z′‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.

Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
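A quick numerical illustration of the Berg et al. construction: for ρ_q(z, z′) = ‖z − z′‖^q with 0 < q ≤ 2, the induced Gram matrix on any finite point set should be positive semidefinite (sketch with assumed random points; z₀ = 0 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
Z = rng.normal(size=(25, 3))        # arbitrary points in R^3
z0 = np.zeros(3)                    # reference point for the induced kernel

def gram(q):
    """Gram matrix of k_rho(z, z') = rho_q(z, z0) + rho_q(z', z0) - rho_q(z, z')."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2) ** q
    d0 = np.linalg.norm(Z - z0, axis=1) ** q
    return d0[:, None] + d0[None, :] - D

# rho_q is of negative type for 0 < q <= 2, so each Gram matrix is PSD.
for q in (0.5, 1.0, 2.0):
    assert np.linalg.eigvalsh(gram(q)).min() > -1e-8
```

Note the boundary case q = 2 gives k_ρ(z, z′) = 2⟨z, z′⟩ for z₀ = 0, the linear kernel (up to scale), which is why MMD with q = 2 only compares means.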

Two-sample testing benchmark

Two-sample testing example in 1-D: [figure: a density P(X) compared against three alternative densities Q(X)]

Two-sample test, MMD with distance kernel

More powerful tests are obtained on this problem when q ≠ 1 (the exponent of the distance ρ_q).

Key:
• Gaussian kernel
• q = 1
• Best: q = 1/3
• Worst: q = 2

Selected references

Characteristic kernels and mean embeddings:

• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

• Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three-way interaction:

• Sejdinovic, D., Gretton, A., Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

• Weston, J., Chapelle, O., Elisseeff, A., Schoelkopf, B., Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
• Micchelli, C., Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
• Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes rule:

• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.

Page 171: Kernelnonparametrictestsof homogeneity,independence,and ...mpd37/teaching/2015/ml... · Kernelnonparametrictestsof homogeneity,independence,and multi-variableinteraction Imperial,2015
Page 172: Kernelnonparametrictestsof homogeneity,independence,and ...mpd37/teaching/2015/ml... · Kernelnonparametrictestsof homogeneity,independence,and multi-variableinteraction Imperial,2015

References

L.

Barin

ghaus

and

C.

Franz.

On

anew

multiv

aria

te

two-s

am

ple

test.

J.

Multiv

ariate

Anal.,

88:1

90–206,2004.

C.Berg,J.P.R.Christensen,and

P.Ressel.

Harmonic

Analysis

on

Semigro

ups.

Sprin

ger,New

York,1984.

Andrey

Feuerverger.A

consistenttestfo

rbiv

aria

te

dependence.In

ternatio

nal

Sta

tistic

alReview,61(3):4

19–433,1993.

K.Fukum

izu,F.R.Bach,and

M.I.

Jordan.

Dim

ensio

nalit

yreductio

nfo

rsupervised

learnin

gwith

reproducin

gkernel

Hilb

ert

spaces.

Journalof

Machin

eLearnin

gResearch,5:7

3–99,2004.

K.Fukum

izu,

A.G

retton,

X.Sun,

and

B.Scholk

opf.

Kernelm

easures

of

conditio

naldependence.In

Advancesin

Neura

lIn

formatio

nProcessin

gSystems

20,pages

489–496,Cam

brid

ge,M

A,2008.M

ITPress.

A.G

retton

and

L.G

yorfi.

Consistent

nonparam

etric

tests

ofin

dependence.

JournalofM

achin

eLearnin

gResearch,11:1

391–1423,2010.

A.G

retton,O

.Bousquet,A.J.Sm

ola

,and

B.Scholk

opf.

Measurin

gstatisti-

caldependencewith

Hilb

ert-S

chm

idtnorm

s.In

S.Jain

,H.U.Sim

on,and

E.Tom

ita,editors,Proceedin

gsofth

eIn

ternatio

nalConference

on

Algorith

mic

Learnin

gTheory,pages

63–77.Sprin

ger-V

erla

g,2005.

A.G

retton,K

.Borgwardt,M

.Rasch,B.Scholk

opf,

and

A.J.Sm

ola

.A

ker-

nelm

ethod

for

the

two-s

am

ple

proble

m.

InAdvancesin

Neura

lIn

formatio

n

Processin

gSystems15,pages

513–520,Cam

brid

ge,M

A,2007.M

ITPress.

A.G

retton,K

.Fukum

izu,C.-H

.Teo,L.Song,B.Scholk

opf,

and

A.J.Sm

ola

.A

kernelstatistic

altestofin

dependence.In

Advancesin

Neura

lIn

formatio

n

Processin

gSystems20,pages

585–592,Cam

brid

ge,M

A,2008.M

ITPress.

A.G

retton,K

.Borgwardt,M

.Rasch,B.Schoelk

opf,

and

A.Sm

ola

.A

kernel

two-s

am

ple

test.

JM

LR,13:7

23–773,2012a.

A.G

retton,B.Srip

erum

budur,D

.Sejd

inovic

,H.Strathm

ann,S.Bala

krish-

nan,M

.Pontil,

and

K.Fukum

izu.

Optim

alkernelchoic

efo

rla

rge-s

cale

two-s

am

ple

tests.

InNIP

S,2012b.

T.Read

and

N.Cressie

.Goodness-O

f-Fit

Sta

tistic

sforDiscrete

Multiv

ariate

Anal-

ysis.

Sprin

ger-V

erla

g,New

York,1988.

104-1

A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.

G. Szekely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

G. Szekely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

G. Szekely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.

G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

K. Zhang, J. Peters, D. Janzing, and B. Scholkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.