Kernel nonparametric tests of homogeneity, independence, and multi-variable interaction
Imperial, 2015
Arthur Gretton
Gatsby Unit, CSML, UCL
Motivating example: detect differences in AM signals
[Figure: samples from P (left) and samples from Q (right)]
Case of discrete domains
• How do you compare distributions. . .
• . . .in a discrete domain? [Read and Cressie, 1988]
X1: Now disturbing reports out of Newfoundland show that the fragile snow crab industry is in serious decline. First the west coast salmon, the east coast salmon and the cod, and now the snow crabs off Newfoundland.
Y1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.
X2: To my pleasant surprise he responded that he had personally visited those wharves and that he had already announced money to fix them. What wharves did the minister visit in my riding and how much additional funding is he going to provide for Delaps Cove, Hampton, Port Lorne,
· · ·

$P_X \overset{?}{=} P_Y$
Y2: On the grain transportation system we have had the Estey report and the Kroeger report. We could go on and on. Recently programs have been announced over and over by the government such as money for the disaster in agriculture on the prairies and across Canada.
· · ·
Are the pink extracts from the same distribution as the gray ones?
Detecting statistical dependence
• How do you detect dependence. . .
• . . .in a discrete domain? [Read and Cressie, 1988]
Detecting statistical dependence, discrete domain
• How do you detect dependence. . .
• . . .in a discrete domain? [Read and Cressie, 1988]
X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.
Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.
X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.
· · ·

$P_{XY} \overset{?}{=} P_X P_Y$
Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.
· · ·
Are the French text extracts translations of the English ones?
Detecting statistical dependence, continuous domain
• How do you detect dependence. . .
• . . .in a continuous domain?
[Figure: scatter plot of a sample from P_XY in the (X, Y) plane. Is it drawn from the dependent P_XY, or from the independent P_X P_Y?]
Detecting statistical dependence, continuous domain
• How do you detect dependence. . .
• . . . in a continuous domain?
[Figure: a sample from P_XY, shown next to the discretized empirical P_XY and the discretized empirical P_X P_Y]
Detecting statistical dependence, continuous domain
• How do you detect dependence. . .
• . . .in a continuous domain?
• Problem: this fails even in “low” dimensions! [NIPS07a, ALT08]
– X and Y in R⁴, statistic = power divergence, sample size = 1024: dependence detected in 0 of 500 cases
• Too few points per bin
Outline
• An RKHS metric on the space of probability measures
– Distance between means in the space of features (RKHS)
– Nonparametric two-sample test. . .
– . . . for (almost!) any data type: strings, images, graphs, groups (rotation matrices), semigroups (histograms), . . .
– Two-sample testing for big data, and how to choose the kernel [NIPS12]
• Dependence detection
– Distance between the joint P_XY and the product of marginals P_X P_Y
– Testing three-way interactions [NIPS13]
• Nonparametric Bayesian inference
Feature mean difference
• Simple example: 2 Gaussians with different means
• Answer: t-test
[Figure: densities of two Gaussians P_X and Q_X with different means]
Feature mean difference
• Two Gaussians with same means, different variance
• Idea: look at the difference in means of features of the RVs
• In the Gaussian case: second order features of the form ϕ(x) = x²
[Figure: densities of two Gaussians P_X and Q_X with different variances]
[Figure: densities of the feature X² under P_X and Q_X]
Feature mean difference
• Gaussian and Laplace distributions
• Same mean and same variance
• Difference in means using higher order features
[Figure: Gaussian and Laplace densities P_X and Q_X]
. . . so let’s explore feature representations!
Kernels: similarity between features
Kernel:
• We have two objects x and x′ from a set X (documents, images, . . .).
How similar are they?
• Define features of objects:
– ϕx are features of x,
– ϕx′ are features of x′
• A kernel is the dot product between these features:
k(x, x′) := 〈ϕx, ϕx′〉F .
Good features make problems easy: XOR example
[Figure: XOR data; red and blue classes in the (x1, x2) plane]
• No linear classifier separates red from blue
• Map points to a higher dimensional feature space:
$$\varphi_x = \begin{bmatrix} x_1 & x_2 & x_1 x_2 \end{bmatrix}^\top \in \mathbb{R}^3$$
Good features make problems easy: XOR example
Example: a three-dimensional space of features of points in R²:
$$\varphi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \varphi_x = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix},$$
with kernel
$$k(x, y) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}^\top \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix}$$
(the standard inner product in R³ between features). Denote this feature space by F.
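Illustration (not from the original slides): a minimal NumPy sketch of this feature map, checking that the product feature x1·x2 makes the XOR problem linearly separable. The names phi, k, and the data are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Feature map phi_x = [x1, x2, x1*x2] for a point x in R^2."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """Kernel: the standard inner product between features in R^3."""
    return phi(x) @ phi(y)

# XOR-style data: class = sign(x1 * x2), not linearly separable in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = np.sign(X[:, 0] * X[:, 1])

# In feature space, the linear function f = [0, 0, 1] separates perfectly:
f = np.array([0.0, 0.0, 1.0])
preds = np.sign(np.array([f @ phi(x) for x in X]))
print((preds == labels).mean())  # 1.0
```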
Functions of features
Define a linear function of the inputs x1, x2, and their product x1x2,
f(x) = f1x1 + f2x2 + f3x1x2.
f is in a space of functions mapping from X = R² to R. Equivalent representation for f:
$$f = \begin{bmatrix} f_1 & f_2 & f_3 \end{bmatrix}^\top.$$
f refers to the function as an object (here, a vector in R³);
f(x) ∈ R is the function evaluated at a point (a real number).
$$f(x) = f^\top \varphi_x = \langle f, \varphi_x \rangle_F$$
Evaluation of f at x is an inner product in the feature space (here the standard inner product in R³).
F is a space of functions mapping R² to R.
The reproducing property
This example illustrates the two defining features of an RKHS:
• The reproducing property: ∀x ∈ X , ∀f ∈ F , 〈f, ϕx〉F = f(x)
• In particular, for any x, y ∈ X ,
k(x, y) = 〈ϕx, ϕy〉F .
Note: the feature map of every point is in the feature space: ∀x ∈ X , ϕx ∈ F
Infinite dimensional feature space
Reproducing property for a function with the Gaussian kernel:
$$f(x) := \sum_{i=1}^m \alpha_i k(x_i, x) = \Bigl\langle \sum_{i=1}^m \alpha_i \varphi_{x_i},\; \varphi_x \Bigr\rangle_F.$$
[Figure: f(x), a sum of Gaussian bumps centred at the points x_i]
• What do the features ϕx look like (there are infinitely many of them!)
• What do these features have to do with smoothness?
Infinite dimensional feature space
Gaussian kernel, $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$:
$$\lambda_k \propto b^k \;\; (b < 1), \qquad e_k(x) \propto \exp(-(c - a)x^2)\, H_k(x\sqrt{2c}),$$
where a, b, c are functions of σ, and H_k is the kth order Hermite polynomial.
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i e_i(x) e_i(x') = \sum_{i=1}^{\infty} \left(\sqrt{\lambda_i}\, e_i(x)\right) \left(\sqrt{\lambda_i}\, e_i(x')\right) = \sum_{i=1}^{\infty} \varphi_x^{(i)} \varphi_{x'}^{(i)}$$
Infinite dimensional feature space
Example RKHS function, Gaussian kernel:
$$f(x) := \sum_{i=1}^m \alpha_i k(x_i, x) = \sum_{i=1}^m \alpha_i \sum_{j=1}^{\infty} \lambda_j e_j(x_i) e_j(x) = \sum_{j=1}^{\infty} f_j \left[\sqrt{\lambda_j}\, e_j(x)\right]$$
where $f_j = \sum_{i=1}^m \alpha_i \sqrt{\lambda_j}\, e_j(x_i)$.
[Figure: an RKHS function f(x) built from Gaussian bumps]
NOTE that this enforces smoothing: the λ_j decay as the e_j become rougher, so the f_j must decay, since ‖f‖²_F = ∑_j f_j² < ∞.
Probabilities in feature space: the mean trick
The kernel trick
• Given x ∈ X for some set X, define the feature map ϕ_x ∈ F,
$$\varphi_x = \begin{bmatrix} \ldots & \sqrt{\lambda_i}\, e_i(x) & \ldots \end{bmatrix} \in \ell_2$$
• For positive definite k(x, x′),
k(x, x′) = ⟨ϕ_x, ϕ_{x′}⟩_F
• The kernel trick: ∀f ∈ F,
f(x) = ⟨f, ϕ_x⟩_F

The mean trick
• Given P a Borel probability measure on X, define the feature map µ_P ∈ F,
$$\mu_P = \begin{bmatrix} \ldots & \sqrt{\lambda_i}\, E_P[e_i(X)] & \ldots \end{bmatrix} \in \ell_2$$
• For positive definite k(x, x′),
E_{P,Q} k(X, Y) = ⟨µ_P, µ_Q⟩_F
for X ∼ P and Y ∼ Q.
• The mean trick: (we call µ_P a mean/distribution embedding)
E_P(f(X)) =: ⟨µ_P, f⟩_F
Feature embeddings of probabilities
The kernel trick:
f(x) = ⟨f, ϕ_x⟩_F
The mean trick:
E_P(f(X)) = ⟨f, µ_P⟩_F
Empirical mean embedding:
$$\hat\mu_P = \frac{1}{m} \sum_{i=1}^m \varphi_{x_i}, \qquad x_i \overset{\text{i.i.d.}}{\sim} P$$
µ_P gives you expectations of all RKHS functions.
When k is characteristic, µ_P is unique to P; e.g. Gaussian, Laplace, . . .
The maximum mean discrepancy
The maximum mean discrepancy is the squared distance between feature means:
$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_F^2 = \langle \mu_P, \mu_P \rangle_F + \langle \mu_Q, \mu_Q \rangle_F - 2\langle \mu_P, \mu_Q \rangle_F = E_P k(x, x') + E_Q k(y, y') - 2 E_{P,Q} k(x, y)$$
An unbiased empirical estimate: for $\{x_i\}_{i=1}^m \sim P$ and $\{y_i\}_{i=1}^m \sim Q$,
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(y_i, y_j) - \frac{2}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(y_i, x_j)$$
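Illustration (not part of the original slides): a NumPy sketch of this unbiased quadratic-time estimate. The Gaussian kernel and the helper names are assumptions for the example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased quadratic-time estimate of MMD^2 for samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Within-sample sums run over i != j only, which removes the bias.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

# Example: P = N(0,1), Q = N(1,1) in 1-D; the estimate is far from zero.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (500, 1)), rng.normal(1, 1, (500, 1))))
```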
Maximum mean discrepancy: main idea
$$K_{PP} = \begin{bmatrix} k(X_1, X_1) & \cdots & k(X_1, X_n) \\ \vdots & \ddots & \vdots \\ k(X_n, X_1) & \cdots & k(X_n, X_n) \end{bmatrix} \qquad K_{QQ} = \begin{bmatrix} k(Y_1, Y_1) & \cdots & k(Y_1, Y_n) \\ \vdots & \ddots & \vdots \\ k(Y_n, Y_1) & \cdots & k(Y_n, Y_n) \end{bmatrix} \qquad K_{PQ} = \begin{bmatrix} k(X_1, Y_1) & \cdots & k(X_1, Y_n) \\ \vdots & \ddots & \vdots \\ k(X_n, Y_1) & \cdots & k(X_n, Y_n) \end{bmatrix}$$
[Each matrix shown as a heatmap on the slide]
Maximum mean discrepancy: main idea
$$P \neq Q \;\Rightarrow\; \begin{bmatrix} K_{PP} & K_{PQ} \\ K_{QP} & K_{QQ} \end{bmatrix} = \text{[heatmap]} \qquad\qquad P = Q \;\Rightarrow\; \begin{bmatrix} K_{PP} & K_{PQ} \\ K_{QP} & K_{QQ} \end{bmatrix} = \text{[heatmap]}$$
Statistical hypothesis testing
Statistical test using MMD
• Two hypotheses:
– H0: null hypothesis (P = Q)
– H1: alternative hypothesis (P 6= Q)
• Observe samples x := {x1, . . . , xm} from P and y from Q
• If the empirical MMD² is
– “far from zero”: reject H0
– “close to zero”: accept H0
Behavior of empirical MMD
$$V_n := \widehat{\mathrm{MMD}}^2 = \overline{K_{PP}} + \overline{K_{QQ}} - 2\,\overline{K_{PQ}}$$
(averages of the entries of the corresponding kernel matrices)
Statistical test using MMD
• When P = Q, the U-statistic is degenerate [Gretton et al., 2007, 2012a]
• Its asymptotic distribution is
$$m\,\mathrm{MMD}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[z_l^2 - 2\right]$$
• where
– $z_l \sim \mathcal{N}(0, 2)$ i.i.d.
– $\displaystyle\int_{\mathcal{X}} \underbrace{\tilde k(x, x')}_{\text{centred}} \psi_l(x)\, dP(x) = \lambda_l \psi_l(x')$
[Figure: empirical density of the MMD under H0]
Approximate distribution of MMD via permutation
With a vector of signs
$$w = (-1, 1, 1, \cdots, -1, -1, -1)^\top$$
(one entry per pooled sample; the signs encode a random reassignment of the pooled samples between P and Q),
$$V_{n,p} := \widehat{\mathrm{MMD}}^2_p = \overline{\begin{bmatrix} K_{PP} & K_{PQ} \\ K_{QP} & K_{QQ} \end{bmatrix} \odot \left[w w^\top\right]}$$
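Illustration (not part of the original slides): a sketch of the permutation null. Permuting the pooled sample permutes the rows and columns of the joint kernel matrix, so the matrix is computed once; the gaussian_kernel helper from the earlier sketch is an assumed example.

```python
import numpy as np

def mmd2_biased_from_K(K, m):
    """Biased MMD^2 from the pooled kernel matrix; first m points are from P."""
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def mmd_permutation_test(X, Y, kernel, n_perms=1000, alpha=0.05, seed=0):
    """Reject H0: P = Q when the statistic exceeds the permutation quantile."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    K = kernel(Z, Z)
    m = len(X)
    stat = mmd2_biased_from_K(K, m)
    null = np.empty(n_perms)
    for t in range(n_perms):
        idx = rng.permutation(len(Z))
        null[t] = mmd2_biased_from_K(K[np.ix_(idx, idx)], m)
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold
```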
Experiment: amplitude modulated signals
[Figure: samples from P (left) and samples from Q (right)]
Results: AM signals
[Figure: Type II error vs added noise, for the methods max ratio, opt, med, l2, maxmmd]
m = 10,000 and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.
Linear-cost test and kernel choice approach from Gretton et al. [2012b]
Kernel two-sample tests for big data, optimal kernel choice
Quadratic time estimate of MMD
$$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P k(x, x') + E_Q k(y, y') - 2 E_{P,Q} k(x, y)$$
Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:
The earlier estimate (quadratic time):
$$\widehat{E_P k(x, x')} = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x_i, x_j)$$
New, linear time estimate:
$$\widehat{E_P k(x, x')} = \frac{2}{m} \left[k(x_1, x_2) + k(x_3, x_4) + \ldots\right] = \frac{2}{m} \sum_{i=1}^{m/2} k(x_{2i-1}, x_{2i})$$
Linear time MMD
Shorter expression with explicit k dependence:
$$\mathrm{MMD}^2 =: \eta_k(p, q) = E_{xx'yy'}\, h_k(x, x', y, y') =: E_v h_k(v),$$
where
$$h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y),$$
and v := (x, x', y, y').
The linear time estimate again:
$$\hat\eta_k = \frac{2}{m} \sum_{i=1}^{m/2} h_k(v_i),$$
where $v_i := (x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i})$ and
$$h_k(v_i) := k(x_{2i-1}, x_{2i}) + k(y_{2i-1}, y_{2i}) - k(x_{2i-1}, y_{2i}) - k(x_{2i}, y_{2i-1})$$
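Illustration (not part of the original slides): a sketch of the linear-time statistic, together with the Gaussian threshold derived on the next slides. k_rowwise is an assumed row-wise kernel helper.

```python
import numpy as np
from scipy.stats import norm

def k_rowwise(A, B, sigma=1.0):
    """Gaussian kernel evaluated row-wise on two (n, d) arrays."""
    return np.exp(-np.sum((A - B) ** 2, axis=1) / (2 * sigma**2))

def mmd2_linear(X, Y, k=k_rowwise):
    """Linear-time estimate: average h_k over non-overlapping 4-tuples.
    X and Y are (m, d) arrays with m even."""
    h = (k(X[0::2], X[1::2]) + k(Y[0::2], Y[1::2])
         - k(X[0::2], Y[1::2]) - k(X[1::2], Y[0::2]))
    return h.mean(), h.var(ddof=1)

def linear_time_test(X, Y, alpha=0.05):
    """Since m^(1/2)(eta_hat - eta) -> N(0, 2 sigma_k^2), the level-alpha
    threshold is t = Phi^{-1}(1 - alpha) * sqrt(2 var(h) / m)."""
    eta_hat, var_h = mmd2_linear(X, Y)
    t = norm.ppf(1 - alpha) * np.sqrt(2 * var_h / len(X))
    return eta_hat > t
```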
Linear time vs quadratic time MMD
Disadvantages of linear time MMD vs quadratic time MMD
• Much higher variance for a given m, hence. . .
• . . .a much less powerful test for a given m
Advantages of the linear time MMD vs quadratic time MMD
• Very simple asymptotic null distribution (a Gaussian, vs an infinite
weighted sum of χ2)
• Both test statistic and threshold computable in O(m), with storage O(1).
• Given unlimited data, a given Type II error can be attained with less
computation
Asymptotics of linear time MMD
By the central limit theorem,
$$m^{1/2}\left(\hat\eta_k - \eta_k(p, q)\right) \xrightarrow{D} \mathcal{N}(0, 2\sigma_k^2)$$
• assuming $0 < E(h_k^2) < \infty$ (true for bounded k)
• $\sigma_k^2 = E_v h_k^2(v) - \left[E_v h_k(v)\right]^2$.
Hypothesis test
Hypothesis test of asymptotic level α:
$$t_{k,\alpha} = m^{-1/2} \sigma_k \sqrt{2}\, \Phi^{-1}(1 - \alpha),$$
where Φ⁻¹ is the inverse CDF of N(0, 1).
[Figure: null distribution of the linear-time MMD, P(η̂_k); threshold t_{k,α} at the (1 − α) quantile, with the Type I error to its right]
[Figure: null vs alternative distribution of η̂_k; the Type II error is the mass of the alternative, centred at η_k(p, q), falling below t_{k,α}]
The best kernel: minimizes Type II error
Type II error: η̂_k falls below the threshold t_{k,α} while η_k(p, q) > 0.
Prob. of a Type II error:
$$P\left(\hat\eta_k < t_{k,\alpha}\right) = \Phi\left(\Phi^{-1}(1 - \alpha) - \frac{\eta_k(p, q)\sqrt{m}}{\sigma_k \sqrt{2}}\right)$$
where Φ is the Normal CDF.
Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
$$k_* = \arg\max_{k \in \mathcal{K}} \eta_k(p, q)\, \sigma_k^{-1},$$
where K is the family of kernels under consideration.
Learning the best kernel in a family
Define the family of kernels as follows:
$$\mathcal{K} := \left\{ k : k = \sum_{u=1}^d \beta_u k_u,\; \|\beta\|_1 = D,\; \beta_u \geq 0,\; \forall u \in \{1, \ldots, d\} \right\}.$$
Properties, if at least one β_u > 0:
• all k ∈ K are valid kernels,
• if all k_u are characteristic, then k is characteristic.
Test statistic
The squared MMD becomes
$$\eta_k(p, q) = \|\mu_k(p) - \mu_k(q)\|^2_{\mathcal{F}_k} = \sum_{u=1}^d \beta_u \eta_u(p, q),$$
where $\eta_u(p, q) := E_v h_u(v)$.
Denote:
• $\beta = (\beta_1, \beta_2, \ldots, \beta_d)^\top \in \mathbb{R}^d$,
• $h = (h_1, h_2, \ldots, h_d)^\top \in \mathbb{R}^d$, with
$h_u(x, x', y, y') = k_u(x, x') + k_u(y, y') - k_u(x, y') - k_u(x', y)$
• $\eta = E_v(h) = (\eta_1, \eta_2, \ldots, \eta_d)^\top \in \mathbb{R}^d$.
Quantities for the test:
$$\eta_k(p, q) = E(\beta^\top h) = \beta^\top \eta, \qquad \sigma_k^2 := \beta^\top \mathrm{cov}(h)\, \beta.$$
Optimization of the ratio η_k(p, q)σ_k⁻¹
Empirical test parameters:
$$\hat\eta_k = \beta^\top \hat\eta, \qquad \hat\sigma_{k,\lambda} = \sqrt{\beta^\top \left(\hat Q + \lambda_m I\right) \beta},$$
where Q̂ is the empirical estimate of cov(h).
Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̂_k, σ̂_k on the data to be tested (why?)
Objective:
$$\beta^* = \arg\max_{\beta \succeq 0} \hat\eta_k \hat\sigma_{k,\lambda}^{-1} = \arg\max_{\beta \succeq 0} \left(\beta^\top \hat\eta\right) \left(\beta^\top \left(\hat Q + \lambda_m I\right) \beta\right)^{-1/2} =: \alpha(\beta; \hat\eta, \hat Q)$$
Optimization of the ratio η_k(p, q)σ_k⁻¹
Assume: η̂ has at least one positive entry.
Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0, and thus α(β*; η̂, Q̂) > 0.
Solve the easier problem: $\beta^* = \arg\max_{\beta \succeq 0} \alpha^2(\beta; \hat\eta, \hat Q)$.
Quadratic program:
$$\min\left\{ \beta^\top \left(\hat Q + \lambda_m I\right) \beta \;:\; \beta^\top \hat\eta = 1,\; \beta \succeq 0 \right\}$$
What if η̂ has no positive entries?
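Illustration (not part of the original slides): one way to solve this small QP, here with SciPy's general-purpose SLSQP solver rather than a dedicated QP solver. eta_hat and Q_hat are the training-set estimates defined above, and D = 1 is assumed for the final rescaling.

```python
import numpy as np
from scipy.optimize import minimize

def select_kernel_weights(eta_hat, Q_hat, lam):
    """Solve min beta' (Q_hat + lam I) beta  s.t.  beta' eta_hat = 1, beta >= 0.
    Requires eta_hat to have at least one positive entry (else infeasible)."""
    d = len(eta_hat)
    A = Q_hat + lam * np.eye(d)
    res = minimize(
        lambda b: b @ A @ b,            # quadratic objective
        x0=np.full(d, 1.0 / d),
        jac=lambda b: 2 * A @ b,
        bounds=[(0.0, None)] * d,       # beta >= 0
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
        method="SLSQP",
    )
    beta = res.x
    return beta / beta.sum()            # rescale onto ||beta||_1 = D = 1
```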
Test procedure
1. Split the data into training and test sets.
2. On the training data:
(a) Compute η̂_u for all k_u ∈ K
(b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K
3. On the test data:
(a) Compute η̂_{k*} using $k_* = \sum_{u=1}^d \beta^*_u k_u$
(b) Compute the test threshold t_{α,k*} using σ̂_{k*}
4. Reject the null if η̂_{k*} > t_{α,k*}
Convergence bounds
Assume a bounded kernel, with σ_k bounded away from 0.
If $\lambda_m = \Theta(m^{-1/3})$ then
$$\left| \sup_{k \in \mathcal{K}} \hat\eta_k \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k \sigma_k^{-1} \right| = O_P\left(m^{-1/3}\right).$$
Idea:
$$\left| \sup_{k \in \mathcal{K}} \hat\eta_k \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k \sigma_k^{-1} \right| \leq \sup_{k \in \mathcal{K}} \left| \hat\eta_k \hat\sigma_{k,\lambda}^{-1} - \eta_k \sigma_{k,\lambda}^{-1} \right| + \sup_{k \in \mathcal{K}} \left| \eta_k \sigma_{k,\lambda}^{-1} - \eta_k \sigma_k^{-1} \right| \leq \frac{\sqrt{d}}{D\sqrt{\lambda_m}} \left( C_1 \sup_{k \in \mathcal{K}} |\hat\eta_k - \eta_k| + C_2 \sup_{k \in \mathcal{K}} |\hat\sigma_{k,\lambda} - \sigma_{k,\lambda}| \right) + C_3 D^2 \lambda_m
$$
Experiments
Competing approaches:
• Median heuristic
• Max. MMD: choose the k_u ∈ K with the largest η̂_u
– same as maximizing β^⊤η̂ subject to ‖β‖₁ ≤ 1
• ℓ₂ statistic: maximize β^⊤η̂ subject to ‖β‖₂ ≤ 1
• Cross validation on the training set
Also compare with:
• the single kernel that maximizes the ratio η_k(p, q)σ_k⁻¹
Blobs: data
Difficult problems: lengthscale of the difference in distributions not the same
as that of the distributions.
We distinguish a field of Gaussian blobs with different covariances.
[Figure: blob data, samples from p and from q in the (x1, x2) plane]
Ratio ε = 3.2 of largest to smallest eigenvalues of the blobs in q.
Blobs: results
[Figure: Type II error vs ε ratio, for the methods max ratio, opt, l2, maxmmd, xval, xvalc, med]
Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of the blobs in q. Results are averages over 617 trials.
Blobs: results
[The same figure repeated, highlighting in turn: optimizing the ratio η_k(p, q)σ_k⁻¹; maximizing η_k(p, q) with a constraint on β; and the median heuristic]
Feature selection: data
Idea: no single best kernel.
Each of the ku are univariate (along a single coordinate)
[Figure: selection data, samples from p and q in the (x1, x2) plane]
Feature selection: results
[Figure: Type II error vs dimension for feature selection, comparing max ratio, opt, l2, maxmmd; a single best kernel vs a linear combination]
m = 10,000, average over 5000 trials
Amplitude modulated signals
Given an audio signal s(t), an amplitude modulated signal can be defined as
$$u(t) = \sin(\omega_c t)\left[a\, s(t) + l\right]$$
• ω_c: carrier frequency
• a = 0.2 is the signal scaling, l = 2 is the offset
Two amplitude modulated signals from the same artist (in this case, the Magnetic Fields).
• Music sampled at 8 kHz (very low)
• Carrier frequency is 24 kHz
• AM signal observed at 120 kHz
• Samples are extracts of length N = 1000, approx. 0.01 sec (very short)
• Total dataset size is 30,000 samples from each of p, q.
Amplitude modulated signals
[Figure: samples from P (left) and samples from Q (right)]
Results: AM signals
[Figure: Type II error vs added noise, for the methods max ratio, opt, med, l2, maxmmd]
m = 10,000 (for training and test) and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.
Observations on kernel choice
• It is possible to choose the best kernel for a kernel two-sample test.
• Kernel choice matters for “difficult” problems, where the distributions differ on a lengthscale different to that of the data.
• Ongoing work:
– quadratic time statistic
– avoiding the training/test split
Testing Independence
MMD for independence
• Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
Related to [Feuerverger, 1993] and [Szekely and Rizzo, 2009, Szekely et al., 2007]
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) := \left\| \mu_{P_{XY}} - \mu_{P_X P_Y} \right\|^2$$
HSIC using expectations of kernels:
Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) = E_{XY} E_{X'Y'} k(x, x') l(y, y') + E_X E_{X'} k(x, x')\, E_Y E_{Y'} l(y, y') - 2 E_{X'Y'} \left[ E_X k(x, x')\, E_Y l(y, y') \right].$$
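Illustration (not part of the original slides): the biased V-statistic estimate of HSIC used later in the talk, (1/n²)(HKH ∘ HLH)₊₊, where ₊₊ denotes the sum of all entries. A permutation of the Y sample (shuffling the rows and columns of L together) gives a null distribution for an independence test.

```python
import numpy as np

def hsic_biased(K, L):
    """Biased V-statistic estimate of HSIC from kernel matrices K (on the
    X sample) and L (on the Y sample): (1/n^2) * sum((HKH) * (HLH))."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return ((H @ K @ H) * (H @ L @ H)).sum() / n**2
```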
Experiment: dependence testing for translation
• Translation example: [NIPS07b] Canadian Hansard (agriculture)
• 5-line extracts, k-spectrum kernel with k = 10, 300 repetitions, sample size 10
• Empirical MMD(P_XY, P_X P_Y):
$$\frac{1}{n^2}\left(HKH \circ HLH\right)_{++}$$
(K computed on one language's extracts, L on the other's)
• k-spectrum kernel: average Type II error 0 (α = 0.05)
• Bag of words kernel: average Type II error 0.18
Lancaster (3-way) Interactions
Detecting a higher order interaction
• How to detect V-structures with pairwise weak individual dependence?
• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
[Figure: pairwise scatter plots X vs Y, Y vs Z, X vs Z, and XY vs Z; V-structure X → Z ← Y]
• X, Y i.i.d. ∼ N(0, 1)
• Z | X, Y ∼ sign(XY) Exp(1/√2)
Faithfulness is violated here.
V-structure Discovery
[V-structure: X → Z ← Y]
Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:
• A consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or
• A factorisation test: H0 : (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests)
– compute p-values for each of the marginal tests (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, and (X, Y) ⊥⊥ Z
– apply the Holm-Bonferroni (HB) sequentially rejective correction (Holm 1979)
V-structure Discovery (2)
• How to detect V-structures with pairwise weak (or nonexistent) dependence?
• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
[Figure: pairwise scatter plots X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1·Y1 vs Z1; V-structure X → Z ← Y]
• X1, Y1 i.i.d. ∼ N(0, 1)
• Z1 | X1, Y1 ∼ sign(X1 Y1) Exp(1/√2)
• X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ∼ N(0, I_{p−1})
Faithfulness is violated here.
V-structure Discovery (3)
[Figure: null acceptance rate (Type II error) vs dimension, for the CI test X ⊥⊥ Y | Z and a two-variable factorisation test; Dataset A]
Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al (2011), and a factorisation test with a HB correction, n = 500
Lancaster Interaction Measure
[Bahadur (1961); Lancaster (1969)] The interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.
• D = 2: ∆_L P = P_{XY} − P_X P_Y
• D = 3: ∆_L P = P_{XYZ} − P_X P_{YZ} − P_Y P_{XZ} − P_Z P_{XY} + 2 P_X P_Y P_Z
[Diagram: the five terms of ∆_L P drawn as graphs over (X, Y, Z); in the case X ⊥⊥ (Y, Z), ∆_L P = 0]
(X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆_L P = 0.
. . . so what might be missed?
∆_L P = 0 ⇏ (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X
Example:
P(0, 0, 0) = 0.2  P(0, 0, 1) = 0.1  P(1, 0, 0) = 0.1  P(1, 0, 1) = 0.1
P(0, 1, 0) = 0.1  P(0, 1, 1) = 0.1  P(1, 1, 0) = 0.1  P(1, 1, 1) = 0.2
A Test using the Lancaster Measure
• The test statistic is the empirical estimate of $\|\mu_\kappa(\Delta_L P)\|^2_{\mathcal{H}_\kappa}$, where κ = k ⊗ l ⊗ m:
$$\|\mu_\kappa(P_{XYZ} - P_{XY} P_Z - \cdots)\|^2_{\mathcal{H}_\kappa} = \langle \mu_\kappa P_{XYZ}, \mu_\kappa P_{XYZ} \rangle_{\mathcal{H}_\kappa} - 2 \langle \mu_\kappa P_{XYZ}, \mu_\kappa P_{XY} P_Z \rangle_{\mathcal{H}_\kappa} + \cdots$$
Inner Product Estimators

Table 1: V-statistic estimators of ⟨µ_κ ν, µ_κ ν′⟩_{H_κ}

ν \ ν′       | P_XYZ        | P_XY P_Z      | P_XZ P_Y      | P_YZ P_X      | P_X P_Y P_Z
P_XYZ        | (K∘L∘M)₊₊    | ((K∘L)M)₊₊    | ((K∘M)L)₊₊    | ((M∘L)K)₊₊    | tr(K₊ ∘ L₊ ∘ M₊)
P_XY P_Z     |              | (K∘L)₊₊ M₊₊   | (MKL)₊₊       | (KLM)₊₊       | (KL)₊₊ M₊₊
P_XZ P_Y     |              |               | (K∘M)₊₊ L₊₊   | (KML)₊₊       | (KM)₊₊ L₊₊
P_YZ P_X     |              |               |               | (L∘M)₊₊ K₊₊   | (LM)₊₊ K₊₊
P_X P_Y P_Z  |              |               |               |               | K₊₊ L₊₊ M₊₊

$$\|\mu_\kappa(\Delta_L P)\|^2_{\mathcal{H}_\kappa} = \frac{1}{n^2}\left(HKH \circ HLH \circ HMH\right)_{++}.$$
An empirical joint central moment in the feature space.
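Illustration (not part of the original slides): a direct sketch of this statistic, where K, L, M are the kernel matrices on the X, Y, Z samples.

```python
import numpy as np

def lancaster_statistic(K, L, M):
    """Empirical ||mu_kappa(Delta_L P)||^2 = (1/n^2) (HKH ∘ HLH ∘ HMH)_{++}."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H
    return (Kc * Lc * Mc).sum() / n**2
```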
Example A: factorisation tests
[Figure: null acceptance rate (Type II error) vs dimension, for the CI test, the Lancaster factorisation test (∆_L), and the two-variable factorisation test; Dataset A]
Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al (2011), n = 500
Example B: joint dependence can be easier to detect
• X1, Y1 i.i.d. ∼ N(0, 1)
• Z1 is a mixture:
$$Z_1 = \begin{cases} X_1^2 + \epsilon, & \text{w.p. } 1/3, \\ Y_1^2 + \epsilon, & \text{w.p. } 1/3, \\ X_1 Y_1 + \epsilon, & \text{w.p. } 1/3, \end{cases}$$
where ε ∼ N(0, 0.1²).
• X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ∼ N(0, I_{p−1})
• the dependence of Z on the pair (X, Y) is stronger than on X and Y individually
• faithfulness is satisfied here
Example B: factorisation tests
[Figure: null acceptance rate (Type II error) vs dimension, for the CI test, the Lancaster factorisation test (∆_L), and the two-variable factorisation test; Dataset B]
Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al (2011), n = 500
Interaction for D ≥ 4
• Interaction measure valid for all D (Streitberg, 1990):
$$\Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1} (|\pi| - 1)!\, J_\pi P$$
– For a partition π, J_π associates to the joint the corresponding factorisation, e.g., $J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}$.
[Figure: the number of partitions of {1, . . . , D} (the Bell numbers) grows extremely fast with D]
joint central moments (Lancaster interaction) vs. joint cumulants (Streitberg interaction)
Total independence test
• Total independence test:
H0 : P_{XYZ} = P_X P_Y P_Z vs. H1 : P_{XYZ} ≠ P_X P_Y P_Z
• For (X1, . . . , XD) ∼ P_X and $\kappa = \bigotimes_{i=1}^D k^{(i)}$:
$$\Bigl\| \mu_\kappa \underbrace{\Bigl( P_{\mathbf{X}} - \prod_{i=1}^D P_{X_i} \Bigr)}_{\Delta_{tot} P} \Bigr\|^2_{\mathcal{H}_\kappa} = \frac{1}{n^2} \sum_{a=1}^n \sum_{b=1}^n \prod_{i=1}^D K^{(i)}_{ab} - \frac{2}{n^{D+1}} \sum_{a=1}^n \prod_{i=1}^D \sum_{b=1}^n K^{(i)}_{ab} + \frac{1}{n^{2D}} \prod_{i=1}^D \sum_{a=1}^n \sum_{b=1}^n K^{(i)}_{ab}.$$
• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: a similar relationship to that between dCov and HSIC (DS et al, 2013)
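Illustration (not part of the original slides): the three terms of the total independence statistic, computed from per-variable kernel matrices.

```python
import numpy as np

def total_independence_statistic(Ks):
    """Empirical ||mu_kappa(Delta_tot P)||^2 for D variables, given a list
    Ks = [K1, ..., KD] of n x n kernel matrices, one per variable."""
    n = Ks[0].shape[0]
    D = len(Ks)
    prod_ab = np.ones((n, n))   # elementwise prod_i K_i[a, b]
    row = np.ones(n)            # prod_i sum_b K_i[a, b]
    tot = 1.0                   # prod_i sum_{a,b} K_i[a, b]
    for K in Ks:
        prod_ab *= K
        row *= K.sum(axis=1)
        tot *= K.sum()
    return (prod_ab.sum() / n**2
            - 2 * row.sum() / n**(D + 1)
            + tot / n**(2 * D))
```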
Example B: total independence tests
[Figure: null acceptance rate (Type II error) vs dimension, for the total independence tests based on ∆_tot P and on ∆_L P; Dataset B]
Figure 4: Total independence: ∆_tot P vs. ∆_L P, n = 500
Nonparametric Bayesian inference using distribution embeddings
Motivating Example: Bayesian inference without a model
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200)
• 1800 training frames, the remainder for test
• Gaussian noise added to Y_t
Challenges:
• No parametric model of the camera dynamics (only samples)
• No parametric model of the map from camera angle to image (only samples)
• Want to do filtering: Bayesian inference
ABC: an approach to Bayesian inference without a model
Bayes rule:
$$P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$$
• P(x|y) is the likelihood
• π(y) is the prior
One approach: Approximate Bayesian Computation (ABC)
ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure: ABC demonstration. Samples (y, x) with y drawn from the prior π(Y) and x from the likelihood P(X|y); given an observation x*, the accepted y values approximate P(Y|x*)]
Needed: a distance measure D and a tolerance parameter τ.
ABC: an approach to Bayesian inference without a model
Bayes rule:
$$P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$$
• P(x|y) is the likelihood
• π(y) is the prior
ABC generates a sample from P(Y|x*) as follows:
1. generate a sample y_t from the prior π,
2. generate a sample x_t from P(X|y_t),
3. if D(x*, x_t) < τ, accept y = y_t; otherwise reject,
4. go to (1).
In step (3), D is a distance measure, and τ is a tolerance parameter.
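Illustration (not part of the original slides): a minimal rejection-ABC sketch; the samplers, the distance, and the conjugate Gaussian toy example are assumptions for the demonstration.

```python
import numpy as np

def abc_rejection(x_star, sample_prior, sample_likelihood, dist, tau,
                  n_samples, seed=0):
    """Rejection ABC: draw y ~ prior, x ~ P(X|y); keep y when D(x*, x) < tau."""
    rng = np.random.default_rng(seed)
    accepted = []
    while len(accepted) < n_samples:
        y = sample_prior(rng)
        x = sample_likelihood(y, rng)
        if dist(x_star, x) < tau:
            accepted.append(y)
    return np.array(accepted)

# Toy example: y ~ N(0, 1), x | y ~ N(y, 1), observation x* = 1.
post = abc_rejection(
    x_star=1.0,
    sample_prior=lambda rng: rng.normal(0.0, 1.0),
    sample_likelihood=lambda y, rng: rng.normal(y, 1.0),
    dist=lambda a, b: abs(a - b),
    tau=0.1,
    n_samples=500,
)
print(post.mean())  # close to 0.5, the exact posterior mean in this toy case
```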
Motivating example 2: simple Gaussian case
• p(x, y) is N((0, 1_d^⊤)^⊤, V), with V a randomly generated covariance
Posterior mean on x: ABC vs the kernel approach
[Figure: CPU time vs mean squared error (6 dim.) for KBI, COND, and ABC; points annotated with sample sizes]
Bayes again
Bayes rule:
$$P(y|x) = \frac{P(x|y)\,\pi(y)}{\int P(x|y)\,\pi(y)\,dy}$$
• P(x|y) is the likelihood
• π is the prior
How would this look with kernel embeddings?
Define an RKHS G on Y with feature map ψ_y and kernel l(y, ·).
We need a conditional mean embedding: for all g ∈ G,
$$E_{Y|x^*}\, g(Y) = \langle g, \mu_{P(y|x^*)} \rangle_{\mathcal{G}}$$
This will be obtained by RKHS-valued ridge regression.
Ridge regression and the conditional feature mean
Ridge regression from X := R^d to a finite vector output Y := R^{d′} (these could be d′ nonlinear features of y):
Define training data
$$X = \begin{bmatrix} x_1 & \ldots & x_m \end{bmatrix} \in \mathbb{R}^{d \times m}, \qquad Y = \begin{bmatrix} y_1 & \ldots & y_m \end{bmatrix} \in \mathbb{R}^{d' \times m}$$
Solve
$$A = \arg\min_{A \in \mathbb{R}^{d' \times d}} \left( \|Y - AX\|^2 + \lambda \|A\|^2_{HS} \right),$$
where
$$\|A\|^2_{HS} = \mathrm{tr}(A^\top A) = \sum_{i=1}^{\min\{d, d'\}} \gamma_{A,i}^2$$
(the γ_{A,i} are the singular values of A).
Solution: $A = C_{YX} (C_{XX} + m\lambda I)^{-1}$
Ridge regression and the conditional feature mean
Prediction at a new point x:
$$y^* = Ax = C_{YX} (C_{XX} + m\lambda I)^{-1} x = \sum_{i=1}^m \beta_i(x)\, y_i$$
where
$$\beta(x) = (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1, x) & \ldots & k(x_m, x) \end{bmatrix}^\top$$
and K := X^⊤X, k(x₁, x) = x₁^⊤x.
What if we do everything in kernel space?
Ridge regression and the conditional feature mean
Recall our setup:
• Given training pairs: (x_i, y_i) ∼ P_XY
• F on X with feature map ϕ_x and kernel k(x, ·)
• G on Y with feature map ψ_y and kernel l(y, ·)
We define the covariance between feature maps:
$$C_{XX} = E_X (\varphi_X \otimes \varphi_X), \qquad C_{XY} = E_{XY} (\varphi_X \otimes \psi_Y)$$
and matrices of feature-mapped training data
$$X = \begin{bmatrix} \varphi_{x_1} & \ldots & \varphi_{x_m} \end{bmatrix}, \qquad Y := \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix}$$
Ridge regression and the conditional feature mean
Objective: [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]
$$A = \arg\min_{A \in HS(\mathcal{F}, \mathcal{G})} \left( E_{XY} \|Y - AX\|^2_{\mathcal{G}} + \lambda \|A\|^2_{HS} \right), \qquad \|A\|^2_{HS} = \sum_{i=1}^{\infty} \gamma_{A,i}^2$$
Solution, same as the vector case:
$$A = C_{YX} (C_{XX} + m\lambda I)^{-1},$$
Prediction at a new x using kernels:
$$A \varphi_x = \begin{bmatrix} \psi_{y_1} & \ldots & \psi_{y_m} \end{bmatrix} (K + \lambda m I)^{-1} \begin{bmatrix} k(x_1, x) & \ldots & k(x_m, x) \end{bmatrix}^\top = \sum_{i=1}^m \beta_i(x)\, \psi_{y_i}$$
where K_{ij} = k(x_i, x_j)
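Illustration (not part of the original slides): the weights β(x) computed from data; a Gram-matrix kernel helper like the earlier gaussian_kernel sketch is assumed. For any g ∈ G, E[g(Y)|X = x] is then approximated by Σᵢ βᵢ(x) g(yᵢ).

```python
import numpy as np

def conditional_mean_weights(X_train, x, kernel, lam):
    """beta(x) = (K + lam*m*I)^{-1} [k(x_1, x), ..., k(x_m, x)]^T, so that
    the conditional mean embedding is mu_{Y|x} = sum_i beta_i(x) psi_{y_i}."""
    m = len(X_train)
    K = kernel(X_train, X_train)
    kx = kernel(X_train, x[None, :]).ravel()
    return np.linalg.solve(K + lam * m * np.eye(m), kx)

# E.g. estimating E[Y | X = x] itself (taking g to be each coordinate of y):
# y_hat = conditional_mean_weights(X_tr, x, gaussian_kernel, 1e-3) @ Y_tr
```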
Ridge regression and the conditional feature mean
How is the loss ‖Y − AX‖²_G relevant to a conditional expectation E_{Y|x} g(Y)? Define [Song et al. (2009), Grunewalder et al. (2013)]:
$$\mu_{Y|x} := A \varphi_x$$
We need A to have the property
$$E_{Y|x}\, g(Y) \approx \langle g, \mu_{Y|x} \rangle_{\mathcal{G}} = \langle g, A\varphi_x \rangle_{\mathcal{G}} = \langle A^* g, \varphi_x \rangle_{\mathcal{F}} = (A^* g)(x)$$
Natural risk function for the conditional mean:
$$L(A, P_{XY}) := \sup_{\|g\| \leq 1} E_X \Bigl[ \underbrace{\bigl(E_{Y|X}\, g(Y)\bigr)}_{\text{Target}}(X) - \underbrace{(A^* g)}_{\text{Estimator}}(X) \Bigr]^2$$
Ridge regression and the conditional feature mean
The squared loss risk provides an upper bound on the natural risk:
$$L(A, P_{XY}) \leq E_{XY} \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}}$$
Proof: Jensen and Cauchy-Schwarz.
$$\begin{aligned} L(A, P_{XY}) &:= \sup_{\|g\| \leq 1} E_X \left[ \left(E_{Y|X}\, g(Y)\right)(X) - (A^* g)(X) \right]^2 \\ &\leq E_{XY} \sup_{\|g\| \leq 1} \left[ g(Y) - (A^* g)(X) \right]^2 \\ &= E_{XY} \sup_{\|g\| \leq 1} \left[ \langle g, \psi_Y \rangle_{\mathcal{G}} - \langle g, A\varphi_X \rangle_{\mathcal{G}} \right]^2 \\ &= E_{XY} \sup_{\|g\| \leq 1} \langle g, \psi_Y - A\varphi_X \rangle^2_{\mathcal{G}} \\ &\leq E_{XY} \|\psi_Y - A\varphi_X\|^2_{\mathcal{G}} \end{aligned}$$
If we assume E_Y[g(Y)|X = x] ∈ F, then the upper bound is tight (next slide).
Conditions for ridge regression = conditional mean
The conditional mean is obtained by ridge regression when E_Y[g(Y)|X = x] ∈ F:
Given a function g ∈ G, assume E_{Y|X}[g(Y)|X = ·] ∈ F. Then
$$C_{XX}\, E_{Y|X}[g(Y)|X = \cdot] = C_{XY}\, g.$$
Why this is useful:
$$E_{Y|X}[g(Y)|X = x] = \langle E_{Y|X}[g(Y)|X = \cdot], \varphi_x \rangle_{\mathcal{F}} = \langle C_{XX}^{-1} C_{XY} g, \varphi_x \rangle_{\mathcal{F}} = \langle g, \underbrace{C_{YX} C_{XX}^{-1}}_{\text{regression}} \varphi_x \rangle_{\mathcal{G}}$$
Proof [Fukumizu et al., 2004]: for all f ∈ F, by the definition of C_XX,
$$\left\langle f, C_{XX}\, E_{Y|X}[g(Y)|X = \cdot] \right\rangle_{\mathcal{F}} = \mathrm{cov}\left( f,\, E_{Y|X}[g(Y)|X = \cdot] \right) = E_X\left( f(X)\, E_{Y|X}[g(Y)|X] \right) = E_{XY}\left( f(X)\, g(Y) \right) = \langle f, C_{XY}\, g \rangle,$$
by the definition of C_XY.
Kernel Bayes’ law
• Prior: Y ∼ π(y)
• Likelihood: (X|y) ∼ P(x|y), with some joint P(x, y)
• Joint distribution: Q(x, y) = P(x|y) π(y)
Warning: Q ≠ P; this is a change of measure from P(y) to π(y)
• Marginal for x:
$$Q(x) := \int P(x|y)\, \pi(y)\, dy.$$
• Bayes' law:
$$Q(y|x) = \frac{P(x|y)\, \pi(y)}{Q(x)}$$
Kernel Bayes’ law
• Posterior embedding via the usual conditional update:
$$\mu_{Q(y|x)} = C_{Q(y,x)}\, C^{-1}_{Q(x,x)}\, \varphi_x.$$
• Given the mean embedding of the prior, µ_{π(y)}:
• Define the marginal covariance:
$$C_{Q(x,x)} = \int (\varphi_x \otimes \varphi_x)\, P(x|y)\, \pi(y)\, dx\, dy = C_{(xx)y}\, C_{yy}^{-1}\, \mu_{\pi(y)}$$
• Define the cross-covariance:
$$C_{Q(y,x)} = \int (\psi_y \otimes \varphi_x)\, P(x|y)\, \pi(y)\, dx\, dy = C_{(yx)y}\, C_{yy}^{-1}\, \mu_{\pi(y)}.$$
Kernel Bayes’ law: consistency result
• How to compute the posterior expectation from data?
• Given samples: {(x_i, y_i)}_{i=1}^n from P_xy, and {u_j}_{j=1}^n from the prior π.
• Want to compute E[g(Y)|X = x] for g in G
• For any x ∈ X,
$$\left| \mathbf{g}_y^\top R_{Y|X}\, \mathbf{k}_X(x) - E[g(Y)|X = x] \right| = O_p\left(n^{-4/27}\right), \quad (n \to \infty),$$
where
– $\mathbf{g}_y = (g(y_1), \ldots, g(y_n))^\top \in \mathbb{R}^n$,
– $\mathbf{k}_X(x) = (k(x_1, x), \ldots, k(x_n, x))^\top \in \mathbb{R}^n$,
– $R_{Y|X}$ is learned from the samples, and contains the u_j.
Smoothness assumptions:
• $\pi / p_Y \in \mathcal{R}(C_{YY}^{1/2})$, where p_Y is the p.d.f. of P_Y,
• $E[g(Y)|X = \cdot] \in \mathcal{R}(C^2_{Q(xx)})$.
Experiment: Kernel Bayes’ law vs EKF
• Compare with the extended Kalman filter (EKF) on a camera orientation task
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200)
• 1800 training frames, the remainder for test
• Gaussian noise added to Y_t
Average MSE and standard errors (10 runs):

            | KBR (Gauss)    | KBR (Tr)       | Kalman (9 dim.) | Kalman (Quat.)
σ² = 10⁻⁴   | 0.210 ± 0.015  | 0.146 ± 0.003  | 1.980 ± 0.083   | 0.557 ± 0.023
σ² = 10⁻³   | 0.222 ± 0.009  | 0.210 ± 0.008  | 1.935 ± 0.064   | 0.541 ± 0.022
Co-authors
• From UCL:
– Kacper Chwialkowski
– Heiko Strathmann
• External:
– Wicher Bergsma, LSE
– Karsten Borgwardt, ETH
Zurich
– Kenji Fukumizu, ISM
– Bernhard Schoelkopf, MPI
– Alexander Smola,
CMU/Google
– Le Song, Georgia Tech
– Dino Sejdinovic, Oxford
– Bharath Sriperumbudur,
Penn State
Questions?
Energy Distance and the MMD
Energy distance and MMD
Distance between probability distributions:
Energy distance [Baringhaus and Franz, 2004, Szekely and Rizzo, 2004, 2005]:
$$D_E(P, Q) = 2 E_{P,Q}\|X - Y\| - E_P\|X - X'\| - E_Q\|Y - Y'\|$$
Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012a]:
$$\mathrm{MMD}^2(P, Q; F) = E_P k(X, X') + E_Q k(Y, Y') - 2 E_{P,Q} k(X, Y)$$
Energy distance is MMD with a particular kernel!
Distance covariance and HSIC
Distance covariance [Szekely and Rizzo, 2009, Szekely et al., 2007]:
$$V^2(X, Y) = E_{XY} E_{X'Y'}\left[ \|X - X'\| \|Y - Y'\| \right] + E_X E_{X'}\|X - X'\|\; E_Y E_{Y'}\|Y - Y'\| - 2 E_{XY}\left[ E_{X'}\|X - X'\|\; E_{Y'}\|Y - Y'\| \right]$$
Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]:
Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) = E_{XY} E_{X'Y'}\, k(X, X')\, l(Y, Y') + E_X E_{X'}\, k(X, X')\; E_Y E_{Y'}\, l(Y, Y') - 2 E_{X'Y'}\left[ E_X k(X, X')\; E_Y l(Y, Y') \right].$$
Distance covariance is HSIC with particular kernels!
Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : X × X → R be a semimetric on X. Let z₀ ∈ X, and denote
$$k_\rho(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z').$$
Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type.
Call k_ρ a distance-induced kernel.
Special case: Z ⊆ R^d and ρ_q(z, z') = ‖z − z'‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.
Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
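Illustration (not part of the original slides): constructing a distance-induced kernel. Plugging it (with q = 1) into the earlier MMD estimator recovers the energy distance, independently of the choice of z₀.

```python
import numpy as np

def distance_induced_kernel(Z1, Z2, z0, q=1.0):
    """k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z'), with
    rho_q(z, z') = ||z - z'||^q (negative type for 0 < q <= 2)."""
    d1 = np.linalg.norm(Z1 - z0, axis=1) ** q            # rho(z, z0)
    d2 = np.linalg.norm(Z2 - z0, axis=1) ** q            # rho(z', z0)
    d12 = np.linalg.norm(Z1[:, None, :] - Z2[None, :, :], axis=2) ** q
    return d1[:, None] + d2[None, :] - d12
```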
Two-sample testing benchmark
Two-sample testing example in 1-D:
[Figure: a density P(X) compared against three alternative densities Q(X)]
Two-sample test, MMD with distance kernel
More powerful tests are obtained on this problem when q ≠ 1 (the exponent of the distance).
Key:
• Gaussian kernel
• q = 1
• Best: q = 1/3
• Worst: q = 2
Selected references
Characteristic kernels and mean embeddings:
• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space
embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.
Two-sample, independence, conditional independence tests:
• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of
independence. NIPS
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample
test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR
Energy distance, relation to kernel distances
• Sejdinovic, D., Sriperumbudur, B., Gretton, A.,, Fukumizu, K., (2013). Equivalence of distance-based and
rkhs-based statistics in hypothesis testing. Annals of Statistics.
Three way interaction
• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
Selected references (continued)
Conditional mean embedding, RKHS-valued regression:
• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V., (2003). Kernel Dependency
Estimation, NIPS.
• Micchelli, C., and Pontil, M., (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm.
Foundations of Computational Mathematics.
• Song, L., and Huang, J., and Smola, A., Fukumizu, K., (2009). Hilbert Space Embeddings of Conditional
Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean
embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.
Kernel Bayes rule:
• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified
kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite
kernels, JMLR
References

L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.

A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

A. Gretton, O. Bousquet, A. J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.

A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Scholkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012a.

A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012b.

T. Read and N. Cressie. Goodness-Of-Fit Statistics for Discrete Multivariate Analysis. Springer-Verlag, New York, 1988.

A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.

G. Szekely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

G. Szekely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

G. Szekely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.

G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

K. Zhang, J. Peters, D. Janzing, and B. Scholkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.