Vector-valued Distribution Regression – Keep It Simple and Consistent
Zoltan Szabo
Joint work with Bharath K. Sriperumbudur (PSU), Barnabas Poczos (CMU), Arthur Gretton (UCL)

CSML Reading Group, Department of Statistics, University of Oxford
May 1, 2015
Zoltan Szabo Vector-valued Distribution Regression
The task
Samples: {(x_i, y_i)}_{i=1}^l. Goal: find f ∈ H with f(x_i) ≈ y_i.

Distribution regression:

the x_i are distributions, available only through samples {x_{i,n}}_{n=1}^{N_i}.

⇒ Training examples: labelled bags.
Example: aerosol prediction from satellite images
Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.

Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5–8.5. Can distribution regression do better?
Wider context
Context:

machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).

Applications:

computer vision: image = collection of patch vectors,
network analysis: group of people = bag of friendship graphs,
natural language processing: corpus = bag of documents,
time-series modelling: user = set of trial time-series.
Several algorithmic approaches
1 Parametric fit: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].

2 Kernelized Gaussian measures: [Jebara et al., 2004, Zhou and Chellappa, 2006].

3 (Positive definite) kernels: [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].

4 Divergence measures (KL, Renyi, Tsallis): [Poczos et al., 2011].

5 Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].
Theoretical guarantee?
MIL dates back to [Haussler, 1999, Gartner et al., 2002].
Sensible methods in regression require density estimation [Poczos et al., 2013, Oliva et al., 2014, Reddi and Poczos, 2014] + assumptions:

1 compact Euclidean domain,
2 output = R.
Kernel, RKHS
k : D × D → R is a kernel on D if

∃ ϕ : D → H (Hilbert space) feature map such that
k(a, b) = ⟨ϕ(a), ϕ(b)⟩_H  (∀a, b ∈ D).

Kernel examples on D = R^d (p > 0, θ > 0):

k(a, b) = (⟨a, b⟩ + θ)^p: polynomial,
k(a, b) = e^{−‖a−b‖₂²/(2θ²)}: Gaussian,
k(a, b) = e^{−θ‖a−b‖₂}: Laplacian.

In the (unique) RKHS H = H(k): ϕ(u) = k(·, u).
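The three example kernels are easy to sketch in code. A minimal numpy illustration (function names and default hyperparameters are mine, not from the talk):

```python
import numpy as np

def polynomial_kernel(a, b, theta=1.0, p=2):
    # k(a, b) = (<a, b> + theta)^p
    return (np.dot(a, b) + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * theta ** 2))

def laplacian_kernel(a, b, theta=1.0):
    # k(a, b) = exp(-theta ||a - b||_2)
    return np.exp(-theta * np.linalg.norm(a - b))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(gaussian_kernel(a, a))  # 1.0, since k(a, a) = e^0
```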
Kernel: example domains (D)
Euclidean space: D = R^d.

Graphs, texts, time series, dynamical systems.

Distributions.
Universal kernel
Def.: a kernel k : D × D → R is universal if

it is continuous, and
H(k) is dense in (C(D), ‖·‖_∞).

Examples on compact subsets of R^d:

k(a, b) = e^{−‖a−b‖₂²/(2σ²)}  (σ > 0),
k(a, b) = e^{β⟨a,b⟩}  (β > 0), or more generally
k(a, b) = f(⟨a, b⟩), where f(x) = Σ_{n=0}^∞ a_n x^n  (∀ a_n > 0).
Problem formulation (Y = R)

Given:

labelled bags z = {(x̂_i, y_i)}_{i=1}^l,
i-th bag: x̂_i = {x_{i,1}, . . . , x_{i,N}}, drawn i.i.d. from x_i ∈ M_1^+(D); y_i ∈ R.

Task: find an M_1^+(D) → R mapping based on z.

Construction: distribution embedding (μ_x) + ridge regression

M_1^+(D) --μ = μ(k)--> X ⊆ H = H(k) --f ∈ H = H(K)--> R.

Our goal: a risk bound compared to the regression function

f_ρ(μ_x) = ∫_R y dρ(y|μ_x).
Goal in details
Contribution: analysis of the excess risk

E(f_z^λ, f_ρ) = R[f_z^λ] − R[f_ρ] ≤ g(l, N, λ) → 0, and rates, where

R[f] = E_{(x,y)} |f(μ_x) − y|²  (expected risk),

f_z^λ = argmin_{f ∈ H} (1/l) Σ_{i=1}^l |f(μ_{x̂_i}) − y_i|² + λ‖f‖²_H  (λ > 0).

We consider two settings:

1 well-specified case: f_ρ ∈ H,
2 misspecified case: f_ρ ∈ L²_{ρ_X} \ H.
Step-1: mean embedding
k : D × D → R kernel; canonical feature map: ϕ(u) = k(·, u).
Mean embedding of a distribution x, resp. of the empirical x̂_i ∈ M_1^+(D):

μ_x = ∫_D k(·, u) dx(u) ∈ H(k),

μ_{x̂_i} = ∫_D k(·, u) dx̂_i(u) = (1/N) Σ_{n=1}^N k(·, x_{i,n}).

Linear K ⇒ set kernel:

K(μ_{x̂_i}, μ_{x̂_j}) = ⟨μ_{x̂_i}, μ_{x̂_j}⟩_H = (1/N²) Σ_{n,m=1}^N k(x_{i,n}, x_{j,m}).
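The empirical mean embedding and the induced set kernel are a few lines of numpy; the sketch below uses a Gaussian base kernel as an illustrative choice (all names are mine):

```python
import numpy as np

def gauss_k(u, v, sigma=1.0):
    # Base kernel k on D = R^d (Gaussian; an illustrative choice).
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def set_kernel(bag_i, bag_j, base_k=gauss_k):
    # K(mu_i, mu_j) = <mu_i, mu_j>_H: the average of k over all cross-pairs,
    # i.e. (1 / (N_i N_j)) * sum_{n,m} k(x_{i,n}, x_{j,m}).
    return float(np.mean([base_k(u, v) for u in bag_i for v in bag_j]))

rng = np.random.default_rng(0)
bag_a = rng.normal(size=(50, 2))           # bag sampled from one distribution
bag_b = rng.normal(loc=3.0, size=(50, 2))  # bag from a well-separated distribution
print(set_kernel(bag_a, bag_a) > set_kernel(bag_a, bag_b))  # True: similar bags score higher
```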
Step-2: ridge regression (analytical solution)
Given:

training sample: z,
test distribution: t.

Prediction:

(f_z^λ ∘ μ)(t) = k(K + lλI_l)^{−1}[y_1; . . . ; y_l],   (1)

K = [K(μ_{x̂_i}, μ_{x̂_j})] ∈ R^{l×l},   (2)

k = [K(μ_{x̂_1}, μ_t), . . . , K(μ_{x̂_l}, μ_t)] ∈ R^{1×l}.   (3)
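Equations (1)–(3) translate directly into code. A sketch with a linear set kernel (all names illustrative); the labels equal the means of the generating distributions, so the prediction should land near the test bag's mean:

```python
import numpy as np

def linear_set_kernel(bag_i, bag_j):
    # Linear K with linear base kernel: mean of all cross inner products.
    return float(np.mean(np.asarray(bag_i) @ np.asarray(bag_j).T))

def merr_predict(bags, y, test_bag, lam, set_K=linear_set_kernel):
    # (f_z^lam o mu)(t) = k (K + l*lam*I_l)^{-1} [y_1; ...; y_l]   -- Eq. (1)
    l = len(bags)
    K = np.array([[set_K(bi, bj) for bj in bags] for bi in bags])  # Eq. (2)
    k = np.array([set_K(test_bag, bi) for bi in bags])             # Eq. (3)
    alpha = np.linalg.solve(K + l * lam * np.eye(l), np.asarray(y, dtype=float))
    return float(k @ alpha)

rng = np.random.default_rng(1)
bags = [rng.normal(loc=m, scale=0.1, size=(100, 1)) for m in (0.0, 1.0, 2.0, 3.0)]
y = [0.0, 1.0, 2.0, 3.0]  # label = mean of each bag's distribution
test_bag = rng.normal(loc=1.5, scale=0.1, size=(100, 1))
pred = merr_predict(bags, y, test_bag, lam=1e-4)
print(pred)  # close to 1.5, the mean of the test distribution
```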
Blanket assumptions
D: separable topological domain.

k: bounded (sup_{u∈D} k(u, u) ≤ B_k ∈ (0, ∞)) and continuous.

K: bounded; Hölder continuous: ∃ L > 0, h ∈ (0, 1] such that

‖K(·, μ_a) − K(·, μ_b)‖_H ≤ L ‖μ_a − μ_b‖_H^h.

y: bounded.

X = μ(M_1^+(D)) ∈ B(H).
Performance guarantees (in human-readable format)
If in addition

1 well-specified case: f_ρ is 'c-smooth' with a 'b-decaying covariance operator' and l ≥ λ^{−1/b−1}, then

E(f_z^λ, f_ρ) ≤ log^h(l)/(N^h λ^3) + λ^c + 1/(l²λ) + 1/(lλ^{1/b}).   (4)

2 misspecified case: f_ρ is 's-smooth', L²_{ρ_X} is separable, and 1/λ² ≤ l, then

E(f_z^λ, f_ρ) ≤ log^{h/2}(l)/(N^{h/2} λ^{3/2}) + 1/(√l λ) + √(λ^{min(1,s)})/(λ√l) + λ^{min(1,s)}.   (5)
Performance guarantee: example
Misspecified case: assume

s ≥ 1, h = 1 (K: Lipschitz),
λ chosen by balancing term (1) = term (3) in (5); l = N^a (a > 0),
t = lN: total number of samples processed.

Then

1 s = 1 ('most difficult' task): E(f_z^λ, f_ρ) ≈ t^{−0.25},
2 s → ∞ ('simplest' problem): E(f_z^λ, f_ρ) ≈ t^{−0.5}.
Notes on the assumptions: ∃ρ, X ∈ B(H)
k bounded, continuous ⇒ μ : (M_1^+(D), B(τ_w)) → (H, B(H)) is measurable.

μ measurable, X ∈ B(H) ⇒ ρ on X × Y is well-defined.

If (*) := D is a compact metric space and k is universal, then

μ is continuous, and
X ∈ B(H).
Notes on the assumptions: Holder K examples
In case of (*):

| kernel | K(μ_a, μ_b) | Hölder exponent |
|---|---|---|
| K_G | e^{−‖μ_a−μ_b‖²_H/(2θ²)} | h = 1 |
| K_e | e^{−‖μ_a−μ_b‖_H/(2θ²)} | h = 1/2 |
| K_C | (1 + ‖μ_a−μ_b‖²_H/θ²)^{−1} | h = 1 |
| K_t | (1 + ‖μ_a−μ_b‖^θ_H)^{−1} | h = θ/2 (θ ≤ 2) |
| K_i | (‖μ_a−μ_b‖²_H + θ²)^{−1/2} | h = 1 |

All are functions of ‖μ_a − μ_b‖_H ⇒ computation: similar to the set kernel.
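Since these kernels depend only on ‖μ_a − μ_b‖_H, they can be evaluated from three set-kernel values via ‖μ_a − μ_b‖²_H = ⟨μ_a, μ_a⟩ + ⟨μ_b, μ_b⟩ − 2⟨μ_a, μ_b⟩. A numpy sketch with a linear base kernel (names and parameter defaults are illustrative):

```python
import numpy as np

def emb_dist_sq(bag_a, bag_b, base_k):
    # ||mu_a - mu_b||_H^2 = G(a,a) + G(b,b) - 2 G(a,b), with G the set kernel.
    def G(x, y):
        return float(np.mean([base_k(u, v) for u in x for v in y]))
    return G(bag_a, bag_a) + G(bag_b, bag_b) - 2.0 * G(bag_a, bag_b)

def K_G(d2, theta=1.0):  # Gaussian on embeddings (h = 1)
    return float(np.exp(-d2 / (2 * theta ** 2)))

def K_C(d2, theta=1.0):  # Cauchy-type (h = 1)
    return 1.0 / (1.0 + d2 / theta ** 2)

def K_i(d2, theta=1.0):  # inverse multiquadric (h = 1)
    return 1.0 / np.sqrt(d2 + theta ** 2)

lin = lambda u, v: float(np.dot(u, v))  # base kernel k(u, v) = <u, v>
bag = np.array([[0.0, 0.0], [1.0, 1.0]])
print(K_G(emb_dist_sq(bag, bag, lin)))  # 1.0: identical bags sit at embedding distance 0
```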
Notes on the assumptions: misspecified case
L²_{ρ_X} is separable ⇔ the measure space with metric d(A, B) = ρ_X(A △ B) is separable [Thomson et al., 2008].
Vector-valued output: Y = separable Hilbert space

Objective function:

f_z^λ = argmin_{f ∈ H} (1/l) Σ_{i=1}^l ‖f(μ_{x̂_i}) − y_i‖²_Y + λ‖f‖²_H  (λ > 0),

where K(μ_a, μ_b) ∈ L(Y): vector-valued RKHS.
Vector-valued output: analytical solution
Analytical solution: prediction on a new test distribution (t):

(f_z^λ ∘ μ)(t) = k(K + lλI_l)^{−1}[y_1; . . . ; y_l],   (6)

K = [K(μ_{x̂_i}, μ_{x̂_j})] ∈ L(Y)^{l×l},   (7)

k = [K(μ_{x̂_1}, μ_t), . . . , K(μ_{x̂_l}, μ_t)] ∈ L(Y)^{1×l}.   (8)

Special cases: Y = R ⇒ L(Y) = R; Y = R^d ⇒ L(Y) = R^{d×d}.
Vector-valued output: K assumptions
Boundedness and Hölder continuity of K:

1 Boundedness:

‖K_{μ_a}‖²_{HS} = Tr(K*_{μ_a} K_{μ_a}) ≤ B_K ∈ (0, ∞)  (∀μ_a ∈ X).

2 Hölder continuity: ∃ L > 0, h ∈ (0, 1] such that

‖K_{μ_a} − K_{μ_b}‖_{L(Y,H)} ≤ L ‖μ_a − μ_b‖_H^h  (∀(μ_a, μ_b) ∈ X × X).
Demo-1 (Y = R): Supervised entropy learning
Problem: learn the entropy of the first coordinate of (rotated) Gaussians.

Baseline: kernel-smoothing-based distribution regression (applying density estimation) =: DFDR.

Performance: RMSE boxplot over 25 random experiments.

Experience: MERR is more precise than the only previously theoretically justified method, by avoiding density estimation.
Supervised entropy learning: plots
[Figure: estimated entropy vs. rotation angle (β) for the true value, MERR and DFDR; RMSE boxplot over the runs. RMSE: MERR = 0.75, DFDR = 2.02.]
Demo-2 (Y = R): Aerosol prediction from satellite images
Performance: 100 × RMSE.

Baseline [mixture model (EM)]: 7.5–8.5 (±0.1–0.6).

Linear K: single: 7.91 (±1.61); ensemble: 7.86 (±1.71).

Nonlinear K: single: 7.90 (±1.63); ensemble: 7.81 (±1.64).
Summary
Problem: distribution regression.
Literature: large number of heuristics.
Contribution:

a simple ridge solution is consistent;
in particular, the set kernel is consistent (a 15-year-old open question).

Simplified version [Y = R, f_ρ ∈ H]: accepted at AISTATS-2015 (oral).
Summary – continued
MERR code (ITE toolbox) and the complete analysis (submitted to JMLR):

https://bitbucket.org/szzoli/ite/
http://arxiv.org/abs/1411.2066

Closely related research directions (Bayesian world):

∞-dimensional exponential family fitting,
just-in-time kernel EP: submitted to UAI-2015.
Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable
Foundation, and by NSF grants IIS1247658 and IIS1250350. The work
was carried out while Bharath K. Sriperumbudur was a research fellow in
the Statistical Laboratory, Department of Pure Mathematics and
Mathematical Statistics at the University of Cambridge, UK.
Appendix: contents
Topological definitions, separability.
Exact prior definitions.
Vector-valued RKHS.
Hausdorff metric.
Weak topology on M_1^+(D).
Topological space, open sets
Given: a set D ≠ ∅.

τ ⊆ 2^D is called a topology on D if:

1 ∅ ∈ τ, D ∈ τ.
2 Finite intersection: O_1 ∈ τ, O_2 ∈ τ ⇒ O_1 ∩ O_2 ∈ τ.
3 Arbitrary union: O_i ∈ τ (i ∈ I) ⇒ ∪_{i∈I} O_i ∈ τ.

Then (D, τ) is called a topological space, and the O ∈ τ are the open sets.
Closed-, compact set, closure, dense subset, separability
Given: (D, τ). A ⊆ D is

closed if D \ A ∈ τ (i.e., its complement is open),

compact if for any family (O_i)_{i∈I} of open sets with A ⊆ ∪_{i∈I} O_i, there exist i_1, . . . , i_n ∈ I with A ⊆ ∪_{j=1}^n O_{i_j}.

Closure of A ⊆ D:

Ā := ∩_{A ⊆ C closed in D} C.   (9)

A ⊆ D is dense if Ā = D.

(D, τ) is separable if there is a countable, dense subset of D. Counterexample: ℓ^∞ / L^∞.
Prior (well-specified case): ρ ∈ P(b, c)
Let T : H → H be the covariance operator

T = ∫_X K(·, μ_a) K*(·, μ_a) dρ_X(μ_a)

with eigenvalues t_n (n = 1, 2, . . .).

Assumption: ρ ∈ P(b, c) = the set of distributions on X × Y with

α ≤ n^b t_n ≤ β  (∀n ≥ 1; α > 0, β > 0),

∃ g ∈ H such that f_ρ = T^{(c−1)/2} g with ‖g‖²_H ≤ R  (R > 0),

where b ∈ (1, ∞), c ∈ [1, 2].

Intuition: 1/b is the effective input dimension, c the smoothness of f_ρ.
Prior: misspecified case
Let T be defined via

S_K : L²_{ρ_X} → H,  (S_K g)(μ_u) = ∫_X K(μ_u, μ_t) g(μ_t) dρ_X(μ_t),

S*_K : H → L²_{ρ_X},

T = S*_K S_K : L²_{ρ_X} → L²_{ρ_X}.

Our range-space assumption on ρ: f_ρ ∈ Im(T^s) for some s ≥ 0.
Vector-valued RKHS: H = H(K)

Definition: a Hilbert space H ⊆ Y^X of functions is an RKHS if

A_{μ_x, y} : f ∈ H ↦ ⟨y, f(μ_x)⟩_Y ∈ R   (10)

is continuous for every μ_x ∈ X, y ∈ Y.

= The evaluation functional is continuous in every direction.
Vector-valued RKHS: H = H(K) – continued

Riesz representation theorem ⇒ ∃ K(μ_x|y) ∈ H:

⟨y, f(μ_x)⟩_Y = ⟨K(μ_x|y), f⟩_H  (∀f ∈ H).   (11)

K(μ_x|y) is linear and bounded in y ⇒ K(μ_x|y) = K_{μ_x}(y) with K_{μ_x} ∈ L(Y, H).

K construction:

K(μ_x, μ_t)(y) = (K_{μ_t} y)(μ_x)  (∀μ_x, μ_t ∈ X), i.e.,

K(·, μ_t)(y) = K_{μ_t} y,   (12)

H(K) = closure of span{K_{μ_t} y : μ_t ∈ X, y ∈ Y}.   (13)

In short: K(μ_x, μ_t) ∈ L(Y) generalizes k(u, v) ∈ R.
Vector-valued RKHS – examples: Y = Rd
1 K_i : X × X → R kernels (i = 1, . . . , d). Diagonal kernel:

K(μ_a, μ_b) = diag(K_1(μ_a, μ_b), . . . , K_d(μ_a, μ_b)).   (14)

2 Combination of diagonal kernels D_j [D_j(μ_a, μ_b) ∈ R^{r×r}, A_j ∈ R^{r×d}]:

K(μ_a, μ_b) = Σ_{j=1}^m A*_j D_j(μ_a, μ_b) A_j.   (15)
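A toy instance of the diagonal construction (14), with plain vectors standing in for the embeddings μ_a, μ_b; all names here are illustrative:

```python
import numpy as np

def diag_operator_kernel(scalar_kernels, mu_a, mu_b):
    # Eq. (14): K(mu_a, mu_b) = diag(K_1(mu_a, mu_b), ..., K_d(mu_a, mu_b)) in L(R^d).
    return np.diag([K(mu_a, mu_b) for K in scalar_kernels])

ks = [
    lambda a, b: float(np.exp(-np.sum((a - b) ** 2))),  # K_1: Gaussian
    lambda a, b: float(np.dot(a, b)),                   # K_2: linear
]
mu_a = np.array([1.0, 0.0])
K_aa = diag_operator_kernel(ks, mu_a, mu_a)
print(K_aa)  # diag(1, 1): both scalar kernels evaluate to 1 at (mu_a, mu_a)
```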
Existing methods: set metric based algorithms
Hausdorff metric [Edgar, 1995]:

d_H(X, Y) = max{ sup_{x∈X} inf_{y∈Y} d(x, y), sup_{y∈Y} inf_{x∈X} d(x, y) }.   (16)

A metric on the compact subsets of a metric space [(M, d); X, Y ⊆ M].
'Slight' problem: highly sensitive to outliers.
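A minimal numpy sketch of (16) for finite point sets with the Euclidean ground metric, which also shows the outlier sensitivity (names illustrative):

```python
import numpy as np

def hausdorff(X, Y):
    # d_H(X, Y) = max{ sup_x inf_y d(x, y), sup_y inf_x d(x, y) }, Eq. (16).
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise distances
    return float(max(D.min(axis=1).max(), D.min(axis=0).max()))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])  # Y = X plus one outlier point
print(hausdorff(X, Y))  # 9.0: the single outlier dominates the set distance
```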
Weak topology on M+1 (D)
Def.: the weakest topology τ_w on M_1^+(D) such that the mapping

L_h : (M_1^+(D), τ_w) → R,  L_h(x) = ∫_D h(u) dx(u)

is continuous for all h ∈ C_b(D), where

C_b(D) = {bounded, continuous functions (D, τ) → R}.
Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gartner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.
Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J. B., Neiswanger, W., Poczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Poczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Poczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and Poczos, B. (2014). k-NN regression on functional data with incomplete observations. In Conference on Uncertainty in Artificial Intelligence (UAI).

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Renyi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.
Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.