Information Geometry
Shinto Eguchi
The Institute of Statistical Mathematics, Japan
Email: [email protected]
URL: http://www.ism.ac.jp/~eguchi/
2013 ISM Summer Graduate School, ISM Seminar Room 1, September 26, 14:40-17:50
ISM Summer School 2013
Outline
Information geometry
 - historical remarks
 - treasure example
Information divergence class
 - robust statistical methods
 - robust ICA
 - robust kernel PCA
2
Historical comment
Information geometry lies at the crossroads of geometry, statistics, learning, and quantum theory.
R.A. Fisher (1922) → C.R. Rao (1945) → B. Efron (1975 paper, with a comment by A.P. Dawid) → S. Amari (1982 paper) → 1984 workshop (RSS 150).
3
Dual Riemannian Geometry
Dual Riemannian geometry gives a reformulation of Fisher's foundation.
Cf. Einstein field equations (1916).
Information geometry aims to geometrize the dualistic structure between modeling and estimation.
Estimation is projection of data onto the model: "estimation is an action by projection, and the model is an object to be projected."
The interplay of action and object is elucidated by a geometric structure.
Cf. Erlangen program (Klein, 1872), http://en.wikipedia.org/wiki/Erlangen_program
4
2 × 2 table
The space of all 2×2 tables is associated with a regular tetrahedron.
The space of all independent 2×2 tables is associated with a ruled surface in the regular tetrahedron.
We know the ruled surface is exponential geodesic, but does anyone know whether the ruled surface is minimal?
5
Regular tetrahedron
[Figure: regular tetrahedron with vertices P, Q, R, S given by the degenerate tables (0 1 / 0 0), (0 0 / 0 1), (1 0 / 0 0), (0 0 / 1 0); the barycenter is the uniform table (1/4 1/4 / 1/4 1/4); edge points have the form (0 p / 0 1−p) and (1−p 0 / p 0).]
6
Regular tetrahedron
[Figure: the same tetrahedron with vertices P, Q, R, S, showing mixtures of edge tables with weights 1/3 and 2/3, e.g. (1/3)·(1/3 0 / 2/3 0) + (2/3)·(0 1/3 / 0 2/3) and (1/3)·(1/3 2/3 / 0 0) + (2/3)·(0 0 / 1/3 2/3).]
7
Ruled surface
[Figure: two views of the regular tetrahedron P, Q, R, S with the ruled surface of independent tables inside.]
8
Observed table with cell counts and margins:

        A    B
   C    x    y    e
   D    z    w    f
        g    h    n        (e = x+y, f = z+w, g = x+z, h = y+w, n = x+y+z+w)

Independence table with cell probabilities:

        A           B
   C    pq          p(1−q)        p
   D    (1−p)q      (1−p)(1−q)    1−p
        q           1−q           1

The χ² statistic is χ² = n(xw − yz)²/(efgh). For the independence table the cross-product difference vanishes,
   n { pq·(1−p)(1−q) − p(1−q)·(1−p)q }² = 0,
so the independence surface consists of the tables with odds ratio 1, and it is e-geodesical.
9
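A quick numerical illustration of this fact (a sketch added here, not from the slides; NumPy assumed): the e-geodesic, i.e. the normalized geometric mean p^(1−t) q^t, between two independent tables stays on the ruled surface of odds ratio 1.

```python
import numpy as np

def e_geodesic(p, q, t):
    """Normalized geometric mean p^(1-t) * q^t: the e-geodesic in the simplex."""
    r = p**(1 - t) * q**t
    return r / r.sum()

def odds_ratio(table):
    """Odds ratio pi11*pi22 / (pi12*pi21) of a table flattened as [p11, p12, p21, p22]."""
    return table[0] * table[3] / (table[1] * table[2])

# two independent 2x2 tables: outer products of marginal distributions
p = np.outer([0.3, 0.7], [0.6, 0.4]).ravel()
q = np.outer([0.8, 0.2], [0.1, 0.9]).ravel()

for t in np.linspace(0.0, 1.0, 6):
    print(f"t = {t:.1f}   odds ratio = {odds_ratio(e_geodesic(p, q, t)):.6f}")
# every point on the e-geodesic has odds ratio 1, i.e. it stays on the ruled surface
```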
The simplex S₃ of all 2×2 probability tables is mapped one-to-one onto ℝ³ by the log-ratio coordinates
   θ(π) = ( log(π₁₁/π₂₂), log(π₁₂/π₂₂), log(π₂₁/π₂₂) ),  where π₁₁ + π₁₂ + π₂₁ + π₂₂ = 1.
On the independence surface, π = (pq, p(1−q), (1−p)q, (1−p)(1−q)) with 0 < p, q < 1, these coordinates become
   ( log{pq/((1−p)(1−q))}, log{p/(1−p)}, log{q/(1−q)} ).
[Figure: 3D plots of the surface in the θ-coordinates.]
10
Two parametrizations
The independence surface has two natural parametrizations in (p, q):
   m-parametrization: (p, q) ↦ ( pq, p(1−q), (1−p)q, (1−p)(1−q) )
   e-parametrization: (p, q) ↦ ( log{pq/((1−p)(1−q))}, log{p/(1−p)}, log{q/(1−q)} )
[Figure: the surface plotted in the two coordinate systems.]
11
Kullback-Leibler Divergence
Let p(x) and q(x) be probability density functions. Then the Kullback-Leibler divergence is defined by
   D_KL(p, q) = ∫ p(x) log{ p(x)/q(x) } dx.
[Figure: three densities p, q, r.]
12
Two one-parameter families
Let P be the space of all pdfs on a data space. Two one-parameter families (curves) connect p and q in P:
   m-geodesic: p_t(x) = (1−t) p(x) + t q(x)
   e-geodesic: p_t(x) = c_t p(x)^(1−t) q(x)^t,  with normalizing constant c_t.
[Figure: p, q, r with the m-geodesic and the e-geodesic.]
13
Pythagoras theorem, Amari-Nagaoka (1982)
If the m-geodesic connecting p and q is orthogonal (in the Fisher metric) at q to the e-geodesic connecting q and r, then
   D_KL(p, r) = D_KL(p, q) + D_KL(q, r).
14
Proof
D_KL(p, q) + D_KL(q, r) − D_KL(p, r) = ∫ ( p(x) − q(x) ) ( log r(x) − log q(x) ) dx.
The integrand pairs the m-tangent vector p − q of the m-geodesic at q with the e-tangent vector log r − log q of the e-geodesic at q, so orthogonality makes the integral vanish. ∎
15
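A minimal numerical check of the theorem (an added sketch; NumPy assumed): project an arbitrary 2×2 table p onto the e-flat independence model by matching marginals (giving q), pick any other independent table r, and compare the two sides.

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence."""
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))            # arbitrary 2x2 table [p11, p12, p21, p22]

# q = projection of p onto the independence model: product of the marginals of p
row = np.array([p[0] + p[1], p[2] + p[3]])
col = np.array([p[0] + p[2], p[1] + p[3]])
q = np.outer(row, col).ravel()

# r = any other point of the (e-flat) independence model
r = np.outer([0.3, 0.7], [0.9, 0.1]).ravel()

print(kl(p, r), kl(p, q) + kl(q, r))     # the two numbers agree
```

The agreement is exact here: since q shares the marginals of p and log(q/r) is additive across rows and columns on the independence model, the cross term ∫(p − q) log(q/r) vanishes identically.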
ABC in differential geometry
Riemannian metric: g defines an inner product on any tangent space of a manifold M,
   g : 𝔛(M) × 𝔛(M) → F(M),  (u, v) ↦ g(u, v).
Geodesic: a minimal arc between x and y,
   c(x, y) = argmin ∫₀¹ g(ẋ(t), ẋ(t)) dt  over { x(t) : x(0) = x, x(1) = y }.
Linear connection: ∇ defines parallelism along a vector field,
   ∇ : 𝔛(M) × 𝔛(M) → 𝔛(M),
   (1) ∇_{fX} Y = f ∇_X Y,
   (2) ∇_X (fY) = f ∇_X Y + (Xf) Y   (f ∈ F(M); X, Y ∈ 𝔛(M)).
Componentwise, (∇_X Y)^k = X(Y^k) + Σ_{i,j} Γ_{ij}^k X^i Y^j. Cf. slide 41.
16
Geodesic
A one-parameter family (curve) C = {θ(t) : t} is called a geodesic with respect to a linear connection {Γ_{ij}^k(θ) : i, j, k = 1, …, p} if
   θ̈^k(t) + Σ_{i=1}^p Σ_{j=1}^p Γ_{ij}^k(θ(t)) θ̇^i(t) θ̇^j(t) = 0   (k = 1, …, p),
that is, the acceleration of the curve vanishes up to the connection terms (cf. Newton's law of inertia).
Remark 1: If Γ_{ij}^k(θ) = 0 (i, j, k = 1, …, p), any geodesic is a line.
Remark 2: This property is not invariant under reparametrization. Expressing the geodesic C by a transform from parameter θ to ω gives
   ω̈^a(t) + Σ_{b,c=1}^p Γ̃_{bc}^a(ω(t)) ω̇^b(t) ω̇^c(t) = 0   (a = 1, …, p),
with the transformed coefficients Γ̃ of the next slide.
17
Change rule on connections
Let us consider a transform from parameter θ to ω = ω(θ), with Jacobian B^a_i = ∂ω^a/∂θ^i and inverse B̄^i_a. The geodesic C is written as {ω(t) = ω(θ(t))}. Hence we have
   ω̇^a(t) = Σ_{i=1}^p B^a_i(θ(t)) θ̇^i(t),
   ω̈^a(t) = Σ_{i=1}^p B^a_i(θ(t)) θ̈^i(t) + Σ_{i,j} (∂B^a_i/∂θ^j)(θ(t)) θ̇^i(t) θ̇^j(t)   (a = 1, …, p),
so the geodesic equation holds in the ω-parameter with
   Γ̃_{bc}^a = Σ_{i,j,k=1}^p B̄^j_b B̄^k_c ( B^a_i Γ_{jk}^i − ∂B^a_k/∂θ^j ).
18
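A worked one-dimensional example of this rule (added for illustration, not on the original slide): start from the flat connection Γ ≡ 0 in θ, whose geodesics are straight lines, and reparametrize by ω = e^θ.

```latex
% Flat connection in \theta (p = 1): geodesics satisfy \ddot\theta(t) = 0.
% Reparametrize by \omega = e^{\theta}:
\dot\omega = \omega\,\dot\theta, \qquad
\ddot\omega = \omega\,\ddot\theta + \omega\,\dot\theta^{2}
\;\Longrightarrow\;
\ddot\omega - \frac{\dot\omega^{2}}{\omega} = 0 ,
% so in the \omega-coordinate the same straight lines obey the geodesic
% equation with the nonzero coefficient
\tilde\Gamma(\omega) = -\frac{1}{\omega} \neq 0 ,
% confirming that vanishing of the connection coefficients is not invariant
% under reparametrization.
```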
What are reference books?
- Foundations of Differential Geometry (Wiley Classics Library), Shoshichi Kobayashi and Katsumi Nomizu.
- Methods of Information Geometry, Shun-ichi Amari and Hiroshi Nagaoka, American Mathematical Society (2001).
- Differential-Geometrical Methods in Statistics, Shun-ichi Amari, Springer (1985).
19
Statistical model and information
Statistical model: M = { p(x, θ) : θ ∈ Θ ⊂ ℝ^p }
Score vector: s(x, θ) = (∂/∂θ) log p(x, θ)
Space of score vectors: T_θ = { αᵀ s(x, θ) : α ∈ ℝ^p }
Fisher information metric: g_θ(u, v) = E_θ{ u v }   (u, v ∈ T_θ)
Fisher information matrix: I_θ = E_θ{ s(x, θ) s(x, θ)ᵀ }
20
Score vector space
Note 1: For u = αᵀ s(x, θ) ∈ T_θ,
   E_θ(u) = ∫ αᵀ s(x, θ) p(x, θ) dx = αᵀ ∫ (∂/∂θ) p(x, θ) dx = 0,
and for u = αᵀ s(x, θ), v = βᵀ s(x, θ),
   g_θ(u, v) = ∫ αᵀ s(x, θ) s(x, θ)ᵀ β p(x, θ) dx = αᵀ I_θ β.
Note 2: Let
   S_θ = { t(x, θ) : E_θ(t(x, θ)) = 0, V_θ(t(x, θ)) < ∞ },
the space of all random variables with mean 0 and finite variance. S_θ is an infinite-dimensional vector space that includes T_θ; for example, for u = αᵀ s(x, θ) ∈ T_θ the centered powers u^k − E_θ{u^k} (k = 1, 2, …) all belong to S_θ. Cf. Tsiatis (2006).
21
Linear connections
With score components s(x, θ) = (s_i(x, θ))_{i=1,…,p} and s_{ij} = (∂²/∂θ_i∂θ_j) log p(x, θ):
e-connection: ᵉΓ_{ij}^k(θ) = Σ_{i'=1}^p g^{k i'}(θ) E_θ{ s_{ij} s_{i'} }
m-connection: ᵐΓ_{ij}^k(θ) = Σ_{i'=1}^p g^{k i'}(θ) E_θ{ (s_{ij} + s_i s_j) s_{i'} }
The e-connection makes exponential families flat (e-geodesics); the m-connection makes mixture families flat (m-geodesics).
22
Semiparametric theory
Semiparametric Theory and Missing Data (Springer Series in Statistics), Anastasios Tsiatis:
1 Introduction to Semiparametric Models
2 Hilbert Space for Random Vectors
3 The Geometry of Influence Functions
4 Semiparametric Models
5 Other Examples of Semiparametric Models
6 Models and Methods for Missing Data
7 Missing and Coarsening at Random for Semiparametric Models
8 The Nuisance Tangent Space and Its Orthogonal Complement
9 … 14.
23
Geodesical models
Definition
(i) A statistical model M is said to be e-geodesical if p, q ∈ M implies
   c_t p(x)^(1−t) q(x)^t ∈ M   (t ∈ (0,1)),  where 1/c_t = ∫ p(x)^(1−t) q(x)^t dx.
(ii) A statistical model M is said to be m-geodesical if p, q ∈ M implies
   (1−t) p(x) + t q(x) ∈ M   (t ∈ (0,1)).
Note: Let P be the space of all probability density functions. By definition P is e-geodesical and m-geodesical. However, the theoretical framework for P is not perfectly complete. Cf. Pistone and Sempi (1995, Ann. Statist.).
24
Two types of modeling
Let p₀(x) be a pdf and t(x) a p-dimensional statistic.
Exponential model: M⁽ᵉ⁾ = { p(x, θ) = p₀(x) exp{ θᵀ t(x) − κ(θ) } : θ ∈ Θ },  Θ = { θ ∈ ℝ^p : κ(θ) < ∞ }.
Matched mean model: M⁽ᵐ⁾ = { p : E_p{t(x)} = E_{p₀}{t(x)} }.
Let { p_i(x) : i = 0, 1, …, I } be a set of pdfs.
Exponential model: M⁽ᵉ⁾ = { p(x, θ) = p₀(x) exp{ Σ_{i=1}^I θ_i log(p_i(x)/p₀(x)) − κ(θ) } : θ ∈ Θ }.
Mixture model: M⁽ᵐ⁾ = { p(x, θ) = (1 − Σ_{i=1}^I θ_i) p₀(x) + Σ_{i=1}^I θ_i p_i(x) : θ ∈ Θ },
   Θ = { (θ₁, …, θ_I) : 0 < θ_i < 1 (i = 1, …, I), Σ_i θ_i < 1 }.
25
Statistical functional
A functional f(p) defined for pdfs p(x) is called a statistical functional (Hampel, 1974).
A statistical functional f(p) is said to be Fisher-consistent for a model M = { p(x, θ) : θ ∈ Θ } if
   (1) f(p) ∈ Θ, and
   (2) f(p_θ) = θ for all θ ∈ Θ.
Example. For the normal model
   M = { p(x, θ) = (2πσ²)^(−1/2) exp{ −(x−μ)²/(2σ²) } : θ = (μ, σ²) ∈ ℝ × ℝ₊ },
the functional
   f(p) = ( ∫ x p(x) dx,  ∫ x² p(x) dx − (∫ x p(x) dx)² )
is Fisher-consistent for θ = (μ, σ²).
26
Transversality
Let f(p) be a Fisher-consistent functional. Define
   L_θ(f) = { p : f(p) = θ },
which is called a leaf transverse to M. Then
   (1) L_θ(f) ∩ M = { p_θ }, and
   (2) ∪_θ L_θ(f) is a local neighborhood (of M).
[Figure: model M through p_θ and p_θ*, with transverse leaves L_θ(f) and L_θ*(f).]
27
Foliation structure
Statistical model: M = { p(x, θ) : θ ∈ Θ ⊂ ℝ^p }.
Foliation: P = ∪_θ L_θ(f).
Decomposition of tangent spaces: T_p(P) = T_p(M) ⊕ T_p(L_θ(f)), i.e. any t(x, θ) ∈ T_p(P) is written as
   t(x, θ) = u(x, θ) + v(x, θ)
with u(x, θ) ∈ T_p(M) and v(x, θ) ∈ T_p(L_θ(f)).
[Figure: leaf L_θ(f) crossing M at p_θ, with tangent vectors u and v.]
28
Transversality for MLE
Exponential model: M⁽ᵉ⁾ = { p(x, θ) = p₀(x) exp{ θᵀ t(x) − κ(θ) } : θ ∈ Θ },
where p₀(x) is a pdf and t(x) a p-dimensional statistic.
For the exponential model M⁽ᵉ⁾ the MLE functional
   f_ML(p) = argmax_θ E_p{ log p(x, θ) }
is written as
   f_ML(p) = θ such that E_p{t(x)} = E_{p(·,θ)}{t(x)},
so the leaf is
   L_θ(f_ML) = { p : E_p{t(x)} = E_{p(·,θ)}{t(x)} }.
Hence the foliation associated with f_ML consists of matched mean models.
29
Maximum likelihood foliation
[Figure: the exponential model M⁽ᵉ⁾ with leaves L_θ(f_ML) through p_θ; densities p₁, p₂, p₃ and r₁, r₂ lying on leaves.]
30
Estimating function
Statistical model: M = { p(x, θ) : θ ∈ Θ ⊂ ℝ^p }.
A p-variate function u(x, θ) is unbiased if
   E_{p(·,θ)}{ u(x, θ) } = 0  and  det E_{p(·,θ)}{ (∂/∂θ) u(x, θ) } ≠ 0   (∀θ ∈ Θ).
The statistical functional
   f(p) = the solution θ of E_p{ u(x, θ) } = 0
is Fisher-consistent, and the leaf transverse to M,
   L_θ(f) = { p : E_p{ u(x, θ) } = 0 },
is m-geodesical.
31
Minimum divergence geometry
Let D : M × M → ℝ be an information divergence on a statistical model M:
(i) D(p, q) ≧ 0 with equality if and only if p = q;
(ii) D is differentiable on M × M.
Then we get a Riemannian metric and dual connections on M (Eguchi, 1983, 1992):
   g⁽ᴰ⁾_{ij}(θ) = −(∂/∂θ_i)(∂/∂θ'_j) D(p_θ, p_θ') |_{θ'=θ},
   Γ⁽ᴰ⁾_{ij,k}(θ) = −(∂²/∂θ_i∂θ_j)(∂/∂θ'_k) D(p_θ, p_θ') |_{θ'=θ},
   Γ⁽ᴰ*⁾_{ij,k}(θ) = −(∂²/∂θ'_i∂θ'_j)(∂/∂θ_k) D(p_θ, p_θ') |_{θ'=θ}.
32
Remarks
g⁽ᴰ⁾ is a Riemannian metric, and Γ⁽ᴰ⁾, Γ⁽ᴰ*⁾ form a dual pair of affine connections with respect to g⁽ᴰ⁾. Cf. Slide 21.
33
U cross-entropy
Let U be a convex function with derivative u = U′ and inverse ξ = (U′)⁻¹.
U cross-entropy: C_U(p, q) = ∫ { U(ξ(q(x))) − p(x) ξ(q(x)) } dx
U entropy: H_U(p) = C_U(p, p)
U divergence: D_U(p, q) = C_U(p, q) − H_U(p) ≧ 0
34
Example of U divergence
KL divergence: U(t) = exp(t), ξ(s) = log s, giving D_U(p, q) = ∫ p(x) log{ p(x)/q(x) } dx.
Beta (power) divergence: U_β(t) = (1/(1+β)) (1 + βt)^((1+β)/β), giving
   D_β(p, q) = ∫ { p(x)(p(x)^β − q(x)^β)/β − (p(x)^(1+β) − q(x)^(1+β))/(1+β) } dx.
Note: β → 0 recovers the KL divergence.
35
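A small numerical sketch of these examples (added here; the discrete form of the beta divergence below is the standard one, stated as an assumption): as β → 0 the beta divergence approaches the KL divergence.

```python
import numpy as np

def beta_divergence(p, q, beta):
    """Discrete beta (power) divergence; beta = 0 is the KL limit."""
    if beta == 0:
        return np.sum(p * np.log(p / q))
    return np.sum(p * (p**beta - q**beta) / beta
                  - (p**(1 + beta) - q**(1 + beta)) / (1 + beta))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
for b in [0.5, 0.1, 0.01, 0.001, 0]:
    print(b, beta_divergence(p, q, b))   # values approach the KL divergence as beta -> 0
```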
Geometric formula with D_U
The U-divergence D_U induces, by the minimum divergence construction above, a Riemannian metric and a pair of dual connections on M; the corresponding dual geodesics are the mixture geodesic and the U-geodesic of the next slide.
36
Triangle with D_U
If the mixture geodesic connecting p and q meets the U-geodesic connecting q and r orthogonally at q, then
   D_U(p, r) = D_U(p, q) + D_U(q, r).
[Figure: triangle of p, q, r with the mixture geodesic and the U-geodesic.]
37
Information divergence class and robust statistical methods
38
Light and shadow of MLE
Light:
1. Invariance under data transformations
2. Asymptotic efficiency (sufficiency and efficiency)
Shadow:
1. Non-robustness
2. Over-fitting (log-likelihood on an exponential family)
The likelihood method is generated by the pair (log, exp); the U-method replaces it by a general convex pair (ξ, u).
39
Max U-entropy distribution
Let us fix a statistic t(x). Equal mean space: { p : E_p{t(x)} = μ }.
Euler-Lagrange first variation of the U-entropy H_U under this constraint gives the maximizer
   p_θ(x) = u( θᵀ t(x) − κ_θ ),
the U-model.
40
U-estimate
Let p(x) be a data density function with statistical model M = { p(x, θ) }.
U-loss function: L_U(θ) = ∫ U(ξ(p(x, θ))) dx − ∫ ξ(p(x, θ)) p(x) dx
U-empirical loss function: L̄_U(θ) = ∫ U(ξ(p(x, θ))) dx − (1/n) Σ_i ξ(p(x_i, θ))
U-estimate: θ̂_U = argmin_θ L̄_U(θ)
U-estimating function: the gradient equation ∂L̄_U(θ)/∂θ = 0
41
U-minimax
Equal mean space: let us fix a statistic t(x) and consider { p : E_p{t(x)} = μ }; the max U-entropy distribution and the U-estimate form a minimax pair on this space.
42
U-estimator for U-model
U-model: M_U = { p(x, θ) = u( θᵀ t(x) − κ_θ ) }.
U-estimator of the mean parameter μ = E_θ{t(x)}:
We observe that the U-estimating equation on the U-model reduces to matching the mean of t(x), which implies
   μ̂ = (1/n) Σ_i t(x_i).
Furthermore, the U-estimator on the U-model is the moment estimator.
43
Influence function
Statistical functional:
   T_U(G) = argmin_θ { ∫ U(ξ(p(x, θ))) dx − ∫ ξ(p(x, θ)) dG(x) }
Influence function, with G_ε = (1−ε) F + ε δ_x:
   IF(x, T_U) = (∂/∂ε) T_U(G_ε) |_{ε=0}
   IF(x, T_U) = J_U(θ)⁻¹ [ w(x, θ) S(x, θ) − E{ w(x, θ) S(x, θ) } ]
Gross error sensitivity:
   GES(T_U) = sup_x ‖ IF(x, T_U) ‖
44
Efficiency
Asymptotic variance:
   √n (θ̂_U − θ) →_D N( 0, J_U(θ)⁻¹ H_U(θ) J_U(θ)⁻¹ ),
where
   J_U(θ) = E{ w(X, θ) S(X, θ) S(X, θ)ᵀ },
   H_U(θ) = Var( w(X, θ) S(X, θ) ).
Information inequality:
   J_U(θ)⁻¹ H_U(θ) J_U(θ)⁻¹ ≧ I(θ)⁻¹
(equality holds iff U(x) = exp(x)).
45
Normal mean
[Figure: influence functions for the normal mean. Left: β-power estimates with β = 0, 0.01, 0.05, 0.075, 0.1, 0.125. Right: η-sigmoid estimates with η = 0, 0.015, 0.15, 0.4, 0.8, 2.5. The case β = 0 (η = 0), i.e. U(t) = exp(t), is the MLE, whose influence function is unbounded.]
46
Gross Error Sensitivity

   β-power estimates              η-sigmoid estimates
   β       efficiency  GES        η       efficiency  GES
   0       1           ∞          0       1           ∞
   0.01    0.97        6.16       0.015   0.972       1.9
   0.05    0.861       2.92       0.15    0.873       1.04
   0.075   0.799       2.47       0.4     0.802       0.678
   0.1     0.742       2.21       0.8     0.753       0.455
   0.125   0.689       2.04       2.5     0.694       0.197
47
Multivariate normal
pdf:
   f(y, μ, Σ) = (2π)^(−p/2) (det Σ)^(−1/2) exp{ −(y−μ)ᵀ Σ⁻¹ (y−μ)/2 }
Estimating equations, with weight w(x, θ) determined by U and θ = (μ, Σ):
   (1/n) Σ_i w(x_i, θ) (x_i − μ) = 0,
   (1/n) Σ_i w(x_i, θ) { (x_i − μ)(x_i − μ)ᵀ − Σ } = c_U(θ) Σ.
48
U-algorithm
Iteratively reweighted mean and covariance: θ_k = (μ_k, Σ_k) is updated by
   μ_{k+1} = Σ_i w(x_i, θ_k) x_i / Σ_i w(x_i, θ_k),
   Σ_{k+1} = Σ_i w(x_i, θ_k) (x_i − μ_k)(x_i − μ_k)ᵀ / c_U(θ_k).
Under a mild condition,
   L_U(θ_{k+1}) ≦ L_U(θ_k)   (k = 1, 2, …).
49
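A minimal sketch of the U-algorithm for the multivariate normal with β-power weights w(x, θ) ∝ f(x, θ)^β (added here; the exact normalizing constant c_U is model-dependent, and the (1+β) covariance factor below is the usual correction for the Gaussian case, stated as an assumption):

```python
import numpy as np

def u_algorithm(X, beta=0.1, iters=50):
    """Iteratively reweighted mean/covariance with beta-power weights.
    Simplified sketch: the slide's constant c_U is replaced by the
    (1 + beta) factor that makes the Gaussian model a fixed point."""
    mu, Sigma = X.mean(axis=0), np.cov(X.T)
    for _ in range(iters):
        d = X - mu
        m = np.linalg.solve(Sigma, d.T).T                        # Sigma^{-1}(x_i - mu)
        w = np.exp(-0.5 * beta * np.einsum('ij,ij->i', d, m))    # ∝ f(x; mu, Sigma)^beta
        w /= w.sum()
        mu = w @ X                                               # reweighted mean
        d = X - mu
        Sigma = (1 + beta) * (d * w[:, None]).T @ d              # reweighted covariance
    return mu, Sigma

# toy data: 95% N(0, I) plus 5% outliers near (8, 8)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(95, 2)), rng.normal(loc=8.0, size=(5, 2))])
print(u_algorithm(X, beta=0.2))   # close to mean 0 and identity covariance
```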
Simulation
ε-contamination: G(ε) = (1 − ε) N(μ₀, Σ₀) + ε N(μ₁, Σ₁).
KL error: D_KL(θ₀, θ̂) with the MLE θ̂.
[Table: two contamination settings G⁽¹⁾ and G⁽²⁾ and the KL errors of the MLE at ε = 0.05.]
50
β-power estimates vs. η-sigmoid estimates

Under G⁽¹⁾ (KL error):
   β:  0 → 39.24    0.01 → 35.30     0.05 → 20.93     0.10 → 8.91     0.20 → 12.40    0.30 → 31.64
   η:  0 → 39.24    0.0001 → 23.04   0.00025 → 6.70   0.0005 → 4.46   0.00075 → 4.64  0.001 → 6.04

Under G⁽²⁾ (KL error):
   β:  0 → 16.5     0.01 → 13.91     0.05 → 6.65      0.10 → 5.14     0.20 → 12.20    0.30 → 29.68
   η:  0 → 16.5     0.0005 → 3.36    0.00075 → 3.19   0.001 → 3.9     0.002 → 3.04    0.003 → 3.07
51
Plot of MLEs (no outliers)
[Figure: scatter of estimates of (σ₁₁, σ₁₂, σ₂₂) over 100 replications of samples of size 100 under the normal distribution: η-estimate (η = 0.0025), β-estimate (β = 0.1), MLE, and the true value (1, 0, 1).]
52
Plot of MLEs with outliers
[Figure: the same scatter over 100 replications of samples of size 100 under the contaminated distribution G⁽²⁾: η-estimate (η = 0.0025), β-estimate (β = 0.1), MLE, and the true value (1, 0, 1).]
53
Selection for tuning parameter
Squared loss function:
   Loss(β̂) = ∫ { f(y, θ̂_β) − g(y) }² dy
Cross-validation:
   CV(β) = ∫ f(y, θ̂_β)² dy − (2/n) Σ_i f(x_i, θ̂_β^(−i)),   β̂ = argmin_β CV(β).
Approximate the leave-one-out estimate by the influence function:
   θ̂_β^(−i) ≈ θ̂_β − (1/(n−1)) IF(x_i, θ̂_β).
54
Example: normal with mean (0, 0) and variance (.26 −.1 / −.1 .26).
[Figure: CV(β) over β; the minimizer is β̂ = 0.07.]
   signal MLE:              mean (−0.081, 0.054),  variance (.261 −.126 / −.126 .228)
   contaminated MLE:        mean (−0.184, 0.204),  variance (.383 −.263 / −.263 1.059)
   contaminated β-estimate: mean (−0.132, 0.086),  variance (.286 −.134 / −.134 .293)
The β-estimate under contamination stays close to the MLE on the clean signal.
55
What is ICA?
Cocktail party effect
Blind source separation
[Figure: sources s₁, s₂, …, s_m mixed into observations x₁, x₂, …, x_m.]
56
ICA model
Independent signals: s = (s₁, …, s_m)ᵀ with
   p(s) = p₁(s₁) ⋯ p_m(s_m),  E(s₁) = ⋯ = E(s_m) = 0.
Linear mixture of signals: x ∈ ℝ^m with s = W(x − μ), W ∈ ℝ^{m×m}, so the model density is
   f(x, W, μ) = |det(W)| p₁(w₁ᵀ(x − μ)) ⋯ p_m(w_mᵀ(x − μ)).
Aim: learn W from a dataset (x₁, …, x_n) in which p(s) = p₁(s₁) ⋯ p_m(s_m) is unknown.
57
ICA likelihood
Log-likelihood function:
   ℓ(W, μ) = Σ_{i=1}^n Σ_{j=1}^m log p_j( w_jᵀ(x_i − μ) ) + n log |det(W)|.
Estimating equation based on
   F(x, W, μ) = { I_m − h(W(x − μ)) (W(x − μ))ᵀ } W^{−ᵀ},
where
   h(s) = ( −(∂/∂s₁) log p₁(s₁), …, −(∂/∂s_m) log p_m(s_m) ).
Natural gradient algorithm: Amari et al. (1996).
58
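A compact sketch of the natural gradient update ΔW ∝ { I − h(s)sᵀ } W (added here; the nonlinearity h(s) = tanh(s) assumes super-Gaussian sources, so the demo uses Laplace signals):

```python
import numpy as np

def natural_gradient_ica(X, lr=0.05, iters=3000):
    """Batch natural-gradient ICA in the style of Amari et al. (1996);
    h(s) = tanh(s) is a common choice for super-Gaussian sources (assumption)."""
    n, m = X.shape
    X = X - X.mean(axis=0)                 # center the data
    W = np.eye(m)
    for _ in range(iters):
        S = X @ W.T                        # current source estimates s = W x
        H = np.tanh(S)                     # nonlinearity h(s)
        G = np.eye(m) - (H.T @ S) / n      # I - E[h(s) s^T]
        W += lr * G @ W                    # natural gradient step
    return W

# demo: two Laplace (super-Gaussian) sources mixed linearly
rng = np.random.default_rng(2)
S = rng.laplace(size=(1000, 2))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = S @ A.T
W = natural_gradient_ica(X)
print(W @ A)   # approximately a scaled permutation matrix if separation succeeded
```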
Beta-ICA
β-power estimating equation: the likelihood score F is reweighted by the β-th power of the model density,
   (1/n) Σ_i f(x_i, W, μ)^β F(x_i, W, μ) − B_β(W, μ) = 0,
with a bias-correction term B_β.
Decomposability: under the model, moments of the form E[ p(s)^β h(s_s) s_t ] factorize over the independent components (s ≠ t), so the β-power equation decomposes componentwise.
59
Likelihood ICA
[Figure: 150 signals from U(0,1) × U(0,1) and the data mixed by a 2×2 mixture matrix W; likelihood ICA recovers the signals.]
60
Non-robustness
[Figure: the same data with 50 Gaussian noise points from N(0,1) × N(0,1) added; the likelihood ICA solution is badly distorted.]
61
Power-ICA (β = 0.2)
[Figure: the minimum β-power ICA solution on the contaminated data; the separation is maintained.]
62
We studied the entropy and divergence obtained by imposing projective invariance on the Tsallis power entropy.
On the family of distributions with fixed mean and variance, the maximum projective power entropy distribution model is derived in a form that includes the normal distribution, the t-distribution, and the Wigner distribution.
A natural estimation method for the mean and variance on that model is proposed, and the estimators are shown to be the sample mean and the sample variance.
63
γ-cross entropy:
   C_γ(p, q) = −(1/γ) log ∫ p(x) q(x)^γ dx + (1/(1+γ)) log ∫ q(x)^(1+γ) dx
γ-diagonal entropy:
   H_γ(p) = C_γ(p, p) = −(1/(γ(1+γ))) log ∫ p(x)^(1+γ) dx
Projective γ-power divergence:
   D_γ(p, q) = C_γ(p, q) − H_γ(p) ≧ 0
Remark: γ = 0 reduces to the Boltzmann-Shannon entropy and the Kullback-Leibler divergence.
Cf. Tsallis (1988), Basu et al. (1998), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)
64
Max γ-entropy
γ-entropy H_γ. Equal moment space: fix the mean and variance, { p : E_p(x) = μ, V_p(x) = Σ }.
Max γ-entropy: the maximizer of H_γ on the equal moment space is the γ-model, a family that includes the normal distribution, t-distributions, and the Wigner distribution, as described above.
65
γ-estimation
Parametric model: M = { f(x, θ) }.
γ-loss function:
   L_γ(θ) = −(1/γ) log{ (1/n) Σ_i f(x_i, θ)^γ } + (1/(1+γ)) log ∫ f(x, θ)^(1+γ) dx
γ-estimator: θ̂_γ = argmin_θ L_γ(θ)
Gaussian case: the γ-estimator downweights outliers by the factor f(x, θ)^γ.
Remark: Cf. Fujisawa-Eguchi (2008).
66
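A minimal sketch of γ-estimation for the normal location model with known σ (added here; since the term log ∫ f^(1+γ) does not depend on μ in this case, it is dropped from the objective):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gamma_loss_mu(mu, x, sigma, gamma):
    """Gamma-loss (Fujisawa-Eguchi type) for a normal location model,
    sigma known; constants independent of mu are omitted."""
    f_gamma = np.exp(-0.5 * gamma * ((x - mu) / sigma) ** 2)  # ∝ f(x; mu, sigma)^gamma
    return -np.log(np.mean(f_gamma)) / gamma                  # -(1/gamma) log mean f^gamma

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(10, 1, 5)])  # 5% outliers at 10
res = minimize_scalar(gamma_loss_mu, bounds=(-5, 5),
                      args=(x, 1.0, 0.5), method='bounded')
print("gamma-estimate:", res.x, "  sample mean:", x.mean())
# the gamma-estimate stays near 0 while the sample mean is dragged toward the outliers
```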
γ-estimation on γ-model
γ-loss function on the γ-model.
Theorem. The γ-estimators of the mean and variance on the γ-model are the sample mean and the sample variance.
67
Loss function vs. model (γ-estimator on the γ′-model):

   loss \ model                 normal model           γ-model (robust model)
   0-loss (= log-likelihood)    MLE                    MLE
   γ-loss                       robust γ-estimator     moment estimator
68
Mixture of independent signals
Whitening
Preprocessed form
Statistical model
γ-ICA (Gamma-ICA)
69
Boosting learning algorithms
Outline:
- Boost learning for supervised/unsupervised learning
- U-loss functions
72
Which chameleon wins?
73
Pattern recognition…
74
What is pattern recognition?
□ There are many examples of pattern recognition.
□ In principle, pattern recognition is a prediction problem for a class label that encodes human interest and importance.
□ The human brain naturally labels phenomena with a few words, for example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect), ….
□ The brain intrinsically predicts the class label from empirical evidence.
75
Practice
● Character recognition  ● Voice recognition  ● Image recognition  ● Face recognition  ● Fingerprint recognition  ● Speaker recognition
☆ Credit scoring  ☆ Medical screening  ☆ Default prediction  ☆ Weather forecast  ☆ Treatment effect  ☆ Failure prediction  ☆ Infectious disease  ☆ Drug response
76
Binary classification
Feature vector x, class label y; a classifier h(x) is built from a score F(x).
In the binary case y ∈ {+1, −1} (G = 2).
77
Probability distribution
Let p(x, y) be a pdf of the random vector (X, Y), y ∈ {1, …, G}:
   p(B, C) = Σ_{y∈C} ∫_B p(x, y) dx.
Marginal density: p(x) = Σ_y p(x, y)   (x ∈ 𝒳).
Conditional densities:
   p(y|x) = p(x, y)/p(x),   p(x|y) = p(x, y)/p(y),
so that p(x, y) = p(y|x) p(x) = p(x|y) p(y). In particular,
   p(y = 1 | x) / p(y = −1 | x) = p(x, 1) / p(x, −1).
78
Error rate
A classifier h_F from feature vector x to class label y has error rate
   Err(h_F) = Pr( h_F(x) ≠ y ) = 1 − Σ_{i=1}^G Pr( h_F(x) = i, y = i ) = Σ_{i≠j} Pr( h_F(x) = j, y = i ).
Training error: the error rate on the training set.
Test error: the error rate on an independent test set.
79
False negative/positive

                        predicted positive       predicted negative
   actually positive    True Positive            False Negative (→ FNR)
   actually negative    False Positive (→ FPR)   True Negative
80
Bayes rule
Let p(y|x) be the conditional probability of y given x, and set F₀(x) = log{ p(y=1|x) / p(y=−1|x) }.
The classifier h_Bayes(x) = sgn(F₀(x)) is called the Bayes rule.
Theorem 1. For any classifier h,
   Err(h_Bayes) ≦ Err(h).
Note: The optimal classifier is equivalent to the likelihood ratio. However, in practice p(y|x) is unknown, so we have to learn h_Bayes(x) from the training data set.
81
Error rate for Bayes rule
The discriminant space associated with the Bayes classifier is { x : F₀(x) ≥ 0 } for y = 1 and its complement for y = −1. In general, when a classifier h associates with other discriminant spaces,
   Err(h_Bayes) ≦ Err(h).   Cf. McLachlan (2002)
82
Boost learning
Boosting by filtering (Schapire, 1990)
Bagging, arcing (bootstrap) (Breiman, Friedman, Hastie)
AdaBoost (Schapire, Freund, Bartlett, Lee)
Can weak learners be combined into a strong learner?
Weak learner (classifier): error rate slightly less than 0.5.
Strong learner (classifier): error rate close to that of the Bayes classifier.
83
Set of weak learners
Decision stumps, linear classifiers, neural nets, SVM, k-nearest neighbors.
Note: not strong individually, but a variety of characters.
84
Exponential loss function
Let {(x₁, y₁), …, (x_n, y_n)} be a training (example) set.
Empirical exponential loss function for a score function F(x):
   L_exp(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }.
Expected exponential loss function for a score function F(x):
   L_exp(F) = ∫_𝒳 Σ_{y∈{−1,1}} exp{ −y F(x) } q(y|x) q(x) dx,
where q(y|x) is the conditional distribution given x and q(x) is the pdf of x.
85
AdaBoost
1. Initial: w₁(i) = 1/n (i = 1, …, n), F₀(x) = 0.
2. For t = 1, …, T:
   (a) f_t = argmin_{f∈ℱ} ε_t(f), where the weighted error rate is
       ε_t(f) = Σ_i I( f(x_i) ≠ y_i ) w_t(i) / Σ_{i'} w_t(i').
   (b) α_t = (1/2) log{ (1 − ε_t(f_t)) / ε_t(f_t) }.
   (c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }.
3. Output sign(F_T(x)), where F_T(x) = Σ_{t=1}^T α_t f_t(x).
86
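A compact runnable sketch of these three steps with decision stumps as the weak learners (added here; NumPy assumed):

```python
import numpy as np

def stumps(X):
    """Enumerate decision stumps sign(s * (x_j - c)) over features, thresholds, signs."""
    return [(j, c, s) for j in range(X.shape[1])
            for c in np.unique(X[:, j]) for s in (1, -1)]

def adaboost(X, y, T=100):
    """AdaBoost following the three steps above (y in {+1, -1})."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    candidates = stumps(X)
    model = []
    for _ in range(T):
        # (a) weak learner minimizing the weighted error rate
        j, c, s = min(candidates, key=lambda st: np.sum(
            w * (np.where(st[2] * (X[:, st[0]] - st[1]) >= 0, 1, -1) != y)))
        pred = np.where(s * (X[:, j] - c) >= 0, 1, -1)
        eps = np.clip(np.sum(w * (pred != y)) / w.sum(), 1e-12, 1 - 1e-12)
        # (b) coefficient alpha_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        # (c) multiplicative weight update
        w *= np.exp(-alpha * y * pred)
        model.append((alpha, j, c, s))
    return model

def predict(model, X):
    F = sum(a * np.where(s * (X[:, j] - c) >= 0, 1, -1) for a, j, c, s in model)
    return np.sign(F)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)       # XOR-like labels
model = adaboost(X, y, T=100)
print("train error:", np.mean(predict(model, X) != y))
```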
Learning algorithm
[Diagram: training set {(x₁, y₁), …, (x_n, y_n)} → weights w₁(1), …, w₁(n) → f⁽¹⁾(x), α₁ → weights w₂(1), …, w₂(n) → f⁽²⁾(x), α₂ → … → weights w_T(1), …, w_T(n) → f⁽ᵀ⁾(x), α_T.
Final learner: F_T(x) = Σ_{t=1}^T α_t f_t(x).]
87
Update weight
Update: multiply w_t(i) by e^{α_t} if f_t(x_i) ≠ y_i, and by e^{−α_t} if f_t(x_i) = y_i.
The worst case: the update makes the weighted error rate of the current weak learner
   ε_{t+1}(f_t) = 1/2,
that is, under the new weights f_t is no better than random guessing.
88
Claim: ε_{t+1}(f_t) = 1/2.
Proof:
   ε_{t+1}(f_t) = Σ_i I( f_t(x_i) ≠ y_i ) w_{t+1}(i) / Σ_{i'} w_{t+1}(i'),
with w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }. Splitting the sum over misclassified and correctly classified points,
   ε_{t+1}(f_t) = e^{α_t} ε_t(f_t) / { e^{α_t} ε_t(f_t) + e^{−α_t} (1 − ε_t(f_t)) }.
Substituting e^{α_t} = √{ (1 − ε_t(f_t)) / ε_t(f_t) } gives
   ε_{t+1}(f_t) = √{ε_t(1−ε_t)} / { √{ε_t(1−ε_t)} + √{ε_t(1−ε_t)} } = 1/2.
89
Update in exponential loss
Consider F(x) → F(x) + α f(x). Then
   L_exp(F + αf) = (1/n) Σ_i exp{ −y_i F(x_i) } exp{ −α y_i f(x_i) }
                 = L_exp(F) { e^{−α} (1 − ε(f)) + e^{α} ε(f) },
where
   ε(f) = Σ_i I( f(x_i) ≠ y_i ) exp{ −y_i F(x_i) } / Σ_{i'} exp{ −y_{i'} F(x_{i'}) }.
90
Sequential optimization
   L_exp(F + αf) = L_exp(F) { (1 − ε(f)) e^{−α} + ε(f) e^{α} }.
Minimizing over α:
   α_opt = (1/2) log{ (1 − ε(f)) / ε(f) },
since
   (1 − ε(f)) e^{−α} + ε(f) e^{α} ≧ 2 √{ ε(f)(1 − ε(f)) },
with equality if and only if ε(f) e^{α} = (1 − ε(f)) e^{−α}.
91
AdaBoost = sequential minimization of the exponential loss
(a) f_t = argmin_{f∈ℱ} ε_t(f)
(b) α_t = argmin_{α∈ℝ} L_exp(F_{t−1} + α f_t) = (1/2) log{ (1 − ε_t(f_t)) / ε_t(f_t) }
(c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }
   min_{α∈ℝ} L_exp(F_{t−1} + α f_t) = 2 L_exp(F_{t−1}) √{ ε_t(f_t)(1 − ε_t(f_t)) }.
92
Simulation (completely separable)
[Figure: feature space [−1,1] × [−1,1] with a completely separable decision boundary.]
93
93
-1 -0.5 0.5 1
-1
-0.5
0.5
1
Set of linear classifiers
Linear classification machines
Random generation
94
Learning process
[Figure: decision regions over the iterations. Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10; Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.0895.]
Final decision boundary
[Figure: contour of F(x) and sign(F(x)).]
96
Learning curve
[Figure: training error versus iteration number (up to 250 iterations); the training curve decreases from about 0.2 to below 0.05.]
97
KL divergence
Feature space 𝒳, label set 𝒴.
For conditional distributions p(y|x), q(y|x) given x with common marginal density p(x) we can write
   D_KL(p, q) = ∫_𝒳 Σ_{y∈𝒴} p(y|x) log{ p(y|x) / q(y|x) } p(x) dx.
Note: D_KL is a nonnegative function of (p, q).
98
Twin of KL loss function
For a data distribution q(x, y) = q(x) q(y|x) we model the conditional distribution by a score function F(x, y), in unnormalized form
   m₁(y|x) = exp{ F(x, y) − b(x) }
or in normalized (softmax) form
   m₂(y|x) = exp{ F(x, y) } / Σ_{g=1}^G exp{ F(x, g) }.
Then
   D_KL(q, m) = ∫_𝒳 Σ_{y∈{−1,1}} q(y|x) log{ q(y|x) / m(y|x) } q(x) dx
leads to the exp loss for m₁ and the log loss for m₂ — the twin losses.
99
Bound for exponential loss
Empirical exp loss
Expected exp loss
Theorem.
100
Variational calculus
The minimizer of the expected exponential loss is
   F*(x) = (1/2) log{ q(y=1|x) / q(y=−1|x) },
so sign(F*(x)) is the Bayes rule.
101
On AdaBoost
FDA or logistic regression: a parametric approach to the Bayes classifier.
AdaBoost: also a parametric approach to the Bayes classifier, but the dimension and the basis functions are flexible.
The stopping time T can be selected according to the state of learning.
102
U-Boost
U-empirical loss function in the context of classification:
unnormalized U-loss and normalized U-loss.
103
U-Boost (binary)
Unnormalized U-loss.
Note: Bayes risk consistency — the optimal score F*(x) is Bayes risk consistent.
104
Eta-loss function
The eta-loss gives a regularized AdaBoost with margin.
105
EtaBoost for mislabels
Expected eta-loss function.
Optimal score: the variational argument leads to the optimal score under a mislabel model.
106
EtaBoost
1. Initial settings: w₁(i) = 1/n (i = 1, …, n), F₀(x) = 0.
2. For m = 1, …, T:
   (a) f_m = argmin_f ε_m(f), where ε_m(f) = Σ_i I( f(x_i) ≠ y_i ) w_m(i).
   (b) Determine the coefficient α_m by minimizing the eta-loss.
   (c) w_{m+1}(i) = w_m(i) exp{ −α_m y_i f_m*(x_i) }.
3. Output sign(F_T(x)), where F_T(x) = Σ_{t=1}^T α_t f_t(x).
107
A toy example
108
Examples partly mislabeled
109
AdaBoost vs. EtaBoost
[Figure: decision boundaries of AdaBoost (left) and EtaBoost (right) on the partly mislabeled examples.]
110
ROC curve, AUC
Discriminant function F : 𝒳 → ℝ with threshold c:
   FPR(c) = P( F(X) ≥ c | Y = 0 )   (false positive rate)
   TPR(c) = P( F(X) ≥ c | Y = 1 )   (true positive rate)
   ROC(F) = { (FPR(c), TPR(c)) : c ∈ ℝ }
111
Data (X, Y), X ∈ ℝ^p, Y ∈ {0, 1}. Prediction: positive, Ŷ = 1, if F(X) ≥ c; negative, Ŷ = 0, if F(X) < c.
   AUC(F) = ∫ TPR(c′) dFPR(c′),
the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the threshold c varies.
[Figure: ROC curve with the AUC shaded.]
Cf. Bamber (1975), Pepe et al. (2003)
112
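The AUC equals the Wilcoxon-Mann-Whitney statistic P( F(X₁) > F(X₀) ), with ties counted half; a minimal sketch (added here; NumPy assumed):

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """Empirical AUC = P(F(X1) > F(X0)) + 0.5 P(tie): the Wilcoxon-Mann-Whitney statistic."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

rng = np.random.default_rng(5)
neg = rng.normal(0.0, 1.0, 300)     # scores F(X) for Y = 0
pos = rng.normal(1.0, 1.0, 200)     # scores F(X) for Y = 1
print("AUC:", roc_auc(pos, neg))    # theoretical value Phi(1/sqrt(2)) ≈ 0.76
```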
Results of 5-fold cross-validation for λ (AUCBoost)
118
Relation between the two-sample problem and AUC boosting
The AUCBoost learning algorithm is designed so that the Wilcoxon statistic of the discriminant function F_t(x) obtained at step t,
   Ĉ(F_t) = (1/(n₀ n₁)) Σ_{k=1}^{n₀} Σ_{i=1}^{n₁} I( F_t(x_i¹) ≥ F_t(x_k⁰) ),
increases in the steepest direction at step t + 1:
   F_{t+1}(x) = F_t(x) + α_{t+1} f_{t+1}(x),   Ĉ(F_t) ≦ Ĉ(F_{t+1}).
In this way all the single-point (single-marker) analyses are integrated; in this sense any single-point analysis can be incorporated into the boosting. Eguchi-Komori (2009)
122
Partial AUC
   pAUC(c) = ∫_{FPR ≦ FPR(c)} TPR(c′) dFPR(c′) = P( F(X₁) ≥ F(X₀), F(X₀) ≥ c ),
the area under the ROC curve restricted to false positive rates below FPR(c).
[Figure: score densities F(x) of healthy and diseased subjects with threshold c; ROC curve with the pAUC region up to (FPR(c), TPR(c)) shaded.]
125
Robustness for noninformative genes
127
Robustness for noninformative genes
128
Generating function U
True positive probability
129
Boost learning algorithm
130
Bayes risk consistency
131
Benchmark data
132
True density
[Figure: surface plot of the true density on [−4, 4] × [−4, 4].]
W: dictionary
133
U-Boost learning
U-loss function:
   L_U(f) = −(1/n) Σ_{i=1}^n ξ(f(x_i)) + ∫ U(ξ(f(x))) dx   (f a pdf on ℝ^p).
Dictionary of density functions:
   W = { g(x) : g(x) ≧ 0, ∫ g(x) dx = 1 }.
Learning space = U-model: W_U* = co(W), with elements f(x, π) = Σ_λ π_λ g_λ(x); note f(x, (0, …, 0, 1, 0, …, 0)) = g_λ(x), so W ⊂ W_U*.
Goal: find f_U* = argmin_{f ∈ W_U*} L_U(f).
134
Inner step in the convex hull
[Figure: dictionary members g₁, …, g₇ with the current fit f(x) moving inside the convex hull W_U* = co(W).]
Goal: f̂(x) = α̂₁ f̂₁(x) + ⋯ + α̂_K f̂_K(x) approximating f_U* = argmin_{f ∈ W_U*} L_U(f).
135
Convergence of the algorithm
π-mixture update:
   f_{K+1}(x) = (1 − π_{K+1}) f_K(x) + π_{K+1} g_{K+1}(x).
One-step decrease:
   L_U(f_k) − L_U(f_{k+1}) ≧ D_U(f_{k+1}, f_k),
and summing over steps,
   L_U(f₁) − L_U(f_{k+1}) ≧ Σ_{j=1}^k D_U(f_{j+1}, f_j).
136
Simplified boosting
Initialize f₁ = argmin_{g∈W} L_U(g).
For k = 1, …, K − 1:
   g_{k+1} = argmin_{g∈W} L_U( (1 − c_{k+1}) f_k + c_{k+1} g ),
   f_{k+1}(x) = (1 − c_{k+1}) f_k(x) + c_{k+1} g_{k+1}(x),
with step lengths c_k (e.g. a constant c).
The final estimate f̂_K(x) is the resulting mixture Σ_{k=1}^K λ_k g_k(x).
137
Non-asymptotic bound
Theorem. Assume that the data distribution has a density g(x) and that
   (A)  sup { | ξ(f(x)) − ξ(f′(x)) | : f, f′ ∈ co(W), x } ≦ b_U.
Then we have
   E D_U(g, f̂_K) ≦ FA(W, g) + EE(W, n) + IE(W, K),
where
   FA(W, g) = inf_{f ∈ W_U*} D_U(g, f)   (functional approximation),
   EE(W, n) = 2 E sup_f | (1/n) Σ_i ξ(f(x_i)) − E ξ(f(X)) |   (estimation error),
   IE(W, K) = c² b_U / K   (iteration effect; c: step-length constant).
Remark: there is a trade-off between FA(W, g) and EE(W, n) as the dictionary W grows.
138
Lemmas
Lemma 1. Let f̃ be a density estimator and f* = Σ_{j=1}^N π_j f_j. Then
   L_U( (1−λ) f̃ + λ f* ) ≦ Σ_j π_j L_U( (1−λ) f̃ + λ f_j ) + λ² γ(f̃, f*, λ).
Lemma 2. If f̃ is in co(W), then γ(f̃, f*, λ) ≦ b_U²   (λ ∈ [0, 1]).
Lemma 3. Under assumption (A),
   L_U(f̂_K) − inf_f L_U(f) ≦ c² b_U / K.
Lemma 4. If L_U(f̂_K) ≦ inf_f L_U(f) + δ, then
   E D_U(g, f̂_K) ≦ inf_f D_U(g, f) + δ + 2 E sup_f | (1/n) Σ_i ξ(f(x_i)) − E ξ(f(X)) |.
139
[Figure: density estimates by KDE, RSDE, and the two U-Boost variants on three benchmark densities: C (skewed unimodal), L (quadrimodal), H (bimodal).]
140
Future of information geometry
Information geometry, at the crossroads of geometry, statistics, learning, and quantum theory, may deepen toward:
- Optimal transport
- Free probability
- Semiparametrics
- Compressed sensing
- Poincaré conjecture
- Dynamical systems
- Mathematical evolutionary theory
- Dempster-Shafer theory
141
Poincaré conjecture (1904)
Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere.
Ricci flow by Richard Hamilton (1981); Perelman (2002, 2003).
Optimal control theory due to Pontryagin and Bellman.
Optimal transport: the Wasserstein distance
   d_W(f, g)² = min { E‖X − Y‖² : X ~ f, Y ~ g }.
142
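In one dimension the optimal coupling for this quadratic cost is the monotone rearrangement, so the empirical W₂ distance between equal-size samples can be computed by sorting; a minimal sketch (added here):

```python
import numpy as np

def w2_empirical(x, y):
    """2-Wasserstein distance between two 1-D empirical measures of equal size:
    the optimal coupling pairs the sorted samples (monotone rearrangement)."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(6)
f = rng.normal(0.0, 1.0, 10000)
g = rng.normal(2.0, 1.0, 10000)
print(w2_empirical(f, g))   # ≈ 2: for equal-variance Gaussians W2 equals |mean difference|
```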
Optimal transport
Theorem (Brenier, 1991)
Talagrand inequality
Log Sobolev inequality
Optimal transport theory is being extended to general manifolds.
Geometers consider not a space itself, but a family of distributions on the space.
C. Villani received the 2010 Fields Medal for optimal transport theory.
143
Thank you
144