Information Geometry
Shinto Eguchi
The Institute of Statistical Mathematics, Japan
Email: [email protected]
URL: http://www.ism.ac.jp/~eguchi/
2013 ISM Summer Graduate School, ISM Seminar Room 1, September 26, 14:40-17:50
ISM Summer School 2013
Outline
Information geometry
 - historical remarks
 - treasure example
Information divergence class
 - robust statistical methods
 - robust ICA
 - robust kernel PCA
2
Historical comment
Information geometry lies at the crossroads of geometry, statistics, learning, and quantum theory.
R.A. Fisher (1922) → C.R. Rao (1945) → B. Efron (1975 paper, with a comment by A.P. Dawid) → S. Amari (1982 paper) → 1984 workshop (RSS 150).
3
Dual Riemannian Geometry
Dual Riemannian geometry gives a reformulation of Fisher's foundation.
Cf. Einstein field equations (1916).
Information geometry aims to geometrize the dualistic structure between modeling and estimation.
Estimation is projection of data onto the model: "estimation is an action by projection, and the model is an object to be projected."
The interplay of action and object is elucidated by a geometric structure.
Cf. Erlangen program (Klein, 1872), http://en.wikipedia.org/wiki/Erlangen_program
4
2 × 2 table
The space of all 2×2 tables is associated with a regular tetrahedron.
The space of all independent 2×2 tables is associated with a ruled surface in the regular tetrahedron.
We know the ruled surface is exponential geodesic, but does anyone know whether the ruled surface is minimal?
5
Regular tetrahedron
[Figure: regular tetrahedron with vertices P, Q, R, S given by the degenerate tables (0 1 / 0 0), (0 0 / 0 1), (1 0 / 0 0), (0 0 / 1 0); the barycenter is the uniform table (1/4 1/4 / 1/4 1/4); edge points have the form (0 p / 0 1−p) and (1−p 0 / p 0).]
6
Regular tetrahedron
[Figure: the same tetrahedron with vertices P, Q, R, S, showing mixtures of edge tables with weights 1/3 and 2/3, e.g. (1/3)·(1/3 0 / 2/3 0) + (2/3)·(0 1/3 / 0 2/3) and (1/3)·(1/3 2/3 / 0 0) + (2/3)·(0 0 / 1/3 2/3).]
7
Ruled surface
[Figure: two views of the regular tetrahedron P, Q, R, S with the ruled surface of independent tables inside.]
8
Observed table with cell counts and margins:

        A    B
   C    x    y    e
   D    z    w    f
        g    h    n        (e = x+y, f = z+w, g = x+z, h = y+w, n = x+y+z+w)

Independence table with cell probabilities:

        A           B
   C    pq          p(1−q)        p
   D    (1−p)q      (1−p)(1−q)    1−p
        q           1−q           1

The χ² statistic is χ² = n(xw − yz)²/(efgh). For the independence table the cross-product difference vanishes,
   n { pq·(1−p)(1−q) − p(1−q)·(1−p)q }² = 0,
so the independence surface consists of the tables with odds ratio 1, and it is e-geodesical.
9
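A quick numerical illustration of this fact (a sketch added here, not from the slides; NumPy assumed): the e-geodesic, i.e. the normalized geometric mean p^(1−t) q^t, between two independent tables stays on the ruled surface of odds ratio 1.

```python
import numpy as np

def e_geodesic(p, q, t):
    """Normalized geometric mean p^(1-t) * q^t: the e-geodesic in the simplex."""
    r = p**(1 - t) * q**t
    return r / r.sum()

def odds_ratio(table):
    """Odds ratio pi11*pi22 / (pi12*pi21) of a table flattened as [p11, p12, p21, p22]."""
    return table[0] * table[3] / (table[1] * table[2])

# two independent 2x2 tables: outer products of marginal distributions
p = np.outer([0.3, 0.7], [0.6, 0.4]).ravel()
q = np.outer([0.8, 0.2], [0.1, 0.9]).ravel()

for t in np.linspace(0.0, 1.0, 6):
    print(f"t = {t:.1f}   odds ratio = {odds_ratio(e_geodesic(p, q, t)):.6f}")
# every point on the e-geodesic has odds ratio 1, i.e. it stays on the ruled surface
```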
The simplex S₃ of all 2×2 probability tables is mapped one-to-one onto ℝ³ by the log-ratio coordinates
   θ(π) = ( log(π₁₁/π₂₂), log(π₁₂/π₂₂), log(π₂₁/π₂₂) ),  where π₁₁ + π₁₂ + π₂₁ + π₂₂ = 1.
On the independence surface, π = (pq, p(1−q), (1−p)q, (1−p)(1−q)) with 0 < p, q < 1, these coordinates become
   ( log{pq/((1−p)(1−q))}, log{p/(1−p)}, log{q/(1−q)} ).
[Figure: 3D plots of the surface in the θ-coordinates.]
10
Two parametrizations
The independence surface has two natural parametrizations in (p, q):
   m-parametrization: (p, q) ↦ ( pq, p(1−q), (1−p)q, (1−p)(1−q) )
   e-parametrization: (p, q) ↦ ( log{pq/((1−p)(1−q))}, log{p/(1−p)}, log{q/(1−q)} )
[Figure: the surface plotted in the two coordinate systems.]
11
Kullback-Leibler Divergence
Let p(x) and q(x) be probability density functions. Then the Kullback-Leibler divergence is defined by
   D_KL(p, q) = ∫ p(x) log{ p(x)/q(x) } dx.
[Figure: three densities p, q, r.]
12
Two one-parameter families
Let P be the space of all pdfs on a data space. Two one-parameter families (curves) connect p and q in P:
   m-geodesic: p_t(x) = (1−t) p(x) + t q(x)
   e-geodesic: p_t(x) = c_t p(x)^(1−t) q(x)^t,  with normalizing constant c_t.
[Figure: p, q, r with the m-geodesic and the e-geodesic.]
13
Pythagoras theorem, Amari-Nagaoka (1982)
If the m-geodesic connecting p and q is orthogonal (in the Fisher metric) at q to the e-geodesic connecting q and r, then
   D_KL(p, r) = D_KL(p, q) + D_KL(q, r).
14
Proof
D_KL(p, q) + D_KL(q, r) − D_KL(p, r) = ∫ ( p(x) − q(x) ) ( log r(x) − log q(x) ) dx.
The integrand pairs the m-tangent vector p − q of the m-geodesic at q with the e-tangent vector log r − log q of the e-geodesic at q, so orthogonality makes the integral vanish. ∎
15
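A minimal numerical check of the theorem (an added sketch; NumPy assumed): project an arbitrary 2×2 table p onto the e-flat independence model by matching marginals (giving q), pick any other independent table r, and compare the two sides.

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence."""
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))            # arbitrary 2x2 table [p11, p12, p21, p22]

# q = projection of p onto the independence model: product of the marginals of p
row = np.array([p[0] + p[1], p[2] + p[3]])
col = np.array([p[0] + p[2], p[1] + p[3]])
q = np.outer(row, col).ravel()

# r = any other point of the (e-flat) independence model
r = np.outer([0.3, 0.7], [0.9, 0.1]).ravel()

print(kl(p, r), kl(p, q) + kl(q, r))     # the two numbers agree
```

The agreement is exact here: since q shares the marginals of p and log(q/r) is additive across rows and columns on the independence model, the cross term ∫(p − q) log(q/r) vanishes identically.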
ABC in differential geometry
Riemannian metric: g defines an inner product on any tangent space of a manifold M,
   g : 𝔛(M) × 𝔛(M) → F(M),  (u, v) ↦ g(u, v).
Geodesic: a minimal arc between x and y,
   c(x, y) = argmin ∫₀¹ g(ẋ(t), ẋ(t)) dt  over { x(t) : x(0) = x, x(1) = y }.
Linear connection: ∇ defines parallelism along a vector field,
   ∇ : 𝔛(M) × 𝔛(M) → 𝔛(M),
   (1) ∇_{fX} Y = f ∇_X Y,
   (2) ∇_X (fY) = f ∇_X Y + (Xf) Y   (f ∈ F(M); X, Y ∈ 𝔛(M)).
Componentwise, (∇_X Y)^k = X(Y^k) + Σ_{i,j} Γ_{ij}^k X^i Y^j. Cf. slide 41.
16
Geodesic
A one-parameter family (curve) C = {θ(t) : t} is called a geodesic with respect to a linear connection {Γ_{ij}^k(θ) : i, j, k = 1, …, p} if
   θ̈^k(t) + Σ_{i=1}^p Σ_{j=1}^p Γ_{ij}^k(θ(t)) θ̇^i(t) θ̇^j(t) = 0   (k = 1, …, p),
that is, the acceleration of the curve vanishes up to the connection terms (cf. Newton's law of inertia).
Remark 1: If Γ_{ij}^k(θ) = 0 (i, j, k = 1, …, p), any geodesic is a line.
Remark 2: This property is not invariant under reparametrization. Expressing the geodesic C by a transform from parameter θ to ω gives
   ω̈^a(t) + Σ_{b,c=1}^p Γ̃_{bc}^a(ω(t)) ω̇^b(t) ω̇^c(t) = 0   (a = 1, …, p),
with the transformed coefficients Γ̃ of the next slide.
17
Change rule on connections
Let us consider a transform from parameter θ to ω = ω(θ), with Jacobian B^a_i = ∂ω^a/∂θ^i and inverse B̄^i_a. The geodesic C is written as {ω(t) = ω(θ(t))}. Hence we have
   ω̇^a(t) = Σ_{i=1}^p B^a_i(θ(t)) θ̇^i(t),
   ω̈^a(t) = Σ_{i=1}^p B^a_i(θ(t)) θ̈^i(t) + Σ_{i,j} (∂B^a_i/∂θ^j)(θ(t)) θ̇^i(t) θ̇^j(t)   (a = 1, …, p),
so the geodesic equation holds in the ω-parameter with
   Γ̃_{bc}^a = Σ_{i,j,k=1}^p B̄^j_b B̄^k_c ( B^a_i Γ_{jk}^i − ∂B^a_k/∂θ^j ).
18
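A worked one-dimensional example of this rule (added for illustration, not on the original slide): start from the flat connection Γ ≡ 0 in θ, whose geodesics are straight lines, and reparametrize by ω = e^θ.

```latex
% Flat connection in \theta (p = 1): geodesics satisfy \ddot\theta(t) = 0.
% Reparametrize by \omega = e^{\theta}:
\dot\omega = \omega\,\dot\theta, \qquad
\ddot\omega = \omega\,\ddot\theta + \omega\,\dot\theta^{2}
\;\Longrightarrow\;
\ddot\omega - \frac{\dot\omega^{2}}{\omega} = 0 ,
% so in the \omega-coordinate the same straight lines obey the geodesic
% equation with the nonzero coefficient
\tilde\Gamma(\omega) = -\frac{1}{\omega} \neq 0 ,
% confirming that vanishing of the connection coefficients is not invariant
% under reparametrization.
```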
What are reference books?
- Foundations of Differential Geometry (Wiley Classics Library), Shoshichi Kobayashi and Katsumi Nomizu.
- Methods of Information Geometry, Shun-ichi Amari and Hiroshi Nagaoka, American Mathematical Society (2001).
- Differential-Geometrical Methods in Statistics, Shun-ichi Amari, Springer (1985).
19
Statistical model and information
Statistical model: M = { p(x, θ) : θ ∈ Θ ⊂ ℝ^p }
Score vector: s(x, θ) = (∂/∂θ) log p(x, θ)
Space of score vectors: T_θ = { αᵀ s(x, θ) : α ∈ ℝ^p }
Fisher information metric: g_θ(u, v) = E_θ{ u v }   (u, v ∈ T_θ)
Fisher information matrix: I_θ = E_θ{ s(x, θ) s(x, θ)ᵀ }
20
Score vector space
Note 1: For u = αᵀ s(x, θ) ∈ T_θ,
   E_θ(u) = ∫ αᵀ s(x, θ) p(x, θ) dx = αᵀ ∫ (∂/∂θ) p(x, θ) dx = 0,
and for u = αᵀ s(x, θ), v = βᵀ s(x, θ),
   g_θ(u, v) = ∫ αᵀ s(x, θ) s(x, θ)ᵀ β p(x, θ) dx = αᵀ I_θ β.
Note 2: Let
   S_θ = { t(x, θ) : E_θ(t(x, θ)) = 0, V_θ(t(x, θ)) < ∞ },
the space of all random variables with mean 0 and finite variance. S_θ is an infinite-dimensional vector space that includes T_θ; for example, for u = αᵀ s(x, θ) ∈ T_θ the centered powers u^k − E_θ{u^k} (k = 1, 2, …) all belong to S_θ. Cf. Tsiatis (2006).
21
Linear connections
With score components s(x, θ) = (s_i(x, θ))_{i=1,…,p} and s_{ij} = (∂²/∂θ_i∂θ_j) log p(x, θ):
e-connection: ᵉΓ_{ij}^k(θ) = Σ_{i'=1}^p g^{k i'}(θ) E_θ{ s_{ij} s_{i'} }
m-connection: ᵐΓ_{ij}^k(θ) = Σ_{i'=1}^p g^{k i'}(θ) E_θ{ (s_{ij} + s_i s_j) s_{i'} }
The e-connection makes exponential families flat (e-geodesics); the m-connection makes mixture families flat (m-geodesics).
22
Semiparametric theory
Semiparametric Theory and Missing Data (Springer Series in Statistics), Anastasios Tsiatis:
1 Introduction to Semiparametric Models
2 Hilbert Space for Random Vectors
3 The Geometry of Influence Functions
4 Semiparametric Models
5 Other Examples of Semiparametric Models
6 Models and Methods for Missing Data
7 Missing and Coarsening at Random for Semiparametric Models
8 The Nuisance Tangent Space and Its Orthogonal Complement
9 … 14.
23
Geodesical models
Definition
(i) A statistical model M is said to be e-geodesical if p, q ∈ M implies
   c_t p(x)^(1−t) q(x)^t ∈ M   (t ∈ (0,1)),  where 1/c_t = ∫ p(x)^(1−t) q(x)^t dx.
(ii) A statistical model M is said to be m-geodesical if p, q ∈ M implies
   (1−t) p(x) + t q(x) ∈ M   (t ∈ (0,1)).
Note: Let P be the space of all probability density functions. By definition P is e-geodesical and m-geodesical. However, the theoretical framework for P is not perfectly complete. Cf. Pistone and Sempi (1995, Ann. Statist.).
24
Two types of modeling
Let p₀(x) be a pdf and t(x) a p-dimensional statistic.
Exponential model: M⁽ᵉ⁾ = { p(x, θ) = p₀(x) exp{ θᵀ t(x) − κ(θ) } : θ ∈ Θ },  Θ = { θ ∈ ℝ^p : κ(θ) < ∞ }.
Matched mean model: M⁽ᵐ⁾ = { p : E_p{t(x)} = E_{p₀}{t(x)} }.
Let { p_i(x) : i = 0, 1, …, I } be a set of pdfs.
Exponential model: M⁽ᵉ⁾ = { p(x, θ) = p₀(x) exp{ Σ_{i=1}^I θ_i log(p_i(x)/p₀(x)) − κ(θ) } : θ ∈ Θ }.
Mixture model: M⁽ᵐ⁾ = { p(x, θ) = (1 − Σ_{i=1}^I θ_i) p₀(x) + Σ_{i=1}^I θ_i p_i(x) : θ ∈ Θ },
   Θ = { (θ₁, …, θ_I) : 0 < θ_i < 1 (i = 1, …, I), Σ_i θ_i < 1 }.
25
Statistical functional
A functional f(p) defined for pdfs p(x) is called a statistical functional (Hampel, 1974).
A statistical functional f(p) is said to be Fisher-consistent for a model M = { p(x, θ) : θ ∈ Θ } if
   (1) f(p) ∈ Θ, and
   (2) f(p_θ) = θ for all θ ∈ Θ.
Example. For the normal model
   M = { p(x, θ) = (2πσ²)^(−1/2) exp{ −(x−μ)²/(2σ²) } : θ = (μ, σ²) ∈ ℝ × ℝ₊ },
the functional
   f(p) = ( ∫ x p(x) dx,  ∫ x² p(x) dx − (∫ x p(x) dx)² )
is Fisher-consistent for θ = (μ, σ²).
26
Transversality
Let f(p) be a Fisher-consistent functional. Define
   L_θ(f) = { p : f(p) = θ },
which is called a leaf transverse to M. Then
   (1) L_θ(f) ∩ M = { p_θ }, and
   (2) ∪_θ L_θ(f) is a local neighborhood (of M).
[Figure: model M through p_θ and p_θ*, with transverse leaves L_θ(f) and L_θ*(f).]
27
Foliation structure
Statistical model: M = { p(x, θ) : θ ∈ Θ ⊂ ℝ^p }.
Foliation: P = ∪_θ L_θ(f).
Decomposition of tangent spaces: T_p(P) = T_p(M) ⊕ T_p(L_θ(f)), i.e. any t(x, θ) ∈ T_p(P) is written as
   t(x, θ) = u(x, θ) + v(x, θ)
with u(x, θ) ∈ T_p(M) and v(x, θ) ∈ T_p(L_θ(f)).
[Figure: leaf L_θ(f) crossing M at p_θ, with tangent vectors u and v.]
28
Transversality for MLE
Exponential model: M⁽ᵉ⁾ = { p(x, θ) = p₀(x) exp{ θᵀ t(x) − κ(θ) } : θ ∈ Θ },
where p₀(x) is a pdf and t(x) a p-dimensional statistic.
For the exponential model M⁽ᵉ⁾ the MLE functional
   f_ML(p) = argmax_θ E_p{ log p(x, θ) }
is written as
   f_ML(p) = θ such that E_p{t(x)} = E_{p(·,θ)}{t(x)},
so the leaf is
   L_θ(f_ML) = { p : E_p{t(x)} = E_{p(·,θ)}{t(x)} }.
Hence the foliation associated with f_ML consists of matched mean models.
29
Maximum likelihood foliation
[Figure: the exponential model M⁽ᵉ⁾ with leaves L_θ(f_ML) through p_θ; densities p₁, p₂, p₃ and r₁, r₂ lying on leaves.]
30
Estimating function
Statistical model: M = { p(x, θ) : θ ∈ Θ ⊂ ℝ^p }.
A p-variate function u(x, θ) is unbiased if
   E_{p(·,θ)}{ u(x, θ) } = 0  and  det E_{p(·,θ)}{ (∂/∂θ) u(x, θ) } ≠ 0   (∀θ ∈ Θ).
The statistical functional
   f(p) = the solution θ of E_p{ u(x, θ) } = 0
is Fisher-consistent, and the leaf transverse to M,
   L_θ(f) = { p : E_p{ u(x, θ) } = 0 },
is m-geodesical.
31
Minimum divergence geometry
Let D : M × M → ℝ be an information divergence on a statistical model M:
(i) D(p, q) ≧ 0 with equality if and only if p = q;
(ii) D is differentiable on M × M.
Then we get a Riemannian metric and dual connections on M (Eguchi, 1983, 1992):
   g⁽ᴰ⁾_{ij}(θ) = −(∂/∂θ_i)(∂/∂θ'_j) D(p_θ, p_θ') |_{θ'=θ},
   Γ⁽ᴰ⁾_{ij,k}(θ) = −(∂²/∂θ_i∂θ_j)(∂/∂θ'_k) D(p_θ, p_θ') |_{θ'=θ},
   Γ⁽ᴰ*⁾_{ij,k}(θ) = −(∂²/∂θ'_i∂θ'_j)(∂/∂θ_k) D(p_θ, p_θ') |_{θ'=θ}.
32
Remarks
g⁽ᴰ⁾ is a Riemannian metric, and Γ⁽ᴰ⁾, Γ⁽ᴰ*⁾ form a dual pair of affine connections with respect to g⁽ᴰ⁾. Cf. Slide 21.
33
U cross-entropy
Let U be a convex function with derivative u = U′ and inverse ξ = (U′)⁻¹.
U cross-entropy: C_U(p, q) = ∫ { U(ξ(q(x))) − p(x) ξ(q(x)) } dx
U entropy: H_U(p) = C_U(p, p)
U divergence: D_U(p, q) = C_U(p, q) − H_U(p) ≧ 0
34
Example of U divergence
KL divergence: U(t) = exp(t), ξ(s) = log s, giving D_U(p, q) = ∫ p(x) log{ p(x)/q(x) } dx.
Beta (power) divergence: U_β(t) = (1/(1+β)) (1 + βt)^((1+β)/β), giving
   D_β(p, q) = ∫ { p(x)(p(x)^β − q(x)^β)/β − (p(x)^(1+β) − q(x)^(1+β))/(1+β) } dx.
Note: β → 0 recovers the KL divergence.
35
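A small numerical sketch of these examples (added here; the discrete form of the beta divergence below is the standard one, stated as an assumption): as β → 0 the beta divergence approaches the KL divergence.

```python
import numpy as np

def beta_divergence(p, q, beta):
    """Discrete beta (power) divergence; beta = 0 is the KL limit."""
    if beta == 0:
        return np.sum(p * np.log(p / q))
    return np.sum(p * (p**beta - q**beta) / beta
                  - (p**(1 + beta) - q**(1 + beta)) / (1 + beta))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
for b in [0.5, 0.1, 0.01, 0.001, 0]:
    print(b, beta_divergence(p, q, b))   # values approach the KL divergence as beta -> 0
```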
Geometric formula with D_U
The U-divergence D_U induces, by the minimum divergence construction above, a Riemannian metric and a pair of dual connections on M; the corresponding dual geodesics are the mixture geodesic and the U-geodesic of the next slide.
36
Triangle with D_U
If the mixture geodesic connecting p and q meets the U-geodesic connecting q and r orthogonally at q, then
   D_U(p, r) = D_U(p, q) + D_U(q, r).
[Figure: triangle of p, q, r with the mixture geodesic and the U-geodesic.]
37
Information divergence class and robust statistical methods
38
Light and shadow of MLE
Light:
1. Invariance under data transformations
2. Asymptotic efficiency (sufficiency and efficiency)
Shadow:
1. Non-robustness
2. Over-fitting (log-likelihood on an exponential family)
The likelihood method is generated by the pair (log, exp); the U-method replaces it by a general convex pair (ξ, u).
39
Max U-entropy distribution
Let us fix a statistic t(x). Equal mean space: { p : E_p{t(x)} = μ }.
Euler-Lagrange first variation of the U-entropy H_U under this constraint gives the maximizer
   p_θ(x) = u( θᵀ t(x) − κ_θ ),
the U-model.
40
U-estimate
Let p(x) be a data density function with statistical model M = { p(x, θ) }.
U-loss function: L_U(θ) = ∫ U(ξ(p(x, θ))) dx − ∫ ξ(p(x, θ)) p(x) dx
U-empirical loss function: L̄_U(θ) = ∫ U(ξ(p(x, θ))) dx − (1/n) Σ_i ξ(p(x_i, θ))
U-estimate: θ̂_U = argmin_θ L̄_U(θ)
U-estimating function: the gradient equation ∂L̄_U(θ)/∂θ = 0
41
U-minimax
Equal mean space: let us fix a statistic t(x) and consider { p : E_p{t(x)} = μ }; the max U-entropy distribution and the U-estimate form a minimax pair on this space.
42
U-estimator for U-model
U-model: M_U = { p(x, θ) = u( θᵀ t(x) − κ_θ ) }.
U-estimator of the mean parameter μ = E_θ{t(x)}:
We observe that the U-estimating equation on the U-model reduces to matching the mean of t(x), which implies
   μ̂ = (1/n) Σ_i t(x_i).
Furthermore, the U-estimator on the U-model is the moment estimator.
43
Influence function
Statistical functional:
   T_U(G) = argmin_θ { ∫ U(ξ(p(x, θ))) dx − ∫ ξ(p(x, θ)) dG(x) }
Influence function, with G_ε = (1−ε) F + ε δ_x:
   IF(x, T_U) = (∂/∂ε) T_U(G_ε) |_{ε=0}
   IF(x, T_U) = J_U(θ)⁻¹ [ w(x, θ) S(x, θ) − E{ w(x, θ) S(x, θ) } ]
Gross error sensitivity:
   GES(T_U) = sup_x ‖ IF(x, T_U) ‖
44
Efficiency
Asymptotic variance:
   √n (θ̂_U − θ) →_D N( 0, J_U(θ)⁻¹ H_U(θ) J_U(θ)⁻¹ ),
where
   J_U(θ) = E{ w(X, θ) S(X, θ) S(X, θ)ᵀ },
   H_U(θ) = Var( w(X, θ) S(X, θ) ).
Information inequality:
   J_U(θ)⁻¹ H_U(θ) J_U(θ)⁻¹ ≧ I(θ)⁻¹
(equality holds iff U(x) = exp(x)).
45
Normal mean
[Figure: influence functions for the normal mean. Left: β-power estimates with β = 0, 0.01, 0.05, 0.075, 0.1, 0.125. Right: η-sigmoid estimates with η = 0, 0.015, 0.15, 0.4, 0.8, 2.5. The case β = 0 (η = 0), i.e. U(t) = exp(t), is the MLE, whose influence function is unbounded.]
46
Gross Error Sensitivity

   β-power estimates              η-sigmoid estimates
   β       efficiency  GES        η       efficiency  GES
   0       1           ∞          0       1           ∞
   0.01    0.97        6.16       0.015   0.972       1.9
   0.05    0.861       2.92       0.15    0.873       1.04
   0.075   0.799       2.47       0.4     0.802       0.678
   0.1     0.742       2.21       0.8     0.753       0.455
   0.125   0.689       2.04       2.5     0.694       0.197
47
Multivariate normal
pdf:
   f(y, μ, Σ) = (2π)^(−p/2) (det Σ)^(−1/2) exp{ −(y−μ)ᵀ Σ⁻¹ (y−μ)/2 }
Estimating equations, with weight w(x, θ) determined by U and θ = (μ, Σ):
   (1/n) Σ_i w(x_i, θ) (x_i − μ) = 0,
   (1/n) Σ_i w(x_i, θ) { (x_i − μ)(x_i − μ)ᵀ − Σ } = c_U(θ) Σ.
48
U-algorithm
Iteratively reweighted mean and covariance: θ_k = (μ_k, Σ_k) is updated by
   μ_{k+1} = Σ_i w(x_i, θ_k) x_i / Σ_i w(x_i, θ_k),
   Σ_{k+1} = Σ_i w(x_i, θ_k) (x_i − μ_k)(x_i − μ_k)ᵀ / c_U(θ_k).
Under a mild condition,
   L_U(θ_{k+1}) ≦ L_U(θ_k)   (k = 1, 2, …).
49
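A minimal sketch of the U-algorithm for the multivariate normal with β-power weights w(x, θ) ∝ f(x, θ)^β (added here; the exact normalizing constant c_U is model-dependent, and the (1+β) covariance factor below is the usual correction for the Gaussian case, stated as an assumption):

```python
import numpy as np

def u_algorithm(X, beta=0.1, iters=50):
    """Iteratively reweighted mean/covariance with beta-power weights.
    Simplified sketch: the slide's constant c_U is replaced by the
    (1 + beta) factor that makes the Gaussian model a fixed point."""
    mu, Sigma = X.mean(axis=0), np.cov(X.T)
    for _ in range(iters):
        d = X - mu
        m = np.linalg.solve(Sigma, d.T).T                        # Sigma^{-1}(x_i - mu)
        w = np.exp(-0.5 * beta * np.einsum('ij,ij->i', d, m))    # ∝ f(x; mu, Sigma)^beta
        w /= w.sum()
        mu = w @ X                                               # reweighted mean
        d = X - mu
        Sigma = (1 + beta) * (d * w[:, None]).T @ d              # reweighted covariance
    return mu, Sigma

# toy data: 95% N(0, I) plus 5% outliers near (8, 8)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(95, 2)), rng.normal(loc=8.0, size=(5, 2))])
print(u_algorithm(X, beta=0.2))   # close to mean 0 and identity covariance
```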
Simulation
ε-contamination: G(ε) = (1 − ε) N(μ₀, Σ₀) + ε N(μ₁, Σ₁).
KL error: D_KL(θ₀, θ̂) with the MLE θ̂.
[Table: two contamination settings G⁽¹⁾ and G⁽²⁾ and the KL errors of the MLE at ε = 0.05.]
50
β-power estimates vs. η-sigmoid estimates

Under G⁽¹⁾ (KL error):
   β:  0 → 39.24    0.01 → 35.30     0.05 → 20.93     0.10 → 8.91     0.20 → 12.40    0.30 → 31.64
   η:  0 → 39.24    0.0001 → 23.04   0.00025 → 6.70   0.0005 → 4.46   0.00075 → 4.64  0.001 → 6.04

Under G⁽²⁾ (KL error):
   β:  0 → 16.5     0.01 → 13.91     0.05 → 6.65      0.10 → 5.14     0.20 → 12.20    0.30 → 29.68
   η:  0 → 16.5     0.0005 → 3.36    0.00075 → 3.19   0.001 → 3.9     0.002 → 3.04    0.003 → 3.07
51
Plot of MLEs (no outliers)
[Figure: scatter of estimates of (σ₁₁, σ₁₂, σ₂₂) over 100 replications of samples of size 100 under the normal distribution: η-estimate (η = 0.0025), β-estimate (β = 0.1), MLE, and the true value (1, 0, 1).]
52
Plot of MLEs with outliers
[Figure: the same scatter over 100 replications of samples of size 100 under the contaminated distribution G⁽²⁾: η-estimate (η = 0.0025), β-estimate (β = 0.1), MLE, and the true value (1, 0, 1).]
53
Selection for tuning parameter
Squared loss function:
   Loss(β̂) = ∫ { f(y, θ̂_β) − g(y) }² dy
Cross-validation:
   CV(β) = ∫ f(y, θ̂_β)² dy − (2/n) Σ_i f(x_i, θ̂_β^(−i)),   β̂ = argmin_β CV(β).
Approximate the leave-one-out estimate by the influence function:
   θ̂_β^(−i) ≈ θ̂_β − (1/(n−1)) IF(x_i, θ̂_β).
54
Example: normal with mean (0, 0) and variance (.26 −.1 / −.1 .26).
[Figure: CV(β) over β; the minimizer is β̂ = 0.07.]
   signal MLE:              mean (−0.081, 0.054),  variance (.261 −.126 / −.126 .228)
   contaminated MLE:        mean (−0.184, 0.204),  variance (.383 −.263 / −.263 1.059)
   contaminated β-estimate: mean (−0.132, 0.086),  variance (.286 −.134 / −.134 .293)
The β-estimate under contamination stays close to the MLE on the clean signal.
55
What is ICA?
Cocktail party effect
Blind source separation
[Figure: sources s₁, s₂, …, s_m mixed into observations x₁, x₂, …, x_m.]
56
ICA model
Independent signals: s = (s₁, …, s_m)ᵀ with
   p(s) = p₁(s₁) ⋯ p_m(s_m),  E(s₁) = ⋯ = E(s_m) = 0.
Linear mixture of signals: x ∈ ℝ^m with s = W(x − μ), W ∈ ℝ^{m×m}, so the model density is
   f(x, W, μ) = |det(W)| p₁(w₁ᵀ(x − μ)) ⋯ p_m(w_mᵀ(x − μ)).
Aim: learn W from a dataset (x₁, …, x_n) in which p(s) = p₁(s₁) ⋯ p_m(s_m) is unknown.
57
ICA likelihood
Log-likelihood function:
   ℓ(W, μ) = Σ_{i=1}^n Σ_{j=1}^m log p_j( w_jᵀ(x_i − μ) ) + n log |det(W)|.
Estimating equation based on
   F(x, W, μ) = { I_m − h(W(x − μ)) (W(x − μ))ᵀ } W^{−ᵀ},
where
   h(s) = ( −(∂/∂s₁) log p₁(s₁), …, −(∂/∂s_m) log p_m(s_m) ).
Natural gradient algorithm: Amari et al. (1996).
58
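A compact sketch of the natural gradient update ΔW ∝ { I − h(s)sᵀ } W (added here; the nonlinearity h(s) = tanh(s) assumes super-Gaussian sources, so the demo uses Laplace signals):

```python
import numpy as np

def natural_gradient_ica(X, lr=0.05, iters=3000):
    """Batch natural-gradient ICA in the style of Amari et al. (1996);
    h(s) = tanh(s) is a common choice for super-Gaussian sources (assumption)."""
    n, m = X.shape
    X = X - X.mean(axis=0)                 # center the data
    W = np.eye(m)
    for _ in range(iters):
        S = X @ W.T                        # current source estimates s = W x
        H = np.tanh(S)                     # nonlinearity h(s)
        G = np.eye(m) - (H.T @ S) / n      # I - E[h(s) s^T]
        W += lr * G @ W                    # natural gradient step
    return W

# demo: two Laplace (super-Gaussian) sources mixed linearly
rng = np.random.default_rng(2)
S = rng.laplace(size=(1000, 2))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = S @ A.T
W = natural_gradient_ica(X)
print(W @ A)   # approximately a scaled permutation matrix if separation succeeded
```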
Beta-ICA
β-power estimating equation: the likelihood score F is reweighted by the β-th power of the model density,
   (1/n) Σ_i f(x_i, W, μ)^β F(x_i, W, μ) − B_β(W, μ) = 0,
with a bias-correction term B_β.
Decomposability: under the model, moments of the form E[ p(s)^β h(s_s) s_t ] factorize over the independent components (s ≠ t), so the β-power equation decomposes componentwise.
59
Likelihood ICA
[Figure: 150 signals from U(0,1) × U(0,1) and the data mixed by a 2×2 mixture matrix W; likelihood ICA recovers the signals.]
60
Non-robustness
[Figure: the same data with 50 Gaussian noise points from N(0,1) × N(0,1) added; the likelihood ICA solution is badly distorted.]
61
Power-ICA (β = 0.2)
[Figure: the minimum β-power ICA solution on the contaminated data; the separation is maintained.]
62
We studied the entropy and divergence obtained by imposing projective invariance on the Tsallis power entropy.
On the family of distributions with fixed mean and variance, the maximum projective power entropy distribution model is derived in a form that includes the normal distribution, the t-distribution, and the Wigner distribution.
A natural estimation method for the mean and variance on that model is proposed, and the estimators are shown to be the sample mean and the sample variance.
63
γ-cross entropy:
   C_γ(p, q) = −(1/γ) log ∫ p(x) q(x)^γ dx + (1/(1+γ)) log ∫ q(x)^(1+γ) dx
γ-diagonal entropy:
   H_γ(p) = C_γ(p, p) = −(1/(γ(1+γ))) log ∫ p(x)^(1+γ) dx
Projective γ-power divergence:
   D_γ(p, q) = C_γ(p, q) − H_γ(p) ≧ 0
Remark: γ = 0 reduces to the Boltzmann-Shannon entropy and the Kullback-Leibler divergence.
Cf. Tsallis (1988), Basu et al. (1998), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)
64
Max γ-entropy
γ-entropy H_γ. Equal moment space: fix the mean and variance, { p : E_p(x) = μ, V_p(x) = Σ }.
Max γ-entropy: the maximizer of H_γ on the equal moment space is the γ-model, a family that includes the normal distribution, t-distributions, and the Wigner distribution, as described above.
65
γ-estimation
Parametric model: M = { f(x, θ) }.
γ-loss function:
   L_γ(θ) = −(1/γ) log{ (1/n) Σ_i f(x_i, θ)^γ } + (1/(1+γ)) log ∫ f(x, θ)^(1+γ) dx
γ-estimator: θ̂_γ = argmin_θ L_γ(θ)
Gaussian case: the γ-estimator downweights outliers by the factor f(x, θ)^γ.
Remark: Cf. Fujisawa-Eguchi (2008).
66
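A minimal sketch of γ-estimation for the normal location model with known σ (added here; since the term log ∫ f^(1+γ) does not depend on μ in this case, it is dropped from the objective):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gamma_loss_mu(mu, x, sigma, gamma):
    """Gamma-loss (Fujisawa-Eguchi type) for a normal location model,
    sigma known; constants independent of mu are omitted."""
    f_gamma = np.exp(-0.5 * gamma * ((x - mu) / sigma) ** 2)  # ∝ f(x; mu, sigma)^gamma
    return -np.log(np.mean(f_gamma)) / gamma                  # -(1/gamma) log mean f^gamma

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(10, 1, 5)])  # 5% outliers at 10
res = minimize_scalar(gamma_loss_mu, bounds=(-5, 5),
                      args=(x, 1.0, 0.5), method='bounded')
print("gamma-estimate:", res.x, "  sample mean:", x.mean())
# the gamma-estimate stays near 0 while the sample mean is dragged toward the outliers
```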
γ-estimation on γ-model
γ-loss function on the γ-model.
Theorem. The γ-estimators of the mean and variance on the γ-model are the sample mean and the sample variance.
67
Loss function vs. model (γ-estimator on the γ′-model):

   loss \ model                 normal model           γ-model (robust model)
   0-loss (= log-likelihood)    MLE                    MLE
   γ-loss                       robust γ-estimator     moment estimator
68
Mixture of independent signals
Whitening
Preprocessed form
Statistical model
γ-ICA (Gamma-ICA)
69
Boosting learning algorithms
Outline:
- Boost learning for supervised/unsupervised learning
- U-loss functions
72
Which chameleon wins?
73
Pattern recognition…
74
What is pattern recognition?
□ There are many examples of pattern recognition.
□ In principle, pattern recognition is a prediction problem for a class label that encodes human interest and importance.
□ The human brain naturally labels phenomena with a few words, for example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect), ….
□ The brain intrinsically predicts the class label from empirical evidence.
75
Practice
● Character recognition  ● Voice recognition  ● Image recognition  ● Face recognition  ● Fingerprint recognition  ● Speaker recognition
☆ Credit scoring  ☆ Medical screening  ☆ Default prediction  ☆ Weather forecast  ☆ Treatment effect  ☆ Failure prediction  ☆ Infectious disease  ☆ Drug response
76
Binary classification
Feature vector x, class label y; a classifier h(x) is built from a score F(x).
In the binary case y ∈ {+1, −1} (G = 2).
77
Probability distribution
Let p(x, y) be a pdf of the random vector (X, Y), y ∈ {1, …, G}:
   p(B, C) = Σ_{y∈C} ∫_B p(x, y) dx.
Marginal density: p(x) = Σ_y p(x, y)   (x ∈ 𝒳).
Conditional densities:
   p(y|x) = p(x, y)/p(x),   p(x|y) = p(x, y)/p(y),
so that p(x, y) = p(y|x) p(x) = p(x|y) p(y). In particular,
   p(y = 1 | x) / p(y = −1 | x) = p(x, 1) / p(x, −1).
78
Error rate
A classifier h_F from feature vector x to class label y has error rate
   Err(h_F) = Pr( h_F(x) ≠ y ) = 1 − Σ_{i=1}^G Pr( h_F(x) = i, y = i ) = Σ_{i≠j} Pr( h_F(x) = j, y = i ).
Training error: the error rate on the training set.
Test error: the error rate on an independent test set.
79
False negative/positive

                        predicted positive       predicted negative
   actually positive    True Positive            False Negative (→ FNR)
   actually negative    False Positive (→ FPR)   True Negative
80
Bayes rule
Let p(y|x) be the conditional probability of y given x, and set F₀(x) = log{ p(y=1|x) / p(y=−1|x) }.
The classifier h_Bayes(x) = sgn(F₀(x)) is called the Bayes rule.
Theorem 1. For any classifier h,
   Err(h_Bayes) ≦ Err(h).
Note: The optimal classifier is equivalent to the likelihood ratio. However, in practice p(y|x) is unknown, so we have to learn h_Bayes(x) from the training data set.
81
Error rate for Bayes rule
The discriminant space associated with the Bayes classifier is { x : F₀(x) ≥ 0 } for y = 1 and its complement for y = −1. In general, when a classifier h associates with other discriminant spaces,
   Err(h_Bayes) ≦ Err(h).   Cf. McLachlan (2002)
82
Boost learning
Boosting by filtering (Schapire, 1990)
Bagging, arcing (bootstrap) (Breiman, Friedman, Hastie)
AdaBoost (Schapire, Freund, Bartlett, Lee)
Can weak learners be combined into a strong learner?
Weak learner (classifier): error rate slightly less than 0.5.
Strong learner (classifier): error rate close to that of the Bayes classifier.
83
Set of weak learners
Decision stumps, linear classifiers, neural nets, SVM, k-nearest neighbors.
Note: not strong individually, but a variety of characters.
84
Exponential loss function
Let {(x₁, y₁), …, (x_n, y_n)} be a training (example) set.
Empirical exponential loss function for a score function F(x):
   L_exp(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }.
Expected exponential loss function for a score function F(x):
   L_exp(F) = ∫_𝒳 Σ_{y∈{−1,1}} exp{ −y F(x) } q(y|x) q(x) dx,
where q(y|x) is the conditional distribution given x and q(x) is the pdf of x.
85
AdaBoost
1. Initial: w₁(i) = 1/n (i = 1, …, n), F₀(x) = 0.
2. For t = 1, …, T:
   (a) f_t = argmin_{f∈ℱ} ε_t(f), where the weighted error rate is
       ε_t(f) = Σ_i I( f(x_i) ≠ y_i ) w_t(i) / Σ_{i'} w_t(i').
   (b) α_t = (1/2) log{ (1 − ε_t(f_t)) / ε_t(f_t) }.
   (c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }.
3. Output sign(F_T(x)), where F_T(x) = Σ_{t=1}^T α_t f_t(x).
86
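A compact runnable sketch of these three steps with decision stumps as the weak learners (added here; NumPy assumed):

```python
import numpy as np

def stumps(X):
    """Enumerate decision stumps sign(s * (x_j - c)) over features, thresholds, signs."""
    return [(j, c, s) for j in range(X.shape[1])
            for c in np.unique(X[:, j]) for s in (1, -1)]

def adaboost(X, y, T=100):
    """AdaBoost following the three steps above (y in {+1, -1})."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    candidates = stumps(X)
    model = []
    for _ in range(T):
        # (a) weak learner minimizing the weighted error rate
        j, c, s = min(candidates, key=lambda st: np.sum(
            w * (np.where(st[2] * (X[:, st[0]] - st[1]) >= 0, 1, -1) != y)))
        pred = np.where(s * (X[:, j] - c) >= 0, 1, -1)
        eps = np.clip(np.sum(w * (pred != y)) / w.sum(), 1e-12, 1 - 1e-12)
        # (b) coefficient alpha_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        # (c) multiplicative weight update
        w *= np.exp(-alpha * y * pred)
        model.append((alpha, j, c, s))
    return model

def predict(model, X):
    F = sum(a * np.where(s * (X[:, j] - c) >= 0, 1, -1) for a, j, c, s in model)
    return np.sign(F)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)       # XOR-like labels
model = adaboost(X, y, T=100)
print("train error:", np.mean(predict(model, X) != y))
```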
Learning algorithm
[Diagram: training set {(x₁, y₁), …, (x_n, y_n)} → weights w₁(1), …, w₁(n) → f⁽¹⁾(x), α₁ → weights w₂(1), …, w₂(n) → f⁽²⁾(x), α₂ → … → weights w_T(1), …, w_T(n) → f⁽ᵀ⁾(x), α_T.
Final learner: F_T(x) = Σ_{t=1}^T α_t f_t(x).]
87
Update weight
Update: multiply w_t(i) by e^{α_t} if f_t(x_i) ≠ y_i, and by e^{−α_t} if f_t(x_i) = y_i.
The worst case: the update makes the weighted error rate of the current weak learner
   ε_{t+1}(f_t) = 1/2,
that is, under the new weights f_t is no better than random guessing.
88
Claim: ε_{t+1}(f_t) = 1/2.
Proof:
   ε_{t+1}(f_t) = Σ_i I( f_t(x_i) ≠ y_i ) w_{t+1}(i) / Σ_{i'} w_{t+1}(i'),
with w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }. Splitting the sum over misclassified and correctly classified points,
   ε_{t+1}(f_t) = e^{α_t} ε_t(f_t) / { e^{α_t} ε_t(f_t) + e^{−α_t} (1 − ε_t(f_t)) }.
Substituting e^{α_t} = √{ (1 − ε_t(f_t)) / ε_t(f_t) } gives
   ε_{t+1}(f_t) = √{ε_t(1−ε_t)} / { √{ε_t(1−ε_t)} + √{ε_t(1−ε_t)} } = 1/2.
89
Update in exponential loss
Consider F(x) → F(x) + α f(x). Then
   L_exp(F + αf) = (1/n) Σ_i exp{ −y_i F(x_i) } exp{ −α y_i f(x_i) }
                 = L_exp(F) { e^{−α} (1 − ε(f)) + e^{α} ε(f) },
where
   ε(f) = Σ_i I( f(x_i) ≠ y_i ) exp{ −y_i F(x_i) } / Σ_{i'} exp{ −y_{i'} F(x_{i'}) }.
90
Sequential optimization
   L_exp(F + αf) = L_exp(F) { (1 − ε(f)) e^{−α} + ε(f) e^{α} }.
Minimizing over α:
   α_opt = (1/2) log{ (1 − ε(f)) / ε(f) },
since
   (1 − ε(f)) e^{−α} + ε(f) e^{α} ≧ 2 √{ ε(f)(1 − ε(f)) },
with equality if and only if ε(f) e^{α} = (1 − ε(f)) e^{−α}.
91
AdaBoost = sequential minimization of the exponential loss
(a) f_t = argmin_{f∈ℱ} ε_t(f)
(b) α_t = argmin_{α∈ℝ} L_exp(F_{t−1} + α f_t) = (1/2) log{ (1 − ε_t(f_t)) / ε_t(f_t) }
(c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }
   min_{α∈ℝ} L_exp(F_{t−1} + α f_t) = 2 L_exp(F_{t−1}) √{ ε_t(f_t)(1 − ε_t(f_t)) }.
92
Simulation (completely separable)
[Figure: feature space [−1,1] × [−1,1] with a completely separable decision boundary.]
93
93
-1 -0.5 0.5 1
-1
-0.5
0.5
1
Set of linear classifiers
Linear classification machines
Random generation
94
Learning process
[Figure: decision regions over the iterations. Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10; Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.0895.]
Final decision boundary
[Figure: contour of F(x) and sign(F(x)).]
96
Learning curve
[Figure: training error versus iteration number (up to 250 iterations); the training curve decreases from about 0.2 to below 0.05.]
97
KL divergence
Feature space 𝒳, label set 𝒴.
For conditional distributions p(y|x), q(y|x) given x with common marginal density p(x) we can write
   D_KL(p, q) = ∫_𝒳 Σ_{y∈𝒴} p(y|x) log{ p(y|x) / q(y|x) } p(x) dx.
Note: D_KL is a nonnegative function of (p, q).
98
Twin of KL loss function
For a data distribution q(x, y) = q(x) q(y|x) we model the conditional distribution by a score function F(x, y), in unnormalized form
   m₁(y|x) = exp{ F(x, y) − b(x) }
or in normalized (softmax) form
   m₂(y|x) = exp{ F(x, y) } / Σ_{g=1}^G exp{ F(x, g) }.
Then
   D_KL(q, m) = ∫_𝒳 Σ_{y∈{−1,1}} q(y|x) log{ q(y|x) / m(y|x) } q(x) dx
leads to the exp loss for m₁ and the log loss for m₂ — the twin losses.
99
Bound for exponential loss
Empirical exp loss
Expected exp loss
Theorem.
100
Variational calculus
The minimizer of the expected exponential loss is
   F*(x) = (1/2) log{ q(y=1|x) / q(y=−1|x) },
so sign(F*(x)) is the Bayes rule.
101
On AdaBoost
FDA or logistic regression: a parametric approach to the Bayes classifier.
AdaBoost: also a parametric approach to the Bayes classifier, but the dimension and the basis functions are flexible.
The stopping time T can be selected according to the state of learning.
102
U-Boost
U-empirical loss function in the context of classification:
unnormalized U-loss and normalized U-loss.
103
U-Boost (binary)
Unnormalized U-loss.
Note: Bayes risk consistency — the optimal score F*(x) is Bayes risk consistent.
104
Eta-loss function
The eta-loss gives a regularized AdaBoost with margin.
105
EtaBoost for mislabels
Expected eta-loss function.
Optimal score: the variational argument leads to the optimal score under a mislabel model.
106
EtaBoost
1. Initial settings: w₁(i) = 1/n (i = 1, …, n), F₀(x) = 0.
2. For m = 1, …, T:
   (a) f_m = argmin_f ε_m(f), where ε_m(f) = Σ_i I( f(x_i) ≠ y_i ) w_m(i).
   (b) Determine the coefficient α_m by minimizing the eta-loss.
   (c) w_{m+1}(i) = w_m(i) exp{ −α_m y_i f_m*(x_i) }.
3. Output sign(F_T(x)), where F_T(x) = Σ_{t=1}^T α_t f_t(x).
107
A toy example
108
Examples partly mislabeled
109
AdaBoost vs. EtaBoost
[Figure: decision boundaries of AdaBoost (left) and EtaBoost (right) on the partly mislabeled examples.]
110
ROC curve, AUC
Discriminant function F : 𝒳 → ℝ with threshold c:
   FPR(c) = P( F(X) ≥ c | Y = 0 )   (false positive rate)
   TPR(c) = P( F(X) ≥ c | Y = 1 )   (true positive rate)
   ROC(F) = { (FPR(c), TPR(c)) : c ∈ ℝ }
111
Data (X, Y), X ∈ ℝ^p, Y ∈ {0, 1}. Prediction: positive, Ŷ = 1, if F(X) ≥ c; negative, Ŷ = 0, if F(X) < c.
   AUC(F) = ∫ TPR(c′) dFPR(c′),
the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the threshold c varies.
[Figure: ROC curve with the AUC shaded.]
Cf. Bamber (1975), Pepe et al. (2003)
112
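The AUC equals the Wilcoxon-Mann-Whitney statistic P( F(X₁) > F(X₀) ), with ties counted half; a minimal sketch (added here; NumPy assumed):

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """Empirical AUC = P(F(X1) > F(X0)) + 0.5 P(tie): the Wilcoxon-Mann-Whitney statistic."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

rng = np.random.default_rng(5)
neg = rng.normal(0.0, 1.0, 300)     # scores F(X) for Y = 0
pos = rng.normal(1.0, 1.0, 200)     # scores F(X) for Y = 1
print("AUC:", roc_auc(pos, neg))    # theoretical value Phi(1/sqrt(2)) ≈ 0.76
```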
Results of 5-fold cross-validation for λ (AUCBoost)
118
Relation between the two-sample problem and AUC boosting
The AUCBoost learning algorithm is designed so that the Wilcoxon statistic of the discriminant function F_t(x) obtained at step t,
   Ĉ(F_t) = (1/(n₀ n₁)) Σ_{k=1}^{n₀} Σ_{i=1}^{n₁} I( F_t(x_i¹) ≥ F_t(x_k⁰) ),
increases in the steepest direction at step t + 1:
   F_{t+1}(x) = F_t(x) + α_{t+1} f_{t+1}(x),   Ĉ(F_t) ≦ Ĉ(F_{t+1}).
In this way all the single-point (single-marker) analyses are integrated; in this sense any single-point analysis can be incorporated into the boosting. Eguchi-Komori (2009)
122
Partial AUC
   pAUC(c) = ∫_{FPR ≦ FPR(c)} TPR(c′) dFPR(c′) = P( F(X₁) ≥ F(X₀), F(X₀) ≥ c ),
the area under the ROC curve restricted to false positive rates below FPR(c).
[Figure: score densities F(x) of healthy and diseased subjects with threshold c; ROC curve with the pAUC region up to (FPR(c), TPR(c)) shaded.]
125
Robustness for noninformative genes
127
Robustness for noninformative genes
128
Generating function U
True positive probability
129
Boost learning algorithm
130
Bayes risk consistency
131
Benchmark data
132
True density
[Figure: surface plot of the true density on [−4, 4] × [−4, 4].]
W: dictionary
133
U-Boost learning
U-loss function:
   L_U(f) = −(1/n) Σ_{i=1}^n ξ(f(x_i)) + ∫ U(ξ(f(x))) dx   (f a pdf on ℝ^p).
Dictionary of density functions:
   W = { g(x) : g(x) ≧ 0, ∫ g(x) dx = 1 }.
Learning space = U-model: W_U* = co(W), with elements f(x, π) = Σ_λ π_λ g_λ(x); note f(x, (0, …, 0, 1, 0, …, 0)) = g_λ(x), so W ⊂ W_U*.
Goal: find f_U* = argmin_{f ∈ W_U*} L_U(f).
134
Inner step in the convex hull
[Figure: dictionary members g₁, …, g₇ with the current fit f(x) moving inside the convex hull W_U* = co(W).]
Goal: f̂(x) = α̂₁ f̂₁(x) + ⋯ + α̂_K f̂_K(x) approximating f_U* = argmin_{f ∈ W_U*} L_U(f).
135
Convergence of the algorithm
π-mixture update:
   f_{K+1}(x) = (1 − π_{K+1}) f_K(x) + π_{K+1} g_{K+1}(x).
One-step decrease:
   L_U(f_k) − L_U(f_{k+1}) ≧ D_U(f_{k+1}, f_k),
and summing over steps,
   L_U(f₁) − L_U(f_{k+1}) ≧ Σ_{j=1}^k D_U(f_{j+1}, f_j).
136
Simplified boosting
Initialize f₁ = argmin_{g∈W} L_U(g).
For k = 1, …, K − 1:
   g_{k+1} = argmin_{g∈W} L_U( (1 − c_{k+1}) f_k + c_{k+1} g ),
   f_{k+1}(x) = (1 − c_{k+1}) f_k(x) + c_{k+1} g_{k+1}(x),
with step lengths c_k (e.g. a constant c).
The final estimate f̂_K(x) is the resulting mixture Σ_{k=1}^K λ_k g_k(x).
137
Non-asymptotic bound
Theorem. Assume that the data distribution has a density g(x) and that
   (A)  sup { | ξ(f(x)) − ξ(f′(x)) | : f, f′ ∈ co(W), x } ≦ b_U.
Then we have
   E D_U(g, f̂_K) ≦ FA(W, g) + EE(W, n) + IE(W, K),
where
   FA(W, g) = inf_{f ∈ W_U*} D_U(g, f)   (functional approximation),
   EE(W, n) = 2 E sup_f | (1/n) Σ_i ξ(f(x_i)) − E ξ(f(X)) |   (estimation error),
   IE(W, K) = c² b_U / K   (iteration effect; c: step-length constant).
Remark: there is a trade-off between FA(W, g) and EE(W, n) as the dictionary W grows.
138
Lemmas
Lemma 1. Let f̃ be a density estimator and f* = Σ_{j=1}^N π_j f_j. Then
   L_U( (1−λ) f̃ + λ f* ) ≦ Σ_j π_j L_U( (1−λ) f̃ + λ f_j ) + λ² γ(f̃, f*, λ).
Lemma 2. If f̃ is in co(W), then γ(f̃, f*, λ) ≦ b_U²   (λ ∈ [0, 1]).
Lemma 3. Under assumption (A),
   L_U(f̂_K) − inf_f L_U(f) ≦ c² b_U / K.
Lemma 4. If L_U(f̂_K) ≦ inf_f L_U(f) + δ, then
   E D_U(g, f̂_K) ≦ inf_f D_U(g, f) + δ + 2 E sup_f | (1/n) Σ_i ξ(f(x_i)) − E ξ(f(X)) |.
139
[Figure: density estimates by KDE, RSDE, and the two U-Boost variants on three benchmark densities: C (skewed unimodal), L (quadrimodal), H (bimodal).]
140
Future of information geometry
Information geometry, at the crossroads of geometry, statistics, learning, and quantum theory, may deepen toward:
- Optimal transport
- Free probability
- Semiparametrics
- Compressed sensing
- Poincaré conjecture
- Dynamical systems
- Mathematical evolutionary theory
- Dempster-Shafer theory
141
Poincaré conjecture (1904)
Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere.
Ricci flow by Richard Hamilton (1981); Perelman (2002, 2003).
Optimal control theory due to Pontryagin and Bellman.
Optimal transport: the Wasserstein distance
   d_W(f, g)² = min { E‖X − Y‖² : X ~ f, Y ~ g }.
142
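In one dimension the optimal coupling for this quadratic cost is the monotone rearrangement, so the empirical W₂ distance between equal-size samples can be computed by sorting; a minimal sketch (added here):

```python
import numpy as np

def w2_empirical(x, y):
    """2-Wasserstein distance between two 1-D empirical measures of equal size:
    the optimal coupling pairs the sorted samples (monotone rearrangement)."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(6)
f = rng.normal(0.0, 1.0, 10000)
g = rng.normal(2.0, 1.0, 10000)
print(w2_empirical(f, g))   # ≈ 2: for equal-variance Gaussians W2 equals |mean difference|
```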
Optimal transport
Theorem (Brenier, 1991)
Talagrand inequality
Log Sobolev inequality
Optimal transport theory is being extended to general manifolds.
Geometers consider not a space itself, but a family of distributions on the space.
C. Villani received the 2010 Fields Medal for optimal transport theory.
143
Thank you
144