MAP Estimation Introduction
Bayes Theorem MAP Estimate Summary Bibliography
MAP Estimate and its Periphery
kzky
2011/4/24
kzky MAP Estimate and its Periphery 2011/4/24 1 / 22
Outline
1 Bayes Theorem: Two Views of Bayes Theorem; Chain Rule
2 MAP Estimate: Introduction; Ridge Regression; Logistic Regression; Log Linear Model; Loss Function; Gaussian Process
3 Summary: MAP Estimation Summary; Further and Other Topics
4 Bibliography
Two Views of Bayes Theorem

Bayes Theorem

p(x|y) = p(x, y) / p(y)
       = p(x) p(y|x) / p(y)
       = p(x) p(y|x) / ∑_x p(x) p(y|x)

beginning with the joint distribution

p(x, y) = p(x) p(y|x)
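The two views can be checked numerically on a small discrete example (a sketch in Python/NumPy; the joint-distribution numbers are illustrative, not from the slides):

```python
import numpy as np

# An illustrative joint distribution p(x, y) over x in {0, 1}, y in {0, 1}
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])  # rows index x, columns index y

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y = p_xy.sum(axis=0)             # marginal p(y)
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y|x)

# View 1: p(x|y=1) = p(x, y=1) / p(y=1)
post1 = p_xy[:, 1] / p_y[1]

# View 2: p(x|y=1) = p(x) p(y=1|x) / sum_x p(x) p(y=1|x)
num = p_x * p_y_given_x[:, 1]
post2 = num / num.sum()

print(post1, post2)  # the two views agree
```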
Chain Rule

p(x, y, z) = p(x) p(y, z|x)
           = p(x) p(y|x) p(z|x, y)
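The chain rule can likewise be verified on a random discrete joint (a sketch in Python/NumPy; the random tensor is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()  # a random joint distribution p(x, y, z)

p_x = p.sum(axis=(1, 2))                       # p(x)
p_y_given_x = p.sum(axis=2) / p_x[:, None]     # p(y|x)
p_z_given_xy = p / p.sum(axis=2, keepdims=True)  # p(z|x, y)

# chain rule: p(x, y, z) = p(x) p(y|x) p(z|x, y)
recon = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_xy
print(np.allclose(recon, p))  # True
```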
Introduction

Setting for MAP Estimate

Setting and assumptions in the usual supervised-learning framework:

D = {(x_i, y_i)}_{i=1}^n  i.i.d. ∼ P(x, y)

where x ∈ R^d, y ∈ {1, −1}, and P is an unknown joint distribution.

By Bayes theorem, and assuming x does not depend on θ,

p(θ|x, y) = p(θ) p(x, y|θ) / p(x, y)
          = p(θ) p(y|x, θ) p(x|θ) / (p(y|x) p(x))
          = p(θ) p(y|x, θ) / p(y|x)

posterior = prior × likelihood / marginal likelihood
Introduction

MAP Estimate

Maximum A Posteriori Estimate: maximize p(θ|D) with respect to θ.

Basically, take the log: since log is monotonic, we can equivalently maximize log p(θ|D) with respect to θ without loss of generality.
Formulation of the MAP Estimate

max_θ log p(θ|D)
  = max_θ ( log( p(θ) p(D|θ) ) − log p(D) )
  = max_θ ( log p(θ) + log ∏_i p(x_i, y_i|θ) )
  = max_θ ( log p(θ) + ∑_i log p(y_i|x_i, θ) + ∑_i log p(x_i|θ) )
  = max_θ ( log p(θ) + ∑_i log p(y_i|x_i, θ) )

(log p(D) does not depend on θ, and the ∑_i log p(x_i|θ) term drops out under the assumption that x does not depend on θ.)
Formulation of the Regularization Term

Assumption on the prior: w ≡ θ ∼ N(0, I/(2λ))

p(w) = 1 / ((2π)^{d/2} |I/(2λ)|^{1/2}) · exp( −(1/2) wᵀ (I/(2λ))⁻¹ w )
     = 1 / ((2π)^{d/2} |I/(2λ)|^{1/2}) · exp( −λ‖w‖₂² )

Dropping the w-independent normalization constant,

max_θ log p(θ|D) = max_w ( −λ‖w‖₂² + ∑_i log p(y_i|x_i, w) )
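That log p(w) equals −λ‖w‖₂² plus a w-independent constant can be checked numerically (a sketch in Python/NumPy; the dimension and λ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 3, 0.5
cov = np.eye(d) / (2 * lam)  # prior covariance I / (2*lambda)

def log_prior(w):
    # log N(w; 0, I/(2*lambda)), written out explicitly
    const = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(cov))
    return const - 0.5 * w @ np.linalg.inv(cov) @ w

# log p(w) + lam*||w||^2 should be the same constant for any w
w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
diff1 = log_prior(w1) + lam * w1 @ w1
diff2 = log_prior(w2) + lam * w2 @ w2
print(np.isclose(diff1, diff2))  # True
```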
Ridge Regression

Assumption on p(y|x, w):

p(y|x, w) = 1/(√(2π) σ) · exp( −(y − f(x))² / (2σ²) )

The MAP estimate becomes

max_w ( −λ‖w‖₂² − ∑_i (y_i − f(x_i))² / (2σ²) )
  = min_w ( λ‖w‖₂² + ∑_i (y_i − f(x_i))² / (2σ²) )
  = min_w ( λ‖w‖₂² + (1/(2σ²)) (y − Xw)ᵀ(y − Xw) )
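Setting the gradient of this objective to zero gives a closed-form solution, (XᵀX + 2λσ²I) w = Xᵀy. A minimal sketch in Python/NumPy (the synthetic data and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma, lam = 0.1, 1.0
y = X @ w_true + sigma * rng.standard_normal(n)

# Zero gradient of  lam*||w||^2 + (1/(2 sigma^2))||y - X w||^2
# gives  (X^T X + 2*lam*sigma^2 I) w = X^T y
w_map = np.linalg.solve(X.T @ X + 2 * lam * sigma**2 * np.eye(d), X.T @ y)
print(w_map)  # close to w_true for small noise and mild regularization
```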
Logistic Regression

Assumption on p(y|x, w):

p(y|x, w) = 1 / (1 + exp(−y f(x)))

The MAP estimate becomes

max_w ( −λ‖w‖₂² + ∑_i log( 1 / (1 + exp(−y_i f(x_i))) ) )
  = min_w ( λ‖w‖₂² + ∑_i log(1 + exp(−y_i f(x_i))) )
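This objective has no closed form, but it is convex and can be maximized by gradient ascent. A minimal sketch in Python/NumPy with f(x) = wᵀx (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0])
# labels y in {+1, -1}, mostly consistent with w_true
y = np.where(X @ w_true + 0.1 * rng.standard_normal(n) > 0, 1.0, -1.0)

lam, lr = 0.1, 0.01
w = np.zeros(d)
for _ in range(2000):
    margins = y * (X @ w)
    # gradient of  -lam*||w||^2 + sum_i log sigmoid(y_i * w^T x_i)
    grad = -2 * lam * w + X.T @ (y / (1 + np.exp(margins)))
    w += lr * grad  # gradient ascent on the log posterior
print(w)  # points in roughly the same direction as w_true
```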
Log Linear Model

Assumption on p(y|x, w):

p(y|x, w) = (1/Z_{x,w}) exp( wᵀφ(x, y) )

where Z_{x,w} normalizes exp(wᵀφ(x, y)) with respect to y.

The MAP estimate becomes

max_w ( −λ‖w‖₂² + ∑_i ( wᵀφ(x_i, y_i) − ln Z_{x_i,w} ) )
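A minimal sketch of the model itself in Python/NumPy, with a hypothetical block joint-feature map φ(x, y) (a common multiclass construction; all names and numbers here are illustrative):

```python
import numpy as np

# Hypothetical joint feature map for y in {0, 1, 2}: phi(x, y) places x in
# the block of the feature vector belonging to class y.
def phi(x, y, n_classes=3):
    f = np.zeros(n_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def log_linear_probs(x, w, n_classes=3):
    scores = np.array([w @ phi(x, y, n_classes) for y in range(n_classes)])
    scores -= scores.max()    # subtract max for numerical stability
    Z = np.exp(scores).sum()  # Z_{x,w}: normalizes over y
    return np.exp(scores) / Z

x = np.array([1.0, -0.5])
w = np.arange(6, dtype=float) * 0.1
p = log_linear_probs(x, w)
print(p, p.sum())  # a proper distribution over y (sums to 1)
```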
Loss Function

Figures: Loss Function
[figure comparing the loss functions; image not included in the transcript]
Gaussian Process

Main points of Gaussian Process

Differences from the previous discussion:
- do not take the log
- use only the Gaussian distribution (no other distribution appears)

Concept of GP:

x  i.i.d. ∼ N(x̄, Σ_x)
y  i.i.d. ∼ N(ȳ, Σ_y)

p(x) p(y) is also Gaussian.
Formulation of GP

Beginning with Bayes theorem,

p(w|x, y) ∝ p(y|x, w) p(w)

Convert into the form (w − w̄)ᵀ Σ (w − w̄):

p(w|D) ∝ exp( −(1/(2σ²)) (y − Xw)ᵀ(y − Xw) ) · exp( −(1/2) wᵀ Σ_w⁻¹ w )
       = exp( −(1/2) (w − w̄)ᵀ ( (1/σ²) XᵀX + Σ_w⁻¹ ) (w − w̄) )   (up to a constant factor)

where w̄ = (1/σ²) ( (1/σ²) XᵀX + Σ_w⁻¹ )⁻¹ Xᵀy
Notice

1. Expandable to kernelization:
   1. map x into a feature space (i.e., a high-dimensional space): x ↦ φ(x)
   2. only inner products of feature vectors occur (i.e., φᵀφ)
2. Solvable analytically:
   1. similarity is computed between all training samples x and a test sample x_new
   2. compute the Gram matrix, then only one matrix inverse is needed
3. Easy to implement (e.g., using a library to obtain an inverse matrix)

f(x_new) = ∑_i α_i k(x_new, x_i)

where α = (K + σ²I)⁻¹ y
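The prediction formula above can be sketched directly in Python/NumPy (the RBF kernel choice, its bandwidth, and the sine toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise
    return np.exp(-gamma * np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(30)
sigma2 = 0.05**2

K = rbf(X, X)                                            # Gram matrix
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)  # alpha = (K + sigma^2 I)^{-1} y

X_new = np.array([[0.5]])
f_new = rbf(X_new, X) @ alpha  # f(x_new) = sum_i alpha_i k(x_new, x_i)
print(f_new)  # close to sin(0.5)
```

Note that, as the slide says, the only matrix operation needed is solving one linear system with K + σ²I.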
MAP Estimation Summary

Advantages of the MAP estimate:
- able to find the global optimum if we choose a convex loss function
- easy to understand, and it casts another interpretation on the SVM
- some models (e.g., GP) are solvable analytically
- expandability:
  1. we can change p(y|x, θ) into various distributions
  2. easy to convert a supervised model into semi-supervised learning using the p(x|θ) term
  3. modifiable to sequential labeling (e.g., log linear model to Conditional Random Field)

* GP for ML is freely downloadable from http://www.gaussianprocess.org/gpml/chapters/
Further and Other Topics

Relationships:
1. Bayes Estimation: find a function of θ (but no guarantee of a global solution)
2. Maximum (Log) Likelihood (e.g., EM for GMM and HMM)
3. Naive Bayes: p(θ) ∼ Dirichlet and p(y|x, θ) ∼ multinomial

Semi-supervised extensions of the MAP estimate:
1. Entropy Regularization for Logistic Regression (NIPS 2005)
2. Null Category Noise Model for Gaussian Process (NIPS 2005)
Bibliography
1. S. Akaho, "Kernel Multivariate Analysis", Iwanami, 2009
2. D. Takamura and M. Okumura, "Introduction to Machine Learning for Natural Language Processing", Corona, 2010
3. X. Zhu, "Introduction to Semi-Supervised Learning", Morgan & Claypool Publishers, 2009
4. X. Zhu, "Semi-Supervised Learning Literature Survey", 2008
5. C. Rasmussen and C. Williams, "Gaussian Processes for Machine Learning", The MIT Press, 2006