MAP Estimation Introduction


MAP Estimate and its Periphery

kzky

2011/4/24

Outline

1. Bayes Theorem
   - Two Views of Bayes Theorem
   - Chain Rule
2. MAP Estimate
   - Introduction
   - Ridge Regression
   - Logistic Regression
   - Log Linear Model
   - Loss Function
   - Gaussian Process
3. Summary
   - MAP Estimation Summary
   - Further and Other Topics
4. Bibliography


Two Views of Bayes Theorem

Bayes Theorem:

p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(x)\,p(y|x)}{p(y)} = \frac{p(x)\,p(y|x)}{\sum_x p(x)\,p(y|x)}

beginning with the joint distribution

p(x, y) = p(x)\,p(y|x)
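
As a quick sanity check of the two views, here is a minimal sketch (with a made-up 3×2 joint distribution) that computes the posterior both ways:

```python
import numpy as np

# Hypothetical discrete joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])

p_x = p_xy.sum(axis=1)              # marginal p(x)
p_y = p_xy.sum(axis=0)              # marginal p(y)
p_y_given_x = p_xy / p_x[:, None]   # likelihood p(y|x)

# View 1: posterior from the joint, p(x|y) = p(x, y) / p(y)
post1 = p_xy / p_y
# View 2: prior times likelihood, normalized over x
num = p_x[:, None] * p_y_given_x
post2 = num / num.sum(axis=0)

assert np.allclose(post1, post2)    # both views agree
```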

Chain Rule

p(x, y, z) = p(x)\,p(y, z|x) = p(x)\,p(y|x)\,p(z|x, y)


Introduction

Setting for MAP Estimate

Setting in the usual supervised learning, and assumptions:

D = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P(x, y)

where x \in \mathbb{R}^d, y \in \{1, -1\}, and P is an unknown joint distribution.

With Bayes Theorem and the assumption that x does not depend on \theta:

p(\theta|x, y) = \frac{p(\theta)\,p(x, y|\theta)}{p(x, y)} = \frac{p(\theta)\,p(y|x, \theta)\,p(x|\theta)}{p(y|x)\,p(x)} = \frac{p(\theta)\,p(y|x, \theta)}{p(y|x)}

posterior = (prior × likelihood) / (marginal likelihood)


MAP Estimate

Maximum A Posteriori Estimate: maximize p(\theta|D) with respect to \theta.

Basically, take the log: maximizing \log p(\theta|D) with respect to \theta loses no generality, because log is monotonic.

Formulation of the MAP Estimate

\max_\theta \log p(\theta|D)
  = \max_\theta \left( \log\left(p(\theta)\,p(D|\theta)\right) - \log p(D) \right)
  = \max_\theta \left( \log p(\theta) + \log \prod_i p(x_i, y_i|\theta) \right)
  = \max_\theta \left( \log p(\theta) + \sum_i \log p(y_i|x_i, \theta) + \sum_i \log p(x_i|\theta) \right)
  = \max_\theta \left( \log p(\theta) + \sum_i \log p(y_i|x_i, \theta) \right)

The \log p(D) term and the \sum_i \log p(x_i|\theta) term are constant in \theta (the latter by the assumption that x does not depend on \theta), so both can be dropped.

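To make the objective concrete, here is a minimal sketch (a hypothetical Beta–Bernoulli example, not from the slides) that maximizes \log p(\theta) + \sum_i \log p(y_i|\theta) numerically and compares the result with the known closed-form MAP estimate:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical example: Bernoulli likelihood with a Beta(a, b) prior on theta.
y = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # observed coin flips
a, b = 2.0, 2.0                           # Beta prior hyperparameters

def neg_log_posterior(theta):
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
    log_lik = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
    return -(log_prior + log_lik)         # minimize the negative log posterior

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
k, n = y.sum(), len(y)
closed_form = (k + a - 1) / (n + a + b - 2)   # known closed-form MAP
print(res.x, closed_form)                      # both ≈ 0.7
```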

Formulation only on the Regularization Term

Assumption on the prior:

w \equiv \theta \sim N\!\left(0, \frac{I}{2\lambda}\right)

p(w) = \frac{1}{(2\pi)^{d/2} \left|\frac{I}{2\lambda}\right|^{1/2}} \exp\left( -\frac{1}{2} w^T \left(\frac{I}{2\lambda}\right)^{-1} w \right) = \frac{1}{(2\pi)^{d/2} \left|\frac{I}{2\lambda}\right|^{1/2}} \exp\left( -\lambda \|w\|_2^2 \right)

so, dropping constants,

\max_\theta \log p(\theta|D) = \max_w \left( -\lambda \|w\|_2^2 + \sum_i \log p(y_i|x_i, w) \right)

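A minimal numerical check that the N(0, I/(2\lambda)) prior contributes exactly the -\lambda\|w\|_2^2 term (plus a constant) to the log posterior:

```python
import numpy as np
from scipy.stats import multivariate_normal

lam, d = 0.5, 3
w = np.array([0.3, -1.2, 0.7])          # an arbitrary weight vector
cov = np.eye(d) / (2 * lam)             # prior covariance I / (2*lambda)

log_prior = multivariate_normal(mean=np.zeros(d), cov=cov).logpdf(w)
const = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(cov))

# log p(w) = -lambda * ||w||^2 + const
assert np.isclose(log_prior, -lam * w @ w + const)
```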

Ridge Regression

Assumption on p(y|x, w):

p(y|x, w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(y - f(x))^2}{2\sigma^2} \right]

The MAP Estimate becomes

\max_w \left( -\lambda \|w\|_2^2 - \sum_i \frac{(y_i - f(x_i))^2}{2\sigma^2} \right)
  = \min_w \left( \lambda \|w\|_2^2 + \sum_i \frac{(y_i - f(x_i))^2}{2\sigma^2} \right)
  = \min_w \left( \lambda \|w\|_2^2 + \frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) \right)

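Setting the gradient of the last objective to zero gives the closed form w = (X^T X + 2\lambda\sigma^2 I)^{-1} X^T y; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 50, 3, 0.1, 1.0
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)   # synthetic targets

# Minimizer of lam*||w||^2 + (1/(2*sigma^2)) * ||y - X w||^2
w_map = np.linalg.solve(X.T @ X + 2 * lam * sigma**2 * np.eye(d), X.T @ y)
print(w_map)   # close to w_true, shrunk slightly toward zero
```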

Logistic Regression

Assumption on p(y|x, w):

p(y|x, w) = \frac{1}{1 + \exp(-y f(x))}

The MAP Estimate becomes

\max_w \left( -\lambda \|w\|_2^2 + \sum_i \log \frac{1}{1 + \exp(-y_i f(x_i))} \right)
  = \min_w \left( \lambda \|w\|_2^2 + \sum_i \log\left(1 + \exp(-y_i f(x_i))\right) \right)

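The logistic objective has no closed form, but it is convex; a minimal sketch (synthetic data, assuming the linear model f(x) = w^T x) that minimizes it by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam, lr = 200, 2, 0.1, 0.01
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([2.0, -1.0]))        # labels in {1, -1}

w = np.zeros(d)
for _ in range(1000):
    m = y * (X @ w)                           # margins y_i * f(x_i)
    # gradient of lam*||w||^2 + sum_i log(1 + exp(-m_i))
    grad = 2 * lam * w - X.T @ (y / (1 + np.exp(m)))
    w -= lr * grad
print(w)   # roughly proportional to (2, -1)
```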

Log Linear Model

Assumption on p(y|x, w):

p(y|x, w) = \frac{1}{Z_{x,w}} \exp\left( w^T \phi(x, y) \right)

where Z_{x,w} normalizes \exp(w^T \phi(x, y)) with respect to y.

The MAP Estimate becomes

\max_w \left( -\lambda \|w\|_2^2 + \sum_i \left( w^T \phi(x_i, y_i) - \ln Z_{x_i,w} \right) \right)

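A minimal sketch of this objective for a small label set (the feature map phi below is a hypothetical choice, stacking x into a per-label block):

```python
import numpy as np

labels = (0, 1, 2)

def phi(x, y):
    f = np.zeros(2 * len(labels))
    f[2 * y : 2 * y + 2] = x        # block of features for label y
    return f

def log_p(y, x, w):
    scores = np.array([w @ phi(x, yy) for yy in labels])
    log_Z = np.log(np.exp(scores).sum())      # log Z_{x,w}, summed over y
    return w @ phi(x, y) - log_Z

lam = 0.1
data = [(np.array([1.0, -0.5]), 2), (np.array([0.3, 0.8]), 0)]
w = np.zeros(2 * len(labels))
objective = -lam * (w @ w) + sum(log_p(y, x, w) for x, y in data)
print(objective)   # -2*log(3) at w = 0: every label equally likely
```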

Loss Function

[Figure: loss functions]
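
The figure itself did not survive extraction; a sketch that reproduces the comparison for the two losses derived above, as functions of the margin m = y·f(x):

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 200)                       # margin y * f(x)
plt.plot(m, np.log(1 + np.exp(-m)), label="logistic loss")
plt.plot(m, (1 - m) ** 2, label="squared loss")   # (y - f)^2 = (1 - m)^2 for y in {1,-1}
plt.xlabel("margin y f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```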

Gaussian Process

Main points of the Gaussian Process

Differences from the previous discussion:

- do not take the log
- do not use any distribution other than the Gaussian

Concept of GP:

x \overset{\text{i.i.d.}}{\sim} N(\bar{x}, \Sigma_x), \quad y \overset{\text{i.i.d.}}{\sim} N(\bar{y}, \Sigma_y)

p(x)\,p(y) is also Gaussian

Formulation of GP

Beginning with Bayes Theorem:

p(\theta|x, y) \propto p(y|x, w)\,p(w)

complete the square to convert into the quadratic form (w - \bar{w})^T A\,(w - \bar{w}):

p(\theta|D) \propto \exp\left( -\frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) \right) \exp\left( -\frac{1}{2} w^T \Sigma_w^{-1} w \right)
  = \exp\left[ -\frac{1}{2} (w - \bar{w})^T \left( \frac{1}{\sigma^2} X^T X + \Sigma_w^{-1} \right) (w - \bar{w}) \right] \times \text{const}

where \bar{w} = \frac{1}{\sigma^2} \left( \frac{1}{\sigma^2} X^T X + \Sigma_w^{-1} \right)^{-1} X^T y

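A minimal numerical check of the completed square: the stated \bar{w} is the point where the gradient of the log posterior vanishes (synthetic data, \Sigma_w = I assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 40, 3, 0.2
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + sigma * rng.normal(size=n)
Sigma_w = np.eye(d)                                   # prior covariance (assumed)

A = X.T @ X / sigma**2 + np.linalg.inv(Sigma_w)       # posterior precision
w_bar = np.linalg.solve(A, X.T @ y / sigma**2)        # posterior mean

# The gradient of the log posterior vanishes at w_bar:
grad = X.T @ (y - X @ w_bar) / sigma**2 - np.linalg.inv(Sigma_w) @ w_bar
assert np.allclose(grad, np.zeros(d), atol=1e-6)
```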

Notice

1. Expandable to kernelization:
   1. map x onto a feature space (i.e., a high-dimensional space): x \mapsto \phi(x)
   2. inner products of feature vectors appear (i.e., \phi^T \phi)
2. Solvable analytically:
   1. similarity calculation between all training samples x and the test sample x_{new}
   2. compute the Gram matrix, then only a matrix inverse is needed
   3. easy to implement (e.g., using a library to obtain the inverse matrix)

f(x_{new}) = \sum_i \alpha_i\,k(x_{new}, x_i), \quad \text{where } \alpha = (K + \sigma^2 I)^{-1} y
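
A minimal sketch of the analytic solution above on 1-D synthetic data, assuming an RBF kernel:

```python
import numpy as np

def k(a, b, ell=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / ell**2)   # RBF kernel (assumed)

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 30)                         # training inputs
y = np.sin(x) + 0.1 * rng.normal(size=30)         # noisy targets
sigma = 0.1

K = k(x[:, None], x[None, :])                     # Gram matrix
alpha = np.linalg.solve(K + sigma**2 * np.eye(30), y)

x_new = 2.5
f_new = np.sum(alpha * k(x_new, x))               # prediction, ≈ sin(2.5)
print(f_new)
```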


MAP Estimation Summary

Good points of the MAP Estimate:

- able to find the global minimum if we choose a convex loss function
- easy to understand, and it casts another interpretation onto the SVM
- some models (e.g., GP) are solvable analytically
- expandability:
  1. we can change p(y|x, \theta) into various distributions
  2. easy to convert a supervised model into SSL using the p(x|\theta) term
  3. modifiable for sequential labeling (e.g., log linear model to Conditional Random Field)

*GP for ML is freely downloadable from http://www.gaussianprocess.org/gpml/chapters/


Further and Other Topics

Relationships:

1. Bayes Estimation: finds a distribution of \theta (but no guarantee of a global solution)
2. Maximum (Log) Likelihood (e.g., EM for GMM and HMM)
3. Naive Bayes: p(\theta) \sim Dirichlet and p(y|x, \theta) \sim multinomial

SSL-ize (expansion of the MAP Estimate case):

1. Entropy Regularization for Logistic Regression (NIPS 2005)
2. Null Category Noise Model for Gaussian Process (NIPS 2005)


Bibliography

1. S. Akaho, "Kernel Multivariate Analysis", Iwanami, 2009
2. D. Takamura and M. Okumura, "Introduction to Machine Learning for Natural Language Processing", Corona, 2010
3. X. Zhu, "Introduction to Semi-Supervised Learning", Morgan & Claypool Publishers, 2009
4. X. Zhu, "Semi-Supervised Learning Literature Survey", 2008
5. C. Rasmussen and C. Williams, "Gaussian Processes for Machine Learning", The MIT Press, 2006