MAP Estimation Introduction
Bayes Theorem MAP Estimate Summary Bibliography
MAP Estimate and its Periphery
kzky
2011/4/24
kzky MAP Estimate and its Periphery 2011/4/24 1 / 22
Outline
1 Bayes Theorem: Two Views of Bayes Theorem; Chain Rule
2 MAP Estimate: Introduction; Ridge Regression; Logistic Regression; Log Linear Model; Loss Function; Gaussian Process
3 Summary: MAP Estimation Summary; Further and Other Topics
4 Bibliography
Two Views of Bayes Theorem

Bayes Theorem

p(x|y) = p(x, y) / p(y)
       = p(x) p(y|x) / p(y)
       = p(x) p(y|x) / ∑_x p(x) p(y|x)

beginning with the joint distribution

p(x, y) = p(x) p(y|x)
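The two views can be checked numerically on a small discrete example (a sketch in Python/NumPy; the joint-distribution numbers are illustrative, not from the slides):

```python
import numpy as np

# An illustrative joint distribution p(x, y) over x in {0, 1}, y in {0, 1}
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])  # rows index x, columns index y

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y = p_xy.sum(axis=0)             # marginal p(y)
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y|x)

# View 1: p(x|y=1) = p(x, y=1) / p(y=1)
post1 = p_xy[:, 1] / p_y[1]

# View 2: p(x|y=1) = p(x) p(y=1|x) / sum_x p(x) p(y=1|x)
num = p_x * p_y_given_x[:, 1]
post2 = num / num.sum()

print(post1, post2)  # the two views agree
```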
Chain Rule

p(x, y, z) = p(x) p(y, z|x)
           = p(x) p(y|x) p(z|x, y)
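The chain rule can likewise be verified on a random discrete joint (a sketch in Python/NumPy; the random tensor is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()  # a random joint distribution p(x, y, z)

p_x = p.sum(axis=(1, 2))                       # p(x)
p_y_given_x = p.sum(axis=2) / p_x[:, None]     # p(y|x)
p_z_given_xy = p / p.sum(axis=2, keepdims=True)  # p(z|x, y)

# chain rule: p(x, y, z) = p(x) p(y|x) p(z|x, y)
recon = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_xy
print(np.allclose(recon, p))  # True
```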
Introduction

Setting for MAP Estimate

Setting and assumptions in the usual supervised-learning framework:

D = {(x_i, y_i)}_{i=1}^n  i.i.d. ∼ P(x, y)

where x ∈ R^d, y ∈ {1, −1}, and P is an unknown joint distribution.

By Bayes theorem, and assuming x does not depend on θ,

p(θ|x, y) = p(θ) p(x, y|θ) / p(x, y)
          = p(θ) p(y|x, θ) p(x|θ) / (p(y|x) p(x))
          = p(θ) p(y|x, θ) / p(y|x)

posterior = prior × likelihood / marginal likelihood
Introduction

MAP Estimate

Maximum A Posteriori Estimate: maximize p(θ|D) with respect to θ.

Basically, take the log: since log is monotonic, we can equivalently maximize log p(θ|D) with respect to θ without loss of generality.
Formulation of the MAP Estimate

max_θ log p(θ|D)
  = max_θ ( log( p(θ) p(D|θ) ) − log p(D) )
  = max_θ ( log p(θ) + log ∏_i p(x_i, y_i|θ) )
  = max_θ ( log p(θ) + ∑_i log p(y_i|x_i, θ) + ∑_i log p(x_i|θ) )
  = max_θ ( log p(θ) + ∑_i log p(y_i|x_i, θ) )

(log p(D) does not depend on θ, and the ∑_i log p(x_i|θ) term drops out under the assumption that x does not depend on θ.)
Formulation of the Regularization Term

Assumption on the prior: w ≡ θ ∼ N(0, I/(2λ))

p(w) = 1 / ((2π)^{d/2} |I/(2λ)|^{1/2}) · exp( −(1/2) wᵀ (I/(2λ))⁻¹ w )
     = 1 / ((2π)^{d/2} |I/(2λ)|^{1/2}) · exp( −λ‖w‖₂² )

Dropping the w-independent normalization constant,

max_θ log p(θ|D) = max_w ( −λ‖w‖₂² + ∑_i log p(y_i|x_i, w) )
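That log p(w) equals −λ‖w‖₂² plus a w-independent constant can be checked numerically (a sketch in Python/NumPy; the dimension and λ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 3, 0.5
cov = np.eye(d) / (2 * lam)  # prior covariance I / (2*lambda)

def log_prior(w):
    # log N(w; 0, I/(2*lambda)), written out explicitly
    const = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(cov))
    return const - 0.5 * w @ np.linalg.inv(cov) @ w

# log p(w) + lam*||w||^2 should be the same constant for any w
w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
diff1 = log_prior(w1) + lam * w1 @ w1
diff2 = log_prior(w2) + lam * w2 @ w2
print(np.isclose(diff1, diff2))  # True
```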
Ridge Regression

Assumption on p(y|x, w):

p(y|x, w) = 1/(√(2π) σ) · exp( −(y − f(x))² / (2σ²) )

The MAP estimate becomes

max_w ( −λ‖w‖₂² − ∑_i (y_i − f(x_i))² / (2σ²) )
  = min_w ( λ‖w‖₂² + ∑_i (y_i − f(x_i))² / (2σ²) )
  = min_w ( λ‖w‖₂² + (1/(2σ²)) (y − Xw)ᵀ(y − Xw) )
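Setting the gradient of this objective to zero gives a closed-form solution, (XᵀX + 2λσ²I) w = Xᵀy. A minimal sketch in Python/NumPy (the synthetic data and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma, lam = 0.1, 1.0
y = X @ w_true + sigma * rng.standard_normal(n)

# Zero gradient of  lam*||w||^2 + (1/(2 sigma^2))||y - X w||^2
# gives  (X^T X + 2*lam*sigma^2 I) w = X^T y
w_map = np.linalg.solve(X.T @ X + 2 * lam * sigma**2 * np.eye(d), X.T @ y)
print(w_map)  # close to w_true for small noise and mild regularization
```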
Logistic Regression

Assumption on p(y|x, w):

p(y|x, w) = 1 / (1 + exp(−y f(x)))

The MAP estimate becomes

max_w ( −λ‖w‖₂² + ∑_i log( 1 / (1 + exp(−y_i f(x_i))) ) )
  = min_w ( λ‖w‖₂² + ∑_i log(1 + exp(−y_i f(x_i))) )
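This objective has no closed form, but it is convex and can be maximized by gradient ascent. A minimal sketch in Python/NumPy with f(x) = wᵀx (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0])
# labels y in {+1, -1}, mostly consistent with w_true
y = np.where(X @ w_true + 0.1 * rng.standard_normal(n) > 0, 1.0, -1.0)

lam, lr = 0.1, 0.01
w = np.zeros(d)
for _ in range(2000):
    margins = y * (X @ w)
    # gradient of  -lam*||w||^2 + sum_i log sigmoid(y_i * w^T x_i)
    grad = -2 * lam * w + X.T @ (y / (1 + np.exp(margins)))
    w += lr * grad  # gradient ascent on the log posterior
print(w)  # points in roughly the same direction as w_true
```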
Log Linear Model

Assumption on p(y|x, w):

p(y|x, w) = (1/Z_{x,w}) exp( wᵀφ(x, y) )

where Z_{x,w} normalizes exp(wᵀφ(x, y)) with respect to y.

The MAP estimate becomes

max_w ( −λ‖w‖₂² + ∑_i ( wᵀφ(x_i, y_i) − ln Z_{x_i,w} ) )
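A minimal sketch of the model itself in Python/NumPy, with a hypothetical block joint-feature map φ(x, y) (a common multiclass construction; all names and numbers here are illustrative):

```python
import numpy as np

# Hypothetical joint feature map for y in {0, 1, 2}: phi(x, y) places x in
# the block of the feature vector belonging to class y.
def phi(x, y, n_classes=3):
    f = np.zeros(n_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def log_linear_probs(x, w, n_classes=3):
    scores = np.array([w @ phi(x, y, n_classes) for y in range(n_classes)])
    scores -= scores.max()    # subtract max for numerical stability
    Z = np.exp(scores).sum()  # Z_{x,w}: normalizes over y
    return np.exp(scores) / Z

x = np.array([1.0, -0.5])
w = np.arange(6, dtype=float) * 0.1
p = log_linear_probs(x, w)
print(p, p.sum())  # a proper distribution over y (sums to 1)
```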
Loss Function

Figures: Loss Function
[figure comparing the loss functions; image not included in the transcript]
Gaussian Process

Main points of Gaussian Process

Differences from the previous discussion:
- do not take the log
- use only the Gaussian distribution (no other distribution appears)

Concept of GP:

x  i.i.d. ∼ N(x̄, Σ_x)
y  i.i.d. ∼ N(ȳ, Σ_y)

p(x) p(y) is also Gaussian.
Formulation of GP

Beginning with Bayes theorem,

p(w|x, y) ∝ p(y|x, w) p(w)

Convert into the form (w − w̄)ᵀ Σ (w − w̄):

p(w|D) ∝ exp( −(1/(2σ²)) (y − Xw)ᵀ(y − Xw) ) · exp( −(1/2) wᵀ Σ_w⁻¹ w )
       = exp( −(1/2) (w − w̄)ᵀ ( (1/σ²) XᵀX + Σ_w⁻¹ ) (w − w̄) )   (up to a constant factor)

where w̄ = (1/σ²) ( (1/σ²) XᵀX + Σ_w⁻¹ )⁻¹ Xᵀy
Notice

1. Expandable to kernelization:
   1. map x into a feature space (i.e., a high-dimensional space): x ↦ φ(x)
   2. only inner products of feature vectors occur (i.e., φᵀφ)
2. Solvable analytically:
   1. similarity is computed between all training samples x and a test sample x_new
   2. compute the Gram matrix, then only one matrix inverse is needed
3. Easy to implement (e.g., using a library to obtain an inverse matrix)

f(x_new) = ∑_i α_i k(x_new, x_i)

where α = (K + σ²I)⁻¹ y
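The prediction formula above can be sketched directly in Python/NumPy (the RBF kernel choice, its bandwidth, and the sine toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise
    return np.exp(-gamma * np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(30)
sigma2 = 0.05**2

K = rbf(X, X)                                            # Gram matrix
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)  # alpha = (K + sigma^2 I)^{-1} y

X_new = np.array([[0.5]])
f_new = rbf(X_new, X) @ alpha  # f(x_new) = sum_i alpha_i k(x_new, x_i)
print(f_new)  # close to sin(0.5)
```

Note that, as the slide says, the only matrix operation needed is solving one linear system with K + σ²I.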
MAP Estimation Summary

Advantages of the MAP estimate:
- able to find the global optimum if we choose a convex loss function
- easy to understand, and it casts another interpretation on the SVM
- some models (e.g., GP) are solvable analytically
- expandability:
  1. we can change p(y|x, θ) into various distributions
  2. easy to convert a supervised model into semi-supervised learning using the p(x|θ) term
  3. modifiable to sequential labeling (e.g., log linear model to Conditional Random Field)

* GP for ML is freely downloadable from http://www.gaussianprocess.org/gpml/chapters/
Further and Other Topics

Relationships:
1. Bayes Estimation: find a function of θ (but no guarantee of a global solution)
2. Maximum (Log) Likelihood (e.g., EM for GMM and HMM)
3. Naive Bayes: p(θ) ∼ Dirichlet and p(y|x, θ) ∼ multinomial

Semi-supervised extensions of the MAP estimate:
1. Entropy Regularization for Logistic Regression (NIPS 2005)
2. Null Category Noise Model for Gaussian Process (NIPS 2005)
Bibliography
1. S. Akaho, "Kernel Multivariate Analysis", Iwanami, 2009
2. D. Takamura and M. Okumura, "Introduction to Machine Learning for Natural Language Processing", Corona, 2010
3. X. Zhu, "Introduction to Semi-Supervised Learning", Morgan & Claypool Publishers, 2009
4. X. Zhu, "Semi-Supervised Learning Literature Survey", 2008
5. C. Rasmussen and C. Williams, "Gaussian Processes for Machine Learning", The MIT Press, 2006