Chapter 8: Model Inference and Averaging
Presented by Hui Fang
Basic Concepts
• Statistical inference
  – Using data to infer the distribution that generated the data
  – We observe $X_1, \ldots, X_n \sim F$. We want to infer (or estimate, or learn) F or some feature of F, such as its mean.
• Statistical model
  – A set of distributions (or a set of densities)
  – Parametric model
  – Non-parametric model
Statistical Model (1)
• Parametric model
  – A set that can be parameterized by a finite number of parameters
  – E.g., assume the data come from a normal distribution; the model is
    $\mathfrak{F} = \{ f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-\frac{1}{2\sigma^2}(x - \mu)^2\} : \mu \in \mathbb{R},\ \sigma > 0 \}$
  – A parametric model takes the form $\mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \}$
Statistical Model (2)
• Non-parametric model
  – A set that cannot be parameterized by a finite number of parameters
  – E.g., assume the data come from $\mathfrak{F} = \{ \text{all CDFs} \}$
• Probability density function (PDF), $f(x)$: $P(a \le X \le b) = \int_a^b f(x)\,dx$
• Cumulative distribution function (CDF), $F(x)$: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(s)\,ds$
Outline
• Model Inference
  – Maximum likelihood inference (8.2.2)
    • EM algorithm (8.5)
  – Bayesian inference (8.3)
    • Gibbs sampling (8.6)
  – Bootstrap (8.2.1, 8.2.3, 8.4)
• Model Averaging and Improvement
  – Bagging (8.7)
  – Bumping (8.9)
Parametric Inference
• Parametric models: $\mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \}$
• The problem of inference becomes the problem of estimating the parameter $\theta$
• Methods
  – Maximum likelihood inference
  – Bayesian inference
An Example of MLE
Suppose you have $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$, but you don't know $\mu$ or $\sigma^2$.

MLE: for which $(\mu, \sigma^2)$ are $x_1, x_2, \ldots, x_n$ most likely?

Log-likelihood:
$LL = \log P(x_1, x_2, \ldots, x_n \mid \mu, \sigma^2) = \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}(x_i - \mu)^2 \right)$

Setting the partial derivatives to zero:
$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$
$\frac{\partial LL}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$

Solving gives
$\hat{\mu}_{mle} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{mle} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{mle})^2$
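As a quick numerical check of these closed-form estimators, here is a minimal NumPy sketch; the sample size, seed, and true parameter values are illustrative choices, not from the slides:

```python
import numpy as np

def normal_mle(x):
    """Closed-form MLEs for a normal sample: the sample mean and the
    average squared deviation (note the divisor is n, not n - 1)."""
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()
    return mu_hat, sigma2_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # true mu = 5, sigma^2 = 4
mu_hat, sigma2_hat = normal_mle(x)
print(mu_hat, sigma2_hat)                      # both should land near 5 and 4
```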
A General MLE Strategy
Suppose $\theta = (\theta_1, \theta_2, \ldots, \theta_n)^T$ is a vector of parameters.
Task: find the MLE for $\theta$.
Likelihood function: $L(\theta; X) = P(x_1, \ldots, x_n \mid \theta)$
Log-likelihood function: $LL = \log(L(\theta; X))$
Maximum likelihood estimator: the value of $\theta$ that maximizes the likelihood function.
1. Write down $LL = \log(L(\theta; X))$.
2. Work out $\frac{\partial LL}{\partial \theta}$ using high-school calculus.
3. Solve the set of simultaneous equations $\frac{\partial LL}{\partial \theta_1} = 0, \ldots, \frac{\partial LL}{\partial \theta_n} = 0$.
4. Check that you are at a maximum.
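Step 3 rarely has a closed-form solution outside textbook cases; in practice one maximizes $LL$ numerically. A minimal SciPy sketch, reusing the normal example so the numeric optimum can be checked against the closed form (the data and the starting point are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_likelihood(theta):
    # Minimize -LL; optimize log(sigma) so sigma stays positive.
    mu, log_sigma = theta
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # matches x.mean() and x.std(ddof=0)
```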
Properties of MLE (?)
• The sampling distribution of the maximum likelihood estimator has a limiting normal distribution (p. 230):
  $\hat{\theta} \rightarrow N(\theta_0,\ i(\theta_0)^{-1})$
  where $\theta_0$ is the true value of $\theta$, $I(\theta)$ is the information matrix, and $i(\theta) = E[I(\theta)]$ is the Fisher information.
An Example for EM Algorithm (1)
• Model Y as a mixture of two normal distributions:
  $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$,
  $Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2$,
  where $\Delta \in \{0, 1\}$ with $P(\Delta = 1) = \pi$.
• The parameters are $\theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$.
• The log-likelihood based on the N training cases is
  $l(\theta; Z) = \sum_{i=1}^{N} \log[(1-\pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)]$
• A sum of terms sits inside the logarithm, which makes it difficult to maximize directly.
An Example for EM Algorithm (2)
• Consider unobserved latent variables $\Delta_i$: $\Delta_i = 1$ means $Y_i$ comes from model 2; otherwise it comes from model 1.
• If we knew the values of the $\Delta_i$, the log-likelihood would be
  $l_0(\theta; Z, \Delta) = \sum_{i=1}^{N} [(1-\Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)] + \sum_{i=1}^{N} [(1-\Delta_i)\log(1-\pi) + \Delta_i\log\pi]$
The EM algorithm (a runnable sketch follows this slide):
1. Take initial guesses for the parameters $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$.
2. Expectation step: compute the responsibilities
   $\hat{\gamma}_i = E(\Delta_i \mid \hat{\theta}, Z) = \Pr(\Delta_i = 1 \mid \hat{\theta}, Z) = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}, \quad i = 1, 2, \ldots, N$
3. Maximization step: compute the values of the parameters that maximize the log-likelihood given the $\hat{\gamma}_i$.
4. Iterate steps 2 and 3 until convergence.
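The following is a compact sketch of this EM loop; the simulated mixture data, the initialization, and the fixed 20-iteration loop (in place of a real convergence test) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Illustrative data: a true two-component mixture (pi = 0.4, N(0,1) and N(4,1)).
delta = rng.random(300) < 0.4
y = np.where(delta, rng.normal(4, 1, 300), rng.normal(0, 1, 300))

# 1. Initial guesses for the parameters.
mu1, mu2 = y.min(), y.max()
s1 = s2 = y.var()
pi = 0.5

for _ in range(20):                        # iterate E and M steps
    # 2. Expectation step: gamma_i = Pr(Delta_i = 1 | theta, y_i)
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
    gamma = p2 / (p1 + p2)
    # 3. Maximization step: weighted MLEs given the responsibilities
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()

print(pi, mu1, mu2, s1, s2)
```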
An Example for EM Algorithm (3)
[This slide showed a figure of the EM fit; the image is not recoverable from the transcript.]
Bayesian Inference
• Prior (knowledge before we see the data): $\Pr(\theta)$
• Sampling model: $\Pr(Z \mid \theta)$
• After observing data Z, we update our beliefs and form the posterior distribution:
  $\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\,d\theta} = \frac{L_n(\theta)\,\Pr(\theta)}{\int L_n(\theta)\,\Pr(\theta)\,d\theta} \propto L_n(\theta)\,\Pr(\theta)$
• Posterior is proportional to likelihood times prior!
• Doesn't it cause a problem to throw away the normalizing constant? No: we can always recover it, since $\int \Pr(\theta \mid Z)\,d\theta = 1$.
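To make "posterior ∝ likelihood × prior" concrete, here is a tiny conjugate example; the Beta prior, the binomial sampling model, and the counts are assumptions invented for illustration, not from the slides:

```python
from scipy.stats import beta

# Assumed setup: theta = a success probability with prior Beta(a, b),
# data Z = 7 successes in 10 trials, so likelihood ~ theta^7 (1-theta)^3.
a, b = 2.0, 2.0
successes, trials = 7, 10

# Posterior ~ likelihood * prior; for this conjugate pair the discarded
# normalizing constant is recovered analytically: Beta(a + 7, b + 3).
posterior = beta(a + successes, b + trials - successes)
print(posterior.mean())          # posterior mean of theta
print(posterior.interval(0.95))  # central 95% credible interval
```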
Prediction Using Inference
• Task: predict the value of a future observation $z^{new}$
• Bayesian approach:
  $\Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta$
• Maximum likelihood approach: $\Pr(z^{new} \mid \hat{\theta})$ (plug in the MLE)
MCMC (1)
• General problem: evaluating
  $E[h(\theta)] = \int h(\theta)\,\pi(\theta)\,d\theta$, where $\pi(\theta) = \Pr(\theta \mid Z)$,
  can be difficult.
• However, if we can draw samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \sim \pi(\theta)$, then we can estimate
  $E[h(\theta)] \approx \bar{h}_N = \frac{1}{N}\sum_{t=1}^{N} h(\theta^{(t)})$
• This is Monte Carlo (MC) integration.
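A minimal Monte Carlo integration sketch; taking $\pi$ to be a standard normal and $h(\theta) = \theta^2$ (so the true answer is 1) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
# Draw theta^(1), ..., theta^(N) from pi = N(0, 1), then average h(theta) = theta^2.
theta = rng.normal(size=100_000)
estimate = np.mean(theta ** 2)
print(estimate)   # approximates E[theta^2] = Var(theta) = 1
```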
MCMC (2)
• A stochastic process is an indexed collection of random variables $\{X^{(t)}\}$, where t may be time and X is a random variable.
• A Markov chain is generated by sampling
  $X^{(t)} \sim p(x \mid X^{(t-1)}), \quad t = 1, 2, \ldots$
  where p is the transition kernel.
• So $X^{(t)}$ depends only on $X^{(t-1)}$, not on $X^{(0)}, X^{(1)}, \ldots, X^{(t-2)}$.
• As $t \rightarrow \infty$, the Markov chain converges to its stationary distribution.
MCMC (3)
• Problem: how do we construct a Markov chain whose stationary distribution is our target distribution $\pi(\theta) = \pi(\theta_1, \ldots, \theta_k)$?
  This is called Markov chain Monte Carlo (MCMC).
• Two key objectives:
  1. Generate a sample from the joint probability distribution
  2. Estimate expectations using the generated sample averages (i.e., doing MC integration)
Gibbs Sampling (1)
• Purpose: draw from a joint distribution; the target is $\pi(\theta_1, \ldots, \theta_k)$
• Method: iterative conditional sampling
  – For each i, draw $\theta_i \sim \pi(\theta_i \mid \theta_{[-i]})$, where $\theta_{[-i]}$ denotes all components of $\theta$ except $\theta_i$
Gibbs Sampling (2)
• Suppose that $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$.
• Sample or update in turn:
  $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$
  $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$
  ......
  $\theta_k^{(t+1)} \sim \pi(\theta_k \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{k-1}^{(t+1)})$
• Always use the most recent values.
An Example for Conditional Sampling
• Target distribution:
  $f(x, y) \propto \binom{n}{x}\, y^{x+\alpha-1} (1-y)^{n-x+\beta-1}, \quad x = 0, 1, \ldots, n, \quad 0 \le y \le 1$
• How to draw samples? Alternate between the two conditionals:
  $x \sim f(x \mid y) = \mathrm{Binomial}(n, y)$
  $y \sim f(y \mid x) = \mathrm{Beta}(x + \alpha,\ n - x + \beta)$
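A sketch of the resulting Gibbs sampler; the values of $n$, $\alpha$, $\beta$, the chain length, and the burn-in are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, beta_ = 16, 2.0, 4.0
T = 10_000

x, y = 0, 0.5                                 # arbitrary starting point
xs, ys = np.empty(T, dtype=int), np.empty(T)
for t in range(T):
    x = rng.binomial(n, y)                    # draw x | y ~ Binomial(n, y)
    y = rng.beta(x + alpha, n - x + beta_)    # draw y | x ~ Beta(x+alpha, n-x+beta)
    xs[t], ys[t] = x, y

# After burn-in, (xs, ys) are (dependent) draws from the joint target;
# marginally, x follows a beta-binomial and y follows Beta(alpha, beta).
print(xs[1000:].mean(), ys[1000:].mean())
```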
Recall: Same Example for EM (1)
• Model Y as a mixture of two normal distributions:
  $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$,
  $Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2$, where $\Delta \in \{0, 1\}$ with $P(\Delta = 1) = \pi$.
• For simplicity, assume the parameters are $(\mu_1, \mu_2)$, with the other parameters held fixed.
Comparison between EM and Gibbs Sampling

EM:
1. Take initial guesses for the parameters $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$.
2. Expectation step: compute
   $\hat{\gamma}_i = E(\Delta_i \mid \hat{\theta}, Z) = \Pr(\Delta_i = 1 \mid \hat{\theta}, Z) = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}$, $i = 1, 2, \ldots, N$.
3. Maximization step: compute the values of the parameters that maximize the log-likelihood given the $\hat{\gamma}_i$.
4. Iterate steps 2 and 3 until convergence.

Gibbs:
1. Take initial guesses for the parameters $\theta^{(0)} = \{\mu_1^{(0)}, \mu_2^{(0)}\}$.
2. Repeat for $t = 1, 2, \ldots$:
   (a) For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with
       $\Pr(\Delta_i^{(t)} = 1) = \hat{\gamma}_i(\theta^{(t-1)}) = \frac{\hat{\pi}\,\phi_{\mu_2^{(t-1)}}(y_i)}{(1-\hat{\pi})\,\phi_{\mu_1^{(t-1)}}(y_i) + \hat{\pi}\,\phi_{\mu_2^{(t-1)}}(y_i)}$.
   (b) Generate $\mu_1^{(t)} \sim N(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $\mu_2^{(t)} \sim N(\hat{\mu}_2, \hat{\sigma}_2^2)$, where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the means of the observations currently assigned to each component.
3. Continue step 2 until the joint distribution of $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ doesn't change.
Bootstrap (0)
• Basic idea:
  – Randomly draw datasets with replacement from the training data
  – Each bootstrap sample has the same size as the original training set
• Training sample: $X = (x_1, \ldots, x_n)$
  Bootstrap samples: $X^{*1}, X^{*2}, \ldots, X^{*B}$
Example for Bootstrap (1)
• A bioequivalence study: paired measurements $(Y, Z) \sim F$.
• The parameter of interest is the ratio $\theta = E(Z)/E(Y)$.
[The slide shows a table of the Y and Z measurements; the values are not recoverable from the transcript.]
Example for Bootstrap (2)
• We want to estimate $\theta = E(Z)/E(Y)$, where $(Y, Z) \sim F$.
• The estimator is $\hat{\theta} = \bar{Z}/\bar{Y} = -0.0713$.
• What is the accuracy of the estimator?
Bootstrap (1)
• The bootstrap was introduced as a general method for assessing the statistical accuracy of an estimator.
• Data: $X_1, \ldots, X_n \sim F$
• Statistic (any function of the data): $T_n = g(X_1, \ldots, X_n)$
• We want to know $V_F(T_n)$, the variance of $T_n$ under $F$.
• Real world: $F \Rightarrow X_1, \ldots, X_n \Rightarrow T_n = g(X_1, \ldots, X_n)$
• Bootstrap world: $\hat{F} \Rightarrow X_1^*, \ldots, X_n^* \Rightarrow T_n^* = g(X_1^*, \ldots, X_n^*)$
• Can $V_F(T_n)$ be estimated with $V_{\hat{F}}(T_n^*)$?
Bootstrap (2)---Detour
• Suppose we draw a sample $Y_1, \ldots, Y_B$ from a distribution $F$.
• By the law of large numbers, the sample mean converges to the true mean:
  $\bar{Y}_B = \frac{1}{B}\sum_{j=1}^{B} Y_j \rightarrow \int y\,dF(y) = E(Y)$ as $B \rightarrow \infty$
• Likewise, the sample variance converges to the true variance:
  $\frac{1}{B}\sum_{j=1}^{B}(Y_j - \bar{Y}_B)^2 = \frac{1}{B}\sum_{j=1}^{B} Y_j^2 - \bar{Y}_B^2 \rightarrow \int y^2\,dF(y) - \left(\int y\,dF(y)\right)^2 = V(Y)$
• So, given enough draws from $F$, we can approximate moments of $F$ by sample averages.
Bootstrap (3)
• Real world: $F \Rightarrow X_1, \ldots, X_n \Rightarrow T_n = g(X_1, \ldots, X_n)$
• Bootstrap world: $\hat{F} \Rightarrow X_1^*, \ldots, X_n^* \Rightarrow T_n^* = g(X_1^*, \ldots, X_n^*)$
Bootstrap variance estimation:
1. Draw $X_1^*, \ldots, X_n^* \sim \hat{F}_n$
2. Compute $T_n^* = g(X_1^*, \ldots, X_n^*)$
3. Repeat steps 1 and 2 B times to get $T_{n,1}^*, \ldots, T_{n,B}^*$
4. Let $v_{boot} = \frac{1}{B}\sum_{b=1}^{B}\left(T_{n,b}^* - \frac{1}{B}\sum_{r=1}^{B} T_{n,r}^*\right)^2$
Then $V_F(T_n) \approx V_{\hat{F}}(T_n^*) \approx v_{boot}$.
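Here is a sketch of this four-step recipe in NumPy, applied to a ratio statistic in the spirit of the bioequivalence example (the simulated data are invented; only the resampling pattern follows the slide):

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative paired data; T_n = mean(z) / mean(y) mirrors the ratio estimator.
y = rng.normal(10.0, 2.0, size=40)
z = rng.normal(9.5, 2.0, size=40)

def statistic(y, z):
    return z.mean() / y.mean()

B, n = 2000, len(y)
t_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)        # draw X*_1..X*_n from F_hat (resample rows)
    t_star[b] = statistic(y[idx], z[idx])

v_boot = t_star.var()                       # (1/B) * sum (T*_b - mean(T*))^2
print(statistic(y, z), np.sqrt(v_boot))     # estimate and its bootstrap standard error
```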
Bootstrap (4)
• Non-parametric bootstrap
  – Uses the raw data, not a specific parametric model, to generate new datasets
• Parametric bootstrap
  – Simulates new responses by adding Gaussian noise to the predicted values
  – Example from the book: given the fitted function $\hat{\mu}(x)$, where $\mu(x) = \sum_j \beta_j h_j(x)$ is a basis expansion, we simulate new $(x, y)$ pairs by
    $y_i^* = \hat{\mu}(x_i) + \epsilon_i^*; \quad \epsilon_i^* \sim N(0, \hat{\sigma}^2)$
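A minimal parametric-bootstrap sketch along these lines, with a straight-line fit standing in for the book's basis expansion (the data, the model, and B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
# Illustrative data and a simple linear fit standing in for mu_hat(x).
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=50)

coef = np.polyfit(x, y, deg=1)              # fit mu_hat(x)
mu_hat = np.polyval(coef, x)
sigma2_hat = np.mean((y - mu_hat) ** 2)     # residual variance estimate

B = 1000
boot_coefs = np.empty((B, 2))
for b in range(B):
    # Simulate new responses: y* = mu_hat(x) + eps*, eps* ~ N(0, sigma2_hat)
    y_star = mu_hat + rng.normal(0, np.sqrt(sigma2_hat), size=len(x))
    boot_coefs[b] = np.polyfit(x, y_star, deg=1)

print(boot_coefs.std(axis=0))               # bootstrap standard errors of the coefficients
```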
Bootstrap (5)---Summary
• Non-parametric bootstrap: makes no assumption about the underlying distribution
• Parametric bootstrap agrees with maximum likelihood
• The bootstrap distribution approximates the posterior distribution of the parameters under a non-informative prior (?)
Bagging (1)
• Bootstrap:
  – A way of assessing the accuracy of a parameter estimate or a prediction
• Bagging (Bootstrap Aggregating):
  – Uses the bootstrap samples themselves to improve the prediction, by averaging the estimators fit to them
• Original sample: $X = (x_1, \ldots, x_n)$
  Bootstrap samples: $X^{*1}, X^{*2}, \ldots, X^{*B}$
  Bootstrap estimators: $\hat{f}^{*1}(x), \hat{f}^{*2}(x), \ldots, \hat{f}^{*B}(x)$
• The bagged estimate is the average
  $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$
• For classification, bagging becomes majority voting (a code sketch follows below).
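A small sketch of bagging by majority vote; the decision-stump base learner, the synthetic data, and B = 50 are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
# Illustrative 1-D classification data.
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)

B, n = 50, len(X)
votes = np.zeros((B, n), dtype=int)
for b in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap sample X*b
    stump = DecisionTreeClassifier(max_depth=1)  # base learner f_hat*b
    stump.fit(X[idx], y[idx])
    votes[b] = stump.predict(X)

# Bagged classifier: majority vote across the B bootstrap estimators.
y_bag = (votes.mean(axis=0) > 0.5).astype(int)
print((y_bag == y).mean())                       # training accuracy of the bagged vote
```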
Bagging (2)
• Pros
  – The estimate can be significantly improved if the learning algorithm is unstable, i.e., when a small change to the training set causes a large change in the output hypothesis
  – Reduces the variance while leaving the bias unchanged
• Cons
  – Can degrade the performance of stable procedures (???)
  – The structure of the model is lost after bagging (e.g., a bagged tree is no longer a tree)
Bumping
• A stochastic-search flavor of model selection
  – Bootstrap Umbrella of Model Parameters
  – Sample datasets and fit a model to each, until we are satisfied (or tired)
• Original sample: $X = (x_1, \ldots, x_n)$
  Bootstrap samples: $X^{*1}, X^{*2}, \ldots, X^{*B}$
  Bootstrap estimators: $\hat{f}^{*1}(x), \hat{f}^{*2}(x), \ldots, \hat{f}^{*B}(x)$
• Compare the different models on the original training data and keep the best one:
  $\hat{b} = \arg\min_{b} \sum_{i=1}^{N} [y_i - \hat{f}^{*b}(x_i)]^2$
  (By convention, the original sample is included among the bootstrap samples, so bumping can pick the original fit.) A code sketch follows below.
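A matching sketch of bumping; here misclassification error on the original sample stands in for the squared error above, and the XOR-style data and tree learner are illustrative choices (ESL's own bumping illustration also uses trees on XOR-like data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
# Illustrative data where a single greedy fit can get stuck.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # XOR-like structure

B, n = 25, len(X)
best_model, best_err = None, np.inf
for b in range(B):
    # Include the original sample as one candidate (b == 0).
    idx = np.arange(n) if b == 0 else rng.integers(0, n, size=n)
    model = DecisionTreeClassifier(max_depth=2)
    model.fit(X[idx], y[idx])
    err = np.mean(model.predict(X) != y)     # compare each fit on the ORIGINAL data
    if err < best_err:
        best_model, best_err = model, err

print(best_err)   # training error of the bumped (best single) model
```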
Conclusions
• Maximum likelihood vs. Bayesian inference
• EM vs. Gibbs sampling
• Bootstrap
  – Bagging
  – Bumping