Chapter 8: Model Inference and Averaging
Presented by Hui Fang
Basic Concepts
• Statistical inference
  – Using data to infer the distribution that generated the data
  – We observe $X_1, \ldots, X_n \sim F$. We want to infer (or estimate, or learn) F or some feature of F, such as its mean.
• Statistical model
  – A set of distributions (or a set of densities)
  – Parametric model
  – Non-parametric model
Statistical Model (1)
• Parametric model
  – A set that can be parameterized by a finite number of parameters
  – E.g., assume the data come from a normal distribution; the model is
    $\mathfrak{F} = \{ f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-\frac{1}{2\sigma^2}(x - \mu)^2\} : \mu \in \mathbb{R},\ \sigma > 0 \}$
  – A parametric model takes the form $\mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \}$
Statistical Model (2)
• Non-parametric model
  – A set that cannot be parameterized by a finite number of parameters
  – E.g., assume the data come from $\mathfrak{F} = \{ \text{all CDFs} \}$
• Probability density function (PDF), $f(x)$: $P(a \le X \le b) = \int_a^b f(x)\,dx$
• Cumulative distribution function (CDF), $F(x)$: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(s)\,ds$
Outline
• Model Inference
  – Maximum likelihood inference (8.2.2)
    • EM algorithm (8.5)
  – Bayesian inference (8.3)
    • Gibbs sampling (8.6)
  – Bootstrap (8.2.1, 8.2.3, 8.4)
• Model Averaging and Improvement
  – Bagging (8.7)
  – Bumping (8.9)
Parametric Inference
• Parametric models: $\mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \}$
• The problem of inference becomes the problem of estimating the parameter $\theta$
• Methods
  – Maximum likelihood inference
  – Bayesian inference
An Example of MLE
Suppose you have $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$, but you don't know $\mu$ or $\sigma^2$.

MLE: for which $(\mu, \sigma^2)$ are $x_1, x_2, \ldots, x_n$ most likely?

Log-likelihood:
$LL = \log P(x_1, x_2, \ldots, x_n \mid \mu, \sigma^2) = \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}(x_i - \mu)^2 \right)$

Setting the partial derivatives to zero:
$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$
$\frac{\partial LL}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$

Solving gives
$\hat{\mu}_{mle} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{mle} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{mle})^2$
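As a quick numerical check of these closed-form estimators, here is a minimal NumPy sketch; the sample size, seed, and true parameter values are illustrative choices, not from the slides:

```python
import numpy as np

def normal_mle(x):
    """Closed-form MLEs for a normal sample: the sample mean and the
    average squared deviation (note the divisor is n, not n - 1)."""
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()
    return mu_hat, sigma2_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # true mu = 5, sigma^2 = 4
mu_hat, sigma2_hat = normal_mle(x)
print(mu_hat, sigma2_hat)                      # both should land near 5 and 4
```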
A General MLE Strategy
Suppose $\theta = (\theta_1, \theta_2, \ldots, \theta_n)^T$ is a vector of parameters.
Task: find the MLE for $\theta$.
Likelihood function: $L(\theta; X) = P(x_1, \ldots, x_n \mid \theta)$
Log-likelihood function: $LL = \log(L(\theta; X))$
Maximum likelihood estimator: the value of $\theta$ that maximizes the likelihood function.
1. Write down $LL = \log(L(\theta; X))$.
2. Work out $\frac{\partial LL}{\partial \theta}$ using high-school calculus.
3. Solve the set of simultaneous equations $\frac{\partial LL}{\partial \theta_1} = 0, \ldots, \frac{\partial LL}{\partial \theta_n} = 0$.
4. Check that you are at a maximum.
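Step 3 rarely has a closed-form solution outside textbook cases; in practice one maximizes $LL$ numerically. A minimal SciPy sketch, reusing the normal example so the numeric optimum can be checked against the closed form (the data and the starting point are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_likelihood(theta):
    # Minimize -LL; optimize log(sigma) so sigma stays positive.
    mu, log_sigma = theta
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # matches x.mean() and x.std(ddof=0)
```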
Properties of MLE (?)
• The sampling distribution of the maximum likelihood estimator has a limiting normal distribution (p. 230):
  $\hat{\theta} \rightarrow N(\theta_0,\ i(\theta_0)^{-1})$
  where $\theta_0$ is the true value of $\theta$, $I(\theta)$ is the information matrix, and $i(\theta) = E[I(\theta)]$ is the Fisher information.
An Example for EM Algorithm (1)
• Model Y as a mixture of two normal distributions:
  $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$,
  $Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2$,
  where $\Delta \in \{0, 1\}$ with $P(\Delta = 1) = \pi$.
• The parameters are $\theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$.
• The log-likelihood based on the N training cases is
  $l(\theta; Z) = \sum_{i=1}^{N} \log[(1-\pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)]$
• A sum of terms sits inside the logarithm, which makes it difficult to maximize directly.
An Example for EM Algorithm (2)
• Consider unobserved latent variables $\Delta_i$: $\Delta_i = 1$ means $Y_i$ comes from model 2; otherwise it comes from model 1.
• If we knew the values of the $\Delta_i$, the log-likelihood would be
  $l_0(\theta; Z, \Delta) = \sum_{i=1}^{N} [(1-\Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)] + \sum_{i=1}^{N} [(1-\Delta_i)\log(1-\pi) + \Delta_i\log\pi]$
The EM algorithm (a runnable sketch follows this slide):
1. Take initial guesses for the parameters $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$.
2. Expectation step: compute the responsibilities
   $\hat{\gamma}_i = E(\Delta_i \mid \hat{\theta}, Z) = \Pr(\Delta_i = 1 \mid \hat{\theta}, Z) = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}, \quad i = 1, 2, \ldots, N$
3. Maximization step: compute the values of the parameters that maximize the log-likelihood given the $\hat{\gamma}_i$.
4. Iterate steps 2 and 3 until convergence.
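The following is a compact sketch of this EM loop; the simulated mixture data, the initialization, and the fixed 20-iteration loop (in place of a real convergence test) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Illustrative data: a true two-component mixture (pi = 0.4, N(0,1) and N(4,1)).
delta = rng.random(300) < 0.4
y = np.where(delta, rng.normal(4, 1, 300), rng.normal(0, 1, 300))

# 1. Initial guesses for the parameters.
mu1, mu2 = y.min(), y.max()
s1 = s2 = y.var()
pi = 0.5

for _ in range(20):                        # iterate E and M steps
    # 2. Expectation step: gamma_i = Pr(Delta_i = 1 | theta, y_i)
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
    gamma = p2 / (p1 + p2)
    # 3. Maximization step: weighted MLEs given the responsibilities
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()

print(pi, mu1, mu2, s1, s2)
```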
An Example for EM Algorithm (3)
[This slide showed a figure of the EM fit; the image is not recoverable from the transcript.]
Bayesian Inference
• Prior (knowledge before we see the data): $\Pr(\theta)$
• Sampling model: $\Pr(Z \mid \theta)$
• After observing data Z, we update our beliefs and form the posterior distribution:
  $\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\,d\theta} = \frac{L_n(\theta)\,\Pr(\theta)}{\int L_n(\theta)\,\Pr(\theta)\,d\theta} \propto L_n(\theta)\,\Pr(\theta)$
• Posterior is proportional to likelihood times prior!
• Doesn't it cause a problem to throw away the normalizing constant? No: we can always recover it, since $\int \Pr(\theta \mid Z)\,d\theta = 1$.
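To make "posterior ∝ likelihood × prior" concrete, here is a tiny conjugate example; the Beta prior, the binomial sampling model, and the counts are assumptions invented for illustration, not from the slides:

```python
from scipy.stats import beta

# Assumed setup: theta = a success probability with prior Beta(a, b),
# data Z = 7 successes in 10 trials, so likelihood ~ theta^7 (1-theta)^3.
a, b = 2.0, 2.0
successes, trials = 7, 10

# Posterior ~ likelihood * prior; for this conjugate pair the discarded
# normalizing constant is recovered analytically: Beta(a + 7, b + 3).
posterior = beta(a + successes, b + trials - successes)
print(posterior.mean())          # posterior mean of theta
print(posterior.interval(0.95))  # central 95% credible interval
```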
Prediction Using Inference
• Task: predict the value of a future observation $z^{new}$
• Bayesian approach:
  $\Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta$
• Maximum likelihood approach: $\Pr(z^{new} \mid \hat{\theta})$ (plug in the MLE)
MCMC (1)
• General problem: evaluating
  $E[h(\theta)] = \int h(\theta)\,\pi(\theta)\,d\theta$, where $\pi(\theta) = \Pr(\theta \mid Z)$,
  can be difficult.
• However, if we can draw samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \sim \pi(\theta)$, then we can estimate
  $E[h(\theta)] \approx \bar{h}_N = \frac{1}{N}\sum_{t=1}^{N} h(\theta^{(t)})$
• This is Monte Carlo (MC) integration.
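A minimal Monte Carlo integration sketch; taking $\pi$ to be a standard normal and $h(\theta) = \theta^2$ (so the true answer is 1) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
# Draw theta^(1), ..., theta^(N) from pi = N(0, 1), then average h(theta) = theta^2.
theta = rng.normal(size=100_000)
estimate = np.mean(theta ** 2)
print(estimate)   # approximates E[theta^2] = Var(theta) = 1
```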
MCMC (2)
• A stochastic process is an indexed collection of random variables $\{X^{(t)}\}$, where t may be time and X is a random variable.
• A Markov chain is generated by sampling
  $X^{(t)} \sim p(x \mid X^{(t-1)}), \quad t = 1, 2, \ldots$
  where p is the transition kernel.
• So $X^{(t)}$ depends only on $X^{(t-1)}$, not on $X^{(0)}, X^{(1)}, \ldots, X^{(t-2)}$.
• As $t \rightarrow \infty$, the Markov chain converges to its stationary distribution.
MCMC (3)
• Problem: how do we construct a Markov chain whose stationary distribution is our target distribution $\pi(\theta) = \pi(\theta_1, \ldots, \theta_k)$?
  This is called Markov chain Monte Carlo (MCMC).
• Two key objectives:
  1. Generate a sample from the joint probability distribution
  2. Estimate expectations using the generated sample averages (i.e., doing MC integration)
Gibbs Sampling (1)
• Purpose: draw from a joint distribution; the target is $\pi(\theta_1, \ldots, \theta_k)$
• Method: iterative conditional sampling
  – For each i, draw $\theta_i \sim \pi(\theta_i \mid \theta_{[-i]})$, where $\theta_{[-i]}$ denotes all components of $\theta$ except $\theta_i$
Gibbs Sampling (2)
• Suppose that $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$.
• Sample or update in turn:
  $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$
  $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)})$
  ......
  $\theta_k^{(t+1)} \sim \pi(\theta_k \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{k-1}^{(t+1)})$
• Always use the most recent values.
An Example for Conditional Sampling
• Target distribution:
  $f(x, y) \propto \binom{n}{x}\, y^{x+\alpha-1} (1-y)^{n-x+\beta-1}, \quad x = 0, 1, \ldots, n, \quad 0 \le y \le 1$
• How to draw samples? Alternate between the two conditionals:
  $x \sim f(x \mid y) = \mathrm{Binomial}(n, y)$
  $y \sim f(y \mid x) = \mathrm{Beta}(x + \alpha,\ n - x + \beta)$
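A sketch of the resulting Gibbs sampler; the values of $n$, $\alpha$, $\beta$, the chain length, and the burn-in are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, beta_ = 16, 2.0, 4.0
T = 10_000

x, y = 0, 0.5                                 # arbitrary starting point
xs, ys = np.empty(T, dtype=int), np.empty(T)
for t in range(T):
    x = rng.binomial(n, y)                    # draw x | y ~ Binomial(n, y)
    y = rng.beta(x + alpha, n - x + beta_)    # draw y | x ~ Beta(x+alpha, n-x+beta)
    xs[t], ys[t] = x, y

# After burn-in, (xs, ys) are (dependent) draws from the joint target;
# marginally, x follows a beta-binomial and y follows Beta(alpha, beta).
print(xs[1000:].mean(), ys[1000:].mean())
```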
Recall: Same Example for EM (1)
• Model Y as a mixture of two normal distributions:
  $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$,
  $Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2$, where $\Delta \in \{0, 1\}$ with $P(\Delta = 1) = \pi$.
• For simplicity, assume the parameters are $(\mu_1, \mu_2)$, with the other parameters held fixed.
Comparison between EM and Gibbs Sampling

EM:
1. Take initial guesses for the parameters $\hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2, \hat{\pi}$.
2. Expectation step: compute
   $\hat{\gamma}_i = E(\Delta_i \mid \hat{\theta}, Z) = \Pr(\Delta_i = 1 \mid \hat{\theta}, Z) = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1-\hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}$, $i = 1, 2, \ldots, N$.
3. Maximization step: compute the values of the parameters that maximize the log-likelihood given the $\hat{\gamma}_i$.
4. Iterate steps 2 and 3 until convergence.

Gibbs:
1. Take initial guesses for the parameters $\theta^{(0)} = \{\mu_1^{(0)}, \mu_2^{(0)}\}$.
2. Repeat for $t = 1, 2, \ldots$:
   (a) For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with
       $\Pr(\Delta_i^{(t)} = 1) = \hat{\gamma}_i(\theta^{(t-1)}) = \frac{\hat{\pi}\,\phi_{\mu_2^{(t-1)}}(y_i)}{(1-\hat{\pi})\,\phi_{\mu_1^{(t-1)}}(y_i) + \hat{\pi}\,\phi_{\mu_2^{(t-1)}}(y_i)}$.
   (b) Generate $\mu_1^{(t)} \sim N(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $\mu_2^{(t)} \sim N(\hat{\mu}_2, \hat{\sigma}_2^2)$, where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the means of the observations currently assigned to each component.
3. Continue step 2 until the joint distribution of $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ doesn't change.
Bootstrap (0)
• Basic idea:
  – Randomly draw datasets with replacement from the training data
  – Each bootstrap sample has the same size as the original training set
• Training sample: $X = (x_1, \ldots, x_n)$
  Bootstrap samples: $X^{*1}, X^{*2}, \ldots, X^{*B}$
Example for Bootstrap (1)
• A bioequivalence study: paired measurements $(Y, Z) \sim F$.
• The parameter of interest is the ratio $\theta = E(Z)/E(Y)$.
[The slide shows a table of the Y and Z measurements; the values are not recoverable from the transcript.]
Example for Bootstrap (2)
• We want to estimate $\theta = E(Z)/E(Y)$, where $(Y, Z) \sim F$.
• The estimator is $\hat{\theta} = \bar{Z}/\bar{Y} = -0.0713$.
• What is the accuracy of the estimator?
Bootstrap (1)
• The bootstrap was introduced as a general method for assessing the statistical accuracy of an estimator.
• Data: $X_1, \ldots, X_n \sim F$
• Statistic (any function of the data): $T_n = g(X_1, \ldots, X_n)$
• We want to know $V_F(T_n)$, the variance of $T_n$ under $F$.
• Real world: $F \Rightarrow X_1, \ldots, X_n \Rightarrow T_n = g(X_1, \ldots, X_n)$
• Bootstrap world: $\hat{F} \Rightarrow X_1^*, \ldots, X_n^* \Rightarrow T_n^* = g(X_1^*, \ldots, X_n^*)$
• Can $V_F(T_n)$ be estimated with $V_{\hat{F}}(T_n^*)$?
Bootstrap (2)---Detour
• Suppose we draw a sample $Y_1, \ldots, Y_B$ from a distribution $F$.
• By the law of large numbers, the sample mean converges to the true mean:
  $\bar{Y}_B = \frac{1}{B}\sum_{j=1}^{B} Y_j \rightarrow \int y\,dF(y) = E(Y)$ as $B \rightarrow \infty$
• Likewise, the sample variance converges to the true variance:
  $\frac{1}{B}\sum_{j=1}^{B}(Y_j - \bar{Y}_B)^2 = \frac{1}{B}\sum_{j=1}^{B} Y_j^2 - \bar{Y}_B^2 \rightarrow \int y^2\,dF(y) - \left(\int y\,dF(y)\right)^2 = V(Y)$
• So, given enough draws from $F$, we can approximate moments of $F$ by sample averages.
Bootstrap (3)
• Real world: $F \Rightarrow X_1, \ldots, X_n \Rightarrow T_n = g(X_1, \ldots, X_n)$
• Bootstrap world: $\hat{F} \Rightarrow X_1^*, \ldots, X_n^* \Rightarrow T_n^* = g(X_1^*, \ldots, X_n^*)$
Bootstrap variance estimation:
1. Draw $X_1^*, \ldots, X_n^* \sim \hat{F}_n$
2. Compute $T_n^* = g(X_1^*, \ldots, X_n^*)$
3. Repeat steps 1 and 2 B times to get $T_{n,1}^*, \ldots, T_{n,B}^*$
4. Let $v_{boot} = \frac{1}{B}\sum_{b=1}^{B}\left(T_{n,b}^* - \frac{1}{B}\sum_{r=1}^{B} T_{n,r}^*\right)^2$
Then $V_F(T_n) \approx V_{\hat{F}}(T_n^*) \approx v_{boot}$.
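Here is a sketch of this four-step recipe in NumPy, applied to a ratio statistic in the spirit of the bioequivalence example (the simulated data are invented; only the resampling pattern follows the slide):

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative paired data; T_n = mean(z) / mean(y) mirrors the ratio estimator.
y = rng.normal(10.0, 2.0, size=40)
z = rng.normal(9.5, 2.0, size=40)

def statistic(y, z):
    return z.mean() / y.mean()

B, n = 2000, len(y)
t_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)        # draw X*_1..X*_n from F_hat (resample rows)
    t_star[b] = statistic(y[idx], z[idx])

v_boot = t_star.var()                       # (1/B) * sum (T*_b - mean(T*))^2
print(statistic(y, z), np.sqrt(v_boot))     # estimate and its bootstrap standard error
```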
Bootstrap (4)
• Non-parametric bootstrap
  – Uses the raw data, not a specific parametric model, to generate new datasets
• Parametric bootstrap
  – Simulates new responses by adding Gaussian noise to the predicted values
  – Example from the book: given the fitted function $\hat{\mu}(x)$, where $\mu(x) = \sum_j \beta_j h_j(x)$ is a basis expansion, we simulate new $(x, y)$ pairs by
    $y_i^* = \hat{\mu}(x_i) + \epsilon_i^*; \quad \epsilon_i^* \sim N(0, \hat{\sigma}^2)$
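A minimal parametric-bootstrap sketch along these lines, with a straight-line fit standing in for the book's basis expansion (the data, the model, and B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
# Illustrative data and a simple linear fit standing in for mu_hat(x).
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=50)

coef = np.polyfit(x, y, deg=1)              # fit mu_hat(x)
mu_hat = np.polyval(coef, x)
sigma2_hat = np.mean((y - mu_hat) ** 2)     # residual variance estimate

B = 1000
boot_coefs = np.empty((B, 2))
for b in range(B):
    # Simulate new responses: y* = mu_hat(x) + eps*, eps* ~ N(0, sigma2_hat)
    y_star = mu_hat + rng.normal(0, np.sqrt(sigma2_hat), size=len(x))
    boot_coefs[b] = np.polyfit(x, y_star, deg=1)

print(boot_coefs.std(axis=0))               # bootstrap standard errors of the coefficients
```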
Bootstrap (5)---Summary
• Non-parametric bootstrap: makes no assumption about the underlying distribution
• Parametric bootstrap agrees with maximum likelihood
• The bootstrap distribution approximates the posterior distribution of the parameters under a non-informative prior (?)
Bagging (1)
• Bootstrap:
  – A way of assessing the accuracy of a parameter estimate or a prediction
• Bagging (Bootstrap Aggregating):
  – Uses the bootstrap samples themselves to improve the prediction, by averaging the estimators fit to them
• Original sample: $X = (x_1, \ldots, x_n)$
  Bootstrap samples: $X^{*1}, X^{*2}, \ldots, X^{*B}$
  Bootstrap estimators: $\hat{f}^{*1}(x), \hat{f}^{*2}(x), \ldots, \hat{f}^{*B}(x)$
• The bagged estimate is the average
  $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$
• For classification, bagging becomes majority voting (a code sketch follows below).
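A small sketch of bagging by majority vote; the decision-stump base learner, the synthetic data, and B = 50 are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
# Illustrative 1-D classification data.
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(0, 0.5, size=200) > 0).astype(int)

B, n = 50, len(X)
votes = np.zeros((B, n), dtype=int)
for b in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap sample X*b
    stump = DecisionTreeClassifier(max_depth=1)  # base learner f_hat*b
    stump.fit(X[idx], y[idx])
    votes[b] = stump.predict(X)

# Bagged classifier: majority vote across the B bootstrap estimators.
y_bag = (votes.mean(axis=0) > 0.5).astype(int)
print((y_bag == y).mean())                       # training accuracy of the bagged vote
```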
Bagging (2)
• Pros
  – The estimate can be significantly improved if the learning algorithm is unstable, i.e., when a small change to the training set causes a large change in the output hypothesis
  – Reduces the variance while leaving the bias unchanged
• Cons
  – Can degrade the performance of stable procedures (???)
  – The structure of the model is lost after bagging (e.g., a bagged tree is no longer a tree)
Bumping
• A stochastic-search flavor of model selection
  – Bootstrap Umbrella of Model Parameters
  – Sample datasets and fit a model to each, until we are satisfied (or tired)
• Original sample: $X = (x_1, \ldots, x_n)$
  Bootstrap samples: $X^{*1}, X^{*2}, \ldots, X^{*B}$
  Bootstrap estimators: $\hat{f}^{*1}(x), \hat{f}^{*2}(x), \ldots, \hat{f}^{*B}(x)$
• Compare the different models on the original training data and keep the best one:
  $\hat{b} = \arg\min_{b} \sum_{i=1}^{N} [y_i - \hat{f}^{*b}(x_i)]^2$
  (By convention, the original sample is included among the bootstrap samples, so bumping can pick the original fit.) A code sketch follows below.
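A matching sketch of bumping; here misclassification error on the original sample stands in for the squared error above, and the XOR-style data and tree learner are illustrative choices (ESL's own bumping illustration also uses trees on XOR-like data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
# Illustrative data where a single greedy fit can get stuck.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # XOR-like structure

B, n = 25, len(X)
best_model, best_err = None, np.inf
for b in range(B):
    # Include the original sample as one candidate (b == 0).
    idx = np.arange(n) if b == 0 else rng.integers(0, n, size=n)
    model = DecisionTreeClassifier(max_depth=2)
    model.fit(X[idx], y[idx])
    err = np.mean(model.predict(X) != y)     # compare each fit on the ORIGINAL data
    if err < best_err:
        best_model, best_err = model, err

print(best_err)   # training error of the bumped (best single) model
```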
Conclusions
• Maximum likelihood vs. Bayesian inference
• EM vs. Gibbs sampling
• Bootstrap
  – Bagging
  – Bumping