1 Bayesian Discrete Choice.
1. Quick Review of Bayes
2. Bayesian analysis of linear regression
3. Bayesian MNP
4. Target Marketing
2 Quick Review of Bayes.
• See Cameron and Trivedi for an overview of Bayes.
• In Bayesian econometrics, the econometrician acts as a rational decision maker, just like the agents in economic theory.
• The econometrician starts off with a prior distribution p(θ) about the model parameters.
• The econometrician observes some data y = [y1, ..., yn]
• The econometrician has a model, f(y|θ), which is the probability of observing y conditional on the parameters θ.
• The econometrician's posterior probability, by Bayes' Theorem, is:
p(θ|y) = p(θ)f(y|θ) / ∫ p(θ)f(y|θ) dθ
• In the last decade, there has been an explosion in the applications of Bayesian approaches in statistics and econometrics.
• In Markov chain Monte Carlo (MCMC), the econometrician simulates the posterior distribution p(θ|y).
• This involves simulating a Markov chain where the invariant (or long run) distribution is exactly equal to the posterior.
• The output of this simulation is a sequence of pseudo-random numbers θ(1), ..., θ(S).
• Let f(θ(1), ..., θ(S)) denote the density that puts weight 1/S on each simulation draw.
• Then under suitable regularity conditions
f(θ(1), ..., θ(S)) →d p(θ|y)
• This posterior distribution expresses the econometrician's beliefs about the parameters after seeing the data.
• We could use the posterior to:
1. Construct a 95 percent credible set (i.e. a set
that has 95 percent posterior probability). This
is analogous to a confidence interval.
2. Simulate the distribution of functions of the parameters, g(θ), as (1/S) Σs g(θ(s)).
3. Construct a predictive distribution for forecasting,
i.e. simulate the model for each pseudo-random
draw θ(1), ..., θ(S)
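As a concrete illustration of items 1 and 2, both quantities can be read straight off the simulation draws. A minimal sketch; the draws below are fabricated stand-ins for MCMC output, and the N(1, 0.5²) posterior shape is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior draws theta^(1), ..., theta^(S); in practice these
# would come from an MCMC run.
S = 10_000
theta = rng.normal(loc=1.0, scale=0.5, size=S)

# 1. A 95 percent credible set from the 2.5th and 97.5th percentiles.
lo, hi = np.percentile(theta, [2.5, 97.5])

# 2. Posterior mean of g(theta) = theta^2 as (1/S) * sum_s g(theta^(s)).
g_mean = np.mean(theta ** 2)

print(lo, hi, g_mean)
```

The same draws would feed step 3 (forecasting): simulate the model once per draw θ(s) and pool the results.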
• There are a few advantages to the Bayesian approach.
• The first is that it is very elegant numerically.
• There are some problems in latent variable, time series and other models that can only be solved through the use of Bayes.
• The second is that Bayes is exact in finite samples (up to our ability to approximate the posterior).
• Asymptotic theory depends on first-order Taylor series expansions.
• Essentially, you are hoping that the first and second derivatives capture the behavior of the function.
• This can be a very poor approximation in finite samples in some cases.
• In Bayes, such linearizations are not required.
• However, you do need to be able to simulate the posterior accurately.
• Third, Bayes fits into decision theory and can help to guide rational decision making.
• For example, Rossi et al. (1997) study the problem of target marketing.
• You see each household in a scanner panel data set choose a handful of times.
• You can use Bayes’ theorem to update on the
random coefficients of each household i.
• e.g. if there are 10,000 households, you have a posterior for the preference parameters of each of the 10,000 households conditional on its purchase history.
• Using this posterior, you can form your posterior
beliefs about the profits from sending a coupon
to an individual household.
• Fourth, in Bayes you can form posteriors over
models.
• Suppose that you have m = 1, ..., M probability models, fm(y|θ).
• Then f(y|θ) = Σm pm fm(y|θ), summing over m = 1, ..., M, where pm is the probability of model m.
• You can use Bayes' theorem to express your posterior probability pm about model m.
• This is a very elegant way to handle non-nested models and might be superior to classical approaches to non-nested testing.
• Finally, in Bayesian econometrics, you can work with models that are not identified or that do not exhibit normal asymptotics.
• A flat likelihood does not affect your ability to construct a posterior.
3 Markov Chains.
• Two common ways to conduct MCMC are Gibbs sampling and Metropolis.
• A normal random walk Metropolis works as follows.
• First, the econometrician comes up with a rough guess θ0 at the MLE.
• Second, come up with a rough guess I0 at the information matrix using the Hessian at the MLE.
• A sequence of pseudorandom values θ(1), ..., θ(S) is drawn as follows. Given θ(s), we draw θ(s+1) as follows:
1. First, draw a candidate value θ̃ ∼ N(θ(s), I0)
2. Second, compute α = min{ p(θ̃)f(y|θ̃) / [p(θ(s))f(y|θ(s))], 1 }
3. Set θ(s+1) = θ̃ with probability α and θ(s+1) = θ(s) with probability 1 − α.
• Implementing this algorithm simply requires the econometrician to evaluate the likelihood repeatedly and draw normal deviates.
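The steps above can be sketched in a few lines. The target log-density below (a standard normal) stands in for log[p(θ)f(y|θ)]; the step size and starting value are illustrative assumptions, not from any particular application:

```python
import numpy as np

# A minimal random-walk Metropolis sketch for a scalar parameter.

def log_post(theta):
    return -0.5 * theta ** 2  # log p(theta) + log f(y|theta), up to a constant

def metropolis(theta0, step, S, rng):
    draws = np.empty(S)
    theta = theta0
    for s in range(S):
        cand = theta + step * rng.standard_normal()   # candidate ~ N(theta^(s), step^2)
        log_alpha = log_post(cand) - log_post(theta)  # log of the acceptance ratio
        if np.log(rng.uniform()) < log_alpha:         # accept with prob min{ratio, 1},
            theta = cand                              # ...otherwise keep theta^(s)
        draws[s] = theta
    return draws

rng = np.random.default_rng(0)
draws = metropolis(theta0=0.0, step=1.0, S=20_000, rng=rng)
print(draws.mean(), draws.std())  # should be near 0 and 1 for this target
```

Working on the log scale avoids overflow when the likelihood is a product of many terms.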
• A second algorithm for constructing a Markov chain is Gibbs sampling.
• Partition the parameters into d blocks θ1, ..., θd.
• Let pk(θk|θ1, ..., θk−1, θk+1, ..., θd) denote the conditional distribution of the kth block of parameters given the others.
• In some applications, this distribution can be convenient to form even if the entire likelihood is quite complicated!
• Starting with an initial value θ(0), Gibbs sampling works as follows. Given θ(s):
1. Draw θ1(s+1) ∼ p1(θ1|θ2(s), θ3(s), ..., θd(s))
2. Draw θ2(s+1) ∼ p2(θ2|θ1(s+1), θ3(s), θ4(s), ..., θd(s))
3. Draw θ3(s+1) ∼ p3(θ3|θ1(s+1), θ2(s+1), θ4(s), ..., θd(s))
...
d. Draw θd(s+1) ∼ pd(θd|θ1(s+1), θ2(s+1), ..., θd−1(s+1))
d+1. Return to 1.
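The cycle above can be sketched on a toy target where both conditionals are convenient: (θ1, θ2) bivariate normal with zero means, unit variances and correlation ρ. The target is an illustrative assumption, not from the lecture:

```python
import numpy as np

# A two-block Gibbs sampler for a bivariate normal target.
rho = 0.8
rng = np.random.default_rng(0)
S = 20_000
draws = np.empty((S, 2))
t1, t2 = 0.0, 0.0  # initial value theta^(0)

for s in range(S):
    # Block 1: theta1 | theta2 ~ N(rho * theta2, 1 - rho^2)
    t1 = rho * t2 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    # Block 2: theta2 | theta1 ~ N(rho * theta1, 1 - rho^2)
    t2 = rho * t1 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    draws[s] = (t1, t2)

corr = np.corrcoef(draws.T)[0, 1]
print(corr)  # should be near rho
```

Each pass draws one block from its conditional given the most recent values of the others, exactly as in steps 1 through d.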
4 Bayesian Analysis of Regression
• MCMC/Gibbs sampling are particularly powerful in problems with latent variables.
• In discrete choice, utility is latent to the econometrician.
• In a multinomial probit, if utility were observed by the econometrician, estimating parameters would boil down to linear regression.
• For our analysis, it will be useful to consider the Bayesian analysis of linear regression.
• A key step in Rossi et al.'s paper essentially involves the Bayesian analysis of a normal linear regression.
• The analysis here follows Geweke's textbook, Contemporary Bayesian Econometrics (an excellent intro to the subject).
• y is a T × 1 vector of dependent variables
• X is a T ×k matrix of covariates (nonstochastic)
• Assume that error terms are normal and homoskedastic.
• y|β, h, X ∼ N(Xβ, h^(−1) IT)
p(y|β, h, X) = (2π)^(−T/2) h^(T/2) exp(−h(y − Xβ)′(y − Xβ)/2)
• h is called the precision parameter (the inverse of the variance)
• IT is the T × T identity matrix, so the covariance matrix is scalar
• This is our likelihood function.
• Recall that Bayes' theorem implies that the posterior distribution of the model parameters is proportional to the prior times the likelihood.
• We need to specify a prior distribution for our model parameters.
• p(β) = N(β̄, H^(−1))
p(β) = (2π)^(−k/2) |H|^(1/2) exp(−(β − β̄)′H(β − β̄)/2)
• The prior on β is normal with prior mean β̄ and prior precision H
• The prior distribution on h is s²h ∼ χ²(v)
p(h) = [2^(v/2) Γ(v/2)]^(−1) (s²)^(v/2) h^((v−2)/2) exp(−s²h/2)
• Remark: this is essentially a gamma distribution, rewritten in a manner that will be convenient for reasons below.
• This form for the prior is chosen because of conjugacy, i.e. the posterior distribution can be written in an analytically convenient manner.
• Now recall that the posterior is proportional to the prior times the likelihood.
• That is, p(β, h|X, y) ∝ p(β)p(h)p(y|β, h,X)
• Combining the equations above yields p(β, h|X, y) ∝:
(2π)^(−T/2) [2^(v/2) Γ(v/2)]^(−1) |H|^(1/2) (s²)^(v/2) h^((T+v−2)/2) exp(−s²h/2)
× exp(−(β − β̄)′H(β − β̄)/2 − h(y − Xβ)′(y − Xβ)/2)
• Now recall from Cameron and Trivedi, the idea behind Gibbs sampling is to "block" the parameters into a set of convenient conditional distributions.
• In this case, we will want to block p(β|h, X, y) and p(h|β, X, y)
• Let’s first derive p(β|h,X, y).
• It is obvious from the above expression that β is going to be normally distributed.
• We will want to complete the square inside the exp() to rewrite the expression in the form (β − β̃)′H̃(β − β̃)
• Then β̃ will be the posterior mean and H̃ the posterior precision
• Distributing terms and completing the square in β yields that:
(β − β̄)′H(β − β̄) + h(y − Xβ)′(y − Xβ)
= (β − β̄)′H(β − β̄) + h(ŷ − Xβ)′(ŷ − Xβ) + he′e
where ŷ = Xb are the fitted OLS values and e = y − ŷ are the OLS residuals
= (β − β̃)′H̃(β − β̃) + Q
H̃ = H + hX′X
β̃ = H̃^(−1)(Hβ̄ + hX′Xb)
where b is the OLS estimate of β
Q is a constant that does not depend on β
• Note that the posterior precision H̃ is the sum of the prior precision and the data precision hX′X.
• The posterior mean β̃ is a precision-weighted average of the prior mean and the OLS estimate.
• The weights depend on the prior precision and the precision of the data.
• As the sample size becomes sufficiently large, the data will "swamp" the prior.
• The number of terms in the likelihood is a function of T and grows with the sample size.
• The number of terms in the prior remains fixed.
• Next, we have to derive the posterior for h.
• If you look at the prior times the likelihood, it is obvious that the posterior will be of the form p(h|β, X, y) ∝ h^α exp(−hω)
• If we can derive α and ω we can express the posterior.
• By some straightforward, albeit tedious, algebra we can write the posterior distribution p(h|β, X, y) as:
s̃²h ∼ χ²(ṽ)
s̃² = s² + (y − Xβ)′(y − Xβ)
ṽ = v + T
• A Gibbs sampler generates a pseudo-random sequence (h(s), β(s)), s = 1, ..., S, using the following Markov chain:
1. Given (h(s), β(s)), draw β(s+1) ∼ p(β|h(s), X, y)
2. Given β(s+1), draw h(s+1) ∼ p(h|β(s+1), X, y)
3. Return to 1
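The two-block sampler above is short to implement. A sketch on simulated data; the hyperparameters (prior mean beta_bar, prior precision H, and s2, v for the chi-squared prior on h) and the data-generating values are our assumptions, not Geweke's:

```python
import numpy as np

# Gibbs sampler for the conjugate normal linear regression model.
rng = np.random.default_rng(0)
T, k = 200, 2
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
beta_true, h_true = np.array([1.0, -2.0]), 4.0
y = X @ beta_true + rng.standard_normal(T) / np.sqrt(h_true)

beta_bar, H = np.zeros(k), np.eye(k)  # prior: beta ~ N(beta_bar, H^-1)
s2, v = 1.0, 2.0                      # prior: s2 * h ~ chi2(v)

S = 5_000
beta, h = np.zeros(k), 1.0
beta_draws, h_draws = np.empty((S, k)), np.empty(S)
XtX, Xty = X.T @ X, X.T @ y

for s in range(S):
    # Step 1: beta | h ~ N(beta_tilde, H_tilde^-1), H_tilde = H + h X'X,
    # beta_tilde = H_tilde^-1 (H beta_bar + h X'y)  [note X'X b = X'y]
    H_tilde = H + h * XtX
    cov = np.linalg.inv(H_tilde)
    beta_tilde = cov @ (H @ beta_bar + h * Xty)
    beta = rng.multivariate_normal(beta_tilde, cov)
    # Step 2: h | beta via (s2 + e'e) h ~ chi2(v + T), e = y - X beta
    e = y - X @ beta
    h = rng.chisquare(v + T) / (s2 + e @ e)
    beta_draws[s], h_draws[s] = beta, h

print(beta_draws.mean(axis=0), h_draws.mean())
```

With T = 200 the data dominate this loose prior, so the posterior means land close to the values used to simulate the data.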
• This example illustrates the importance of conjugacy.
• That is, choosing our prior and likelihoods to be the "right" functional form greatly simplifies the analysis of the posterior distribution.
• Standard texts in Bayesian statistics (e.g. Berger, or Bernardo and Smith) have appendices that lay out conjugate distributions.
5 Bayesian Multinomial Probit
• The multinomial probit is closely related to the analysis of the normal linear model.
• The multinomial probit is defined as:
yij = xijβ + εij (1)
var(εi1, ..., εiJ) = h^(−1) IJ
cij = 1{yij > yij′ for all j′ ≠ j}
• In the above, yij is the utility of person i for
alternative j
• εij is the stochastic preference shock
• xij are covariates that enter into i’s utility
• cij = 1 if i chooses j
• If the yij were known, then we could use the Gibbs sampler above to estimate β and h.
• However, the yij are latent variables and therefore we use data augmentation.
• The idea behind data augmentation is simple: we integrate out the distribution of the variables that we do not see.
• Following the notation in Cameron and Trivedi, let f(θ|y, y*) denote the posterior conditional on the observed variables, y, and the latent variables, y*.
• Let f(y*|y, θ) denote the distribution of the latent variable conditional on y and the parameters.
• Then the posterior can be written as:
p(θ|y) = ∫ f(θ|y, y*) f(y*|y, θ) dy*
• Taking account of the latent variable simply involves an additional Gibbs step.
• The distribution of the latent utility yij is a truncated normal distribution.
• If cij = 1, yij is a truncated normal with mean xijβ, precision h and lower truncation point max{yij′ : j′ ≠ j}.
• If cij = 0, yij is a truncated normal with mean xijβ, precision h and upper truncation point max{yij′ : j′ ≠ j}.
• The Gibbs sampler for the multinomial probit simply adds the data augmentation step above:
• A Gibbs sampler generates a pseudo-random sequence (h(s), β(s), {yij(s)}i∈I, j∈J), s = 1, ..., S, using the following Markov chain:
1. Given (h(s), β(s)), draw β(s+1) ∼ p(β|h(s), X, y(s), C)
2. Given β(s+1), draw h(s+1) ∼ p(h|β(s+1), X, y(s), C)
3. For each i, draw yi1(s+1) ∼ p(yi1|β(s+1), h(s+1), X, yi2(s), ..., yiJ(s), C)
4. Draw yi2(s+1) ∼ p(yi2|β(s+1), h(s+1), X, yi1(s+1), yi3(s), ..., yiJ(s), C)
5. ...
6. Draw yiJ(s+1) ∼ p(yiJ|β(s+1), h(s+1), X, yi1(s+1), ..., yi,J−1(s+1), C)
7. Return to 1
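The truncated-normal draws in steps 3-6 can be implemented by the inverse-CDF method. A sketch assuming SciPy is available; the means, precision and truncation points below are illustrative numbers, not from the lecture:

```python
import numpy as np
from scipy.stats import norm

# Inverse-CDF draw from a normal with mean mu and precision h, truncated
# below at `lower` (the chosen alternative, c_ij = 1) or above at `upper`
# (a non-chosen alternative, c_ij = 0).

def trunc_normal_draw(mu, h, rng, lower=None, upper=None):
    sd = 1.0 / np.sqrt(h)
    a = 0.0 if lower is None else norm.cdf((lower - mu) / sd)
    b = 1.0 if upper is None else norm.cdf((upper - mu) / sd)
    u = rng.uniform(a, b)         # uniform on the permitted CDF range
    return mu + sd * norm.ppf(u)  # map back through the inverse CDF

rng = np.random.default_rng(0)
# Chosen alternative: latent utility must exceed the best rival utility (0.7 here)
y_chosen = trunc_normal_draw(mu=0.5, h=1.0, rng=rng, lower=0.7)
# Non-chosen alternative: latent utility must fall below the chosen one's
y_other = trunc_normal_draw(mu=0.2, h=1.0, rng=rng, upper=y_chosen)
print(y_chosen, y_other)
```

Each Gibbs pass cycles these draws over all i and j, so cheap univariate draws like this are what make the data augmentation practical.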
6 Target Marketing
• In "The Value of Purchase History Data in Target Marketing", Rossi et al. attempt to estimate household level preference parameters.
• This is of interest as a marketing problem.
• CMI checkout coupon uses purchase information to customize coupons to a particular household.
• In principle, the entire purchase history (from consumer loyalty cards) could be used to customize coupons (and hence prices).
• If household level preference parameters can be forecast with high precision, this is essentially first degree price discrimination!
• Even with short purchase histories, they find that profits are increased 2.5 fold through the use of purchase data compared to blanket couponing strategies.
• Even one observation can boost profits from couponing by 50%.
• This application is of interest to economists as well.
• The methods in this paper allow us to account for consumer heterogeneity in a very rich manner.
• This might be useful to examine the distribution of the welfare consequences of a policy intervention (e.g. a merger or market regulation).
• Beyond that, these methods demonstrate the power of Bayesian methods in latent variable problems.
7 Random Coefficients Model
• Multinomial probit with panel data on household level choices:
yh,t = Xh,tβh + εh,t, εh,t ∼ N(0, Λ)
βh = ∆zh + vh, vh ∼ N(0, Vβ)
• Households h = 1, ..., H and time t = 1, ..., T
• Xh,t covariates and zh demographics
• Note that household specific random coefficients βh remain fixed over time
• Ih,t is the observed choice
• The posterior distributions are derived in Appendix A.
• Formally, the derivations are very close to our multinomial probit model above.
• Gibbs sampling is used to simulate the posterior distribution of Λ, ∆, Vβ
8 Predictive Distributions
• The authors wish to give different coupons to different households.
• A rational (Bayesian) decision maker would form her beliefs about household h's preference parameters given her posterior about the model parameters.
• This will involve, as we show below, forming a predictive distribution for βh given the econometrician's information set.
• As a first case, suppose that the econometrician only knew zh, the demographics of household h.
• From our model, p(βh|zh,∆, Vβ) is N(∆zh, Vβ)
• Given the posterior p(∆, Vβ|Data), the econometrician's predictive distribution for βh is:
p(βh|zh, Data) = ∫ p(βh|zh, ∆, Vβ) p(∆, Vβ|Data) d(∆, Vβ)
• We can simulate p(βh|zh, Data) given our posterior Gibbs simulations ∆(s), Vβ(s), s = 1, ..., S, as:
(1/S) Σs p(βh|zh, ∆(s), Vβ(s))
• We could draw random βh from p(βh|zh, Data).
• For each ∆(s), Vβ(s), draw βh(s) from p(βh|zh, ∆(s), Vβ(s))
• Given βh(s), s = 1, ..., S, we could then simulate purchase probabilities.
• Draw εht(s) from εh,t ∼ N(0, Λ(s))
• The posterior purchase probability for j, given Xht and zh, is:
(1/S) Σs 1{Xjhtβh(s) + εjht(s) > Xj′htβh(s) + εj′ht(s) for j′ ≠ j}
• This would allow us to simulate the purchase response to different couponing strategies for a specific household h.
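The simulated frequency above is easy to compute once posterior draws are in hand. Everything below (the covariates, the stand-in draws of βh, and the Λ factor) is fabricated for illustration; in practice the draws come from the Gibbs sampler:

```python
import numpy as np

# Average the choice indicator over S posterior draws to get posterior
# purchase probabilities for J alternatives.
rng = np.random.default_rng(0)
S, J, k = 5_000, 3, 2
X_ht = rng.standard_normal((J, k))               # one row of covariates per alternative
beta_draws = rng.normal(1.0, 0.3, size=(S, k))   # stand-in for beta_h^(s)
Lam_chol = np.eye(J)                             # stand-in for chol(Lambda^(s))

probs = np.zeros(J)
for s in range(S):
    eps = Lam_chol @ rng.standard_normal(J)      # eps_ht^(s) ~ N(0, Lambda^(s))
    util = X_ht @ beta_draws[s] + eps            # simulated utilities for each j
    probs[np.argmax(util)] += 1                  # indicator: alternative j wins
probs /= S
print(probs)  # J purchase probabilities summing to 1
```

Repeating this for each candidate coupon (e.g. a price cut entering X_ht) traces out the household's simulated purchase response.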
• The paper runs through different couponing strategies given different information sets (e.g. full or choice-only information sets).
• The key ideas are similar: form a predictive distribution for h's preferences and simulate purchase behavior in an analogous fashion.
• In the case of a full purchase information history, we could use the raw Gibbs output, since the Markov chain will simulate βh(s), s = 1, ..., S.
• This could then be used to simulate choice behavior as in the example above (given draws of εht(s)).
9 Data
• AC Nielsen scanner panel data for tuna in Springfield, Missouri.
• 400 households, 1.5 years, 1-61 purchases.
• Brands and covariates in Table 2.
• Demographics Table 3.
• Table 4, delta coefficients.
• Poorer people prefer private label.
• Goodness of fit is moderate for demographic coefficients.
• Figures 1 and 2, household level coefficient estimates with different information sets.
• Table 5, return to different marketing strategies.
• Bottom line, you gain 0.5 to 1.0 cents per customer through better estimates.
• With a lot of customers, this could be quite profitable.