To cite this chapter: Kruschke, J. K., & Vanpaemel, W. (2015). Bayesian estimation in hierarchical models. In J. Busemeyer, J. Townsend, Z. J. Wang, & A. Eidels (Eds.), The Oxford handbook of computational and mathematical psychology (pp. 279-299). Oxford: Oxford University Press.


CHAPTER 13

Bayesian Estimation in Hierarchical Models

John K. Kruschke and Wolf Vanpaemel

Abstract

Bayesian data analysis involves describing data by meaningful mathematical models, and allocating credibility to parameter values that are consistent with the data and with prior knowledge. The Bayesian approach is ideally suited for constructing hierarchical models, which are useful for data structures with multiple levels, such as data from individuals who are members of groups which in turn are in higher-level organizations. Hierarchical models have parameters that meaningfully describe the data at their multiple levels and connect information within and across levels. Bayesian methods are very flexible and straightforward for estimating parameters of complex hierarchical models (and simpler models too). We provide an introduction to the ideas of hierarchical models and to the Bayesian estimation of their parameters, illustrated with two extended examples. One example considers baseball batting averages of individual players grouped by fielding position. A second example uses a hierarchical extension of a cognitive process model to examine individual differences in attention allocation of people who have eating disorders. We conclude by discussing Bayesian model comparison as a case of hierarchical modeling.

Key Words: Bayesian statistics, Bayesian data analysis, Bayesian modeling, hierarchical model, model comparison, Markov chain Monte Carlo, shrinkage of estimates, multiple comparisons, individual differences, cognitive psychometrics, attention allocation

The Ideas of Hierarchical Bayesian Estimation

Bayesian reasoning formalizes the reallocation of credibility over possibilities in consideration of new data. Bayesian reasoning occurs routinely in everyday life. Consider the logic of the fictional detective Sherlock Holmes, who famously said that when a person has eliminated the impossible, then whatever remains, no matter how improbable, must be the truth (Doyle, 1890). His reasoning began with a set of candidate possibilities, some of which had low credibility a priori. Then he collected evidence through detective work, which ruled out some possibilities. Logically, he then reallocated credibility to the remaining possibilities. The complementary logic of judicial exoneration is also

commonplace. Suppose there are several unaffiliated suspects for a crime. If evidence implicates one of them, then the other suspects are exonerated. Thus, the initial allocation of credibility (i.e., culpability) across the suspects was reallocated in response to new data.

In data analysis, the space of possibilities consists of parameter values in a descriptive model. For example, consider a set of data measured on a continuous scale, such as the weights of a group of 10-year-old children. We might want to describe the set of data in terms of a mathematical normal distribution, which has two parameters, namely the mean and the standard deviation. Before collecting the data, the possible means and standard deviations have some prior credibility, about which


we might be very uncertain or highly informed. After collecting the data, we reallocate credibility to values of the mean and standard deviation that are reasonably consistent with the data and with our prior beliefs. The reallocated credibilities constitute the posterior distribution over the parameter values.

We care about parameter values in formal models because the parameter values carry meaning. When we say that the mean weight is 32 kilograms and the standard deviation is 3.2 kilograms, we have a clear sense of how the data are distributed (according to the model). As another example, suppose we want to describe children's growth with a simple linear function, which has a slope parameter. When we say that the slope is 5 kilograms per year, we have a clear sense of how weight changes through time (according to the model). The central goal of Bayesian estimation, and a major goal of data analysis generally, is deriving the most credible parameter values for a chosen descriptive model, because the parameter values are meaningful in the context of the model.

Bayesian estimation provides an entire distribution of credibility over the space of parameter values, not merely a single "best" value. The distribution precisely captures our uncertainty about the parameter estimate. The essence of Bayesian estimation is to formally describe how uncertainty changes when new data are taken into account.

Hierarchical Models Have Parameters with Hierarchical Meaning

In many situations, the parameters of a model have meaningful dependencies on each other. As a simplistic example, suppose we want to estimate the probability that a type of trick coin, manufactured by the Acme Toy Company, comes up heads. We know that different coins of that type have somewhat different underlying biases to come up heads, but there is a central tendency in the bias imposed by the manufacturing process. Thus, when we flip several coins of that type, each several times, we can estimate the underlying biases in each coin and the typical bias and consistency of the manufacturing process. In this situation, the observed heads of a coin depend only on the bias in the individual coin, but the bias in the coin depends on the manufacturing parameters. This chain of dependencies among parameters exemplifies a hierarchical model (Kruschke, 2015, Ch. 9).

As another example, consider research into childhood obesity. The researchers measure weights of children in a number of different schools that have different school lunch programs, and from a number of different school districts that may have different but unknown socioeconomic statuses. In this case, a child's weight might be modeled as dependent on his or her school lunch program. The school lunch program is characterized by parameters that indicate the central tendency and variability of weights that it tends to produce. The parameters of the school lunch program are, in turn, dependent on the school's district, which is described by parameters indicating the central tendency and variability of school-lunch parameters across schools in the district. This chain of dependencies among parameters again exemplifies a hierarchical model.

In general, a model is hierarchical if the probability of one parameter can be conceived to depend on the value of another parameter. Expressed formally, suppose the observed data, denoted D, are described by a model with two parameters, denoted α and β. The probability of the data is a mathematical function of the parameter values, denoted by p(D|α,β), which is called the likelihood function of the parameters. The prior probability of the parameters is denoted p(α,β). Notice that the likelihood and prior are expressed, so far, in terms of combinations of α and β in the joint parameter space. The probability of the data, weighted by the probability of the parameter values, is the product, p(D|α,β)p(α,β). The model is hierarchical if that product can be factored as a chain of dependencies among parameters, such as p(D|α,β)p(α,β) = p(D|α)p(α|β)p(β).
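To make the factored chain of dependencies concrete, here is a minimal generative sketch of the trick-coin example in Python (our illustration, not code from the chapter; all constants are invented for the demonstration). The mint-level parameters play the role of β, the per-coin biases play the role of α, and the observed flips depend only on each coin's own bias, mirroring p(D|α)p(α|β)p(β).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical manufacturing-level parameters (the role of beta):
# typical bias and consistency of the Acme mint.
omega, kappa = 0.7, 20.0

# Per-coin biases (the role of alpha), drawn from the mint's distribution:
n_coins, n_flips = 5, 50
theta = rng.beta(omega * kappa, (1 - omega) * kappa, size=n_coins)

# Observed data: each coin's head count depends only on that coin's bias.
heads = rng.binomial(n_flips, theta)

print("coin biases:", np.round(theta, 3))
print("heads out of", n_flips, "flips:", heads)
```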

Many models can be reparameterized, and conditional dependencies can be revealed or obscured under different parameterizations. The notion of hierarchical has to do with a particular meaningful definition of a model structure that expresses dependencies among parameters in a meaningful way. In other words, it is the semantics of the parameters when factored in the corresponding way that makes a model hierarchical. Ultimately, any multiparameter model merely has parameters in a joint space, whether that joint space is conceived as hierarchical or not. Many realistic situations involve natural hierarchical meaning, as illustrated by the two major examples that will be described at length in this chapter.

One of the primary applications of hierarchical models is describing data from individuals within


groups. A hierarchical model may have parameters for each individual that describe each individual's tendencies, and the distribution of individual parameters within a group is modeled by a higher-level distribution with its own parameters that describe the tendency of the group. The individual-level and group-level parameters are estimated simultaneously. Therefore, the estimate of each individual-level parameter is informed by all the other individuals via the estimate of the group-level distribution, and the group-level parameters are more precisely estimated by the jointly constrained individual-level parameters. The hierarchical approach is better than treating each individual independently because the data from different individuals meaningfully inform one another. And the hierarchical approach is better than collapsing all the individual data together because collapsed data may blur or obscure trends within each individual.

Advantages of the Bayesian Approach

Bayesian methods provide tremendous flexibility

in designing models that are appropriate for describing the data at hand, and Bayesian methods provide a complete representation of parameter uncertainty (i.e., the posterior distribution) that can be directly interpreted. Unlike the frequentist interpretation of parameters, there is no construction of sampling distributions from auxiliary null hypotheses. In a frequentist approach, although it may be possible to find a maximum-likelihood estimate (MLE) of parameter values in a hierarchical nonlinear model, the subsequent task of interpreting the uncertainty of the MLE can be very difficult. To decide whether an estimated parameter value is significantly different from a null value, frequentist methods demand construction of sampling distributions of arbitrarily defined deviation statistics, generated from arbitrarily defined null hypotheses, from which p values are determined for testing null hypotheses. When there are multiple tests, frequentist decision rules must adjust the p values. Moreover, frequentist methods are unwieldy for constructing confidence intervals on parameters, especially for complex hierarchical nonlinear models that are often the primary interest for cognitive scientists.^1

Furthermore, confidence intervals change when the researcher intention changes (e.g., Kruschke, 2013). Frequentist methods for measuring uncertainty (as confidence intervals from sampling distributions) are fickle and difficult, whereas Bayesian methods

are inherently designed to provide clear representations of uncertainty. A thorough critique of frequentist methods such as p values would take us too far afield. Interested readers may consult many other references, such as articles by Kruschke (2013) or Wagenmakers (2007).

Some Mathematics and Mechanics of Bayesian Estimation

The mathematically correct reallocation of credibility over parameter values is specified by Bayes' rule (Bayes & Price, 1763):

p(α|D) = p(D|α) p(α) / p(D)   (1)

where p(α|D) is the posterior, p(D|α) is the likelihood, p(α) is the prior, and

p(D) = ∫ dα p(D|α) p(α)   (2)

is called the "marginal likelihood" or "evidence." The formula in Eq. 1 is a simple consequence of the definition of conditional probability (e.g., Kruschke, 2015), but it has huge ramifications when applied to meaningful, complex models.

In some simple situations, the mathematical form of the posterior distribution can be analytically derived. These cases demand that the integral in Eq. 2 can be mathematically derived in conjunction with the product of terms in the numerator of Bayes' rule. When this can be done, the result can be especially pleasing because an explicit, simple formula for the posterior distribution is obtained.
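As a concrete instance of such an analytical solution, the beta prior is conjugate to the binomial likelihood: a beta(a, b) prior combined with z successes in N trials yields a beta(a + z, b + N − z) posterior. The sketch below (with illustrative numbers, not data from the chapter) computes that posterior in Python; note that the interval printed here is an equal-tailed interval, not the highest density interval discussed later.

```python
from scipy import stats

# Conjugate beta-binomial updating: a beta(a, b) prior plus z successes
# in N trials yields the posterior beta(a + z, b + N - z) exactly.
a, b = 1.0, 1.0          # uniform prior over the underlying rate
z, N = 7, 24             # hypothetical data: 7 successes in 24 trials

posterior = stats.beta(a + z, b + N - z)
print("posterior mean:", posterior.mean())
print("equal-tailed 95% interval:", posterior.ppf([0.025, 0.975]))
```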

Analytical solutions for Bayes' rule can rarely be achieved for realistically complex models. Fortunately, instead, the posterior distribution is approximated, to arbitrarily high accuracy, by generating a huge random sample of representative parameter values from the posterior distribution. A large class of algorithms for generating a representative random sample from a distribution is called Markov chain Monte Carlo (MCMC) methods. Regardless of which particular sampler from the class is used, in the long run they all converge to an accurate representation of the posterior distribution. The bigger the MCMC sample, the finer-resolution picture we have of the posterior distribution. Because the sampling process uses a Markov chain, the random sample produced by the MCMC process is often called a chain.
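The sketch below implements one member of that class, a bare-bones random-walk Metropolis sampler, applied to the conjugate example above so the MCMC answer can be checked against the analytic one. It is a teaching illustration only, not the sampler used in the chapter's analyses (which were run in JAGS via runjags).

```python
import numpy as np

rng = np.random.default_rng(seed=2)
z, N = 7, 24                          # same hypothetical data as above

def log_post(theta):
    """Unnormalized log posterior: binomial likelihood, uniform prior."""
    if not 0.0 < theta < 1.0:
        return -np.inf                # zero density outside (0, 1)
    return z * np.log(theta) + (N - z) * np.log(1.0 - theta)

chain = np.empty(50_000)
current = 0.5                         # arbitrary starting value
for step in range(chain.size):
    proposal = current + rng.normal(scale=0.1)     # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(current):
        current = proposal                         # accept the proposal
    chain[step] = current             # otherwise repeat the current value

print("posterior mean from the chain:", chain[1000:].mean())  # drop burn-in
```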


Box 1 MCMC Details

Because the MCMC sampling is a random walk through parameter space, we would like some assurance that it successfully explored the posterior distribution without getting stuck, oversampling, or undersampling zones of the posterior. Mathematically, the samplers will be accurate in the long run, but we do not know in advance exactly how long is long enough to produce a reasonably good sample.

There are various diagnostics for assessing MCMC chains. It is beyond the scope of this chapter to review their details, but the ideas are straightforward. One type of diagnostic assesses how "clumpy" the chain is, by using a descriptive statistic called the autocorrelation of the chain. If a chain is strongly autocorrelated, successive steps in the chain are near each other, thereby producing a clumpy chain that takes a long time to smooth out. We want a smooth sample to be sure that the posterior distribution is accurately represented in all regions of the parameter space. To achieve stable estimates of the tails of the posterior distribution, one heuristic is that we need about 10,000 independent representative parameter values (Kruschke, 2015, Section 7.5.2). Stable estimates of central tendencies can be achieved by smaller numbers of independent values. A statistic called the effective sample size (ESS) takes into account the autocorrelation of the chain and suggests what would be an equivalently sized sample of independent values.

Another diagnostic assesses whether the MCMC chain has gotten stuck in a subset of the posterior distribution, rather than exploring the entire posterior parameter space. This diagnostic takes advantage of running two or more distinct chains, and assessing the extent to which the chains overlap. If several different chains thoroughly overlap, we have evidence that the MCMC samples have converged to a representative sample.
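The following rough Python sketches convey the flavor of these two diagnostics. They are simplified illustrations, not the exact formulas implemented by packages such as runjags or Stan (e.g., the disagreement measure below is a crude stand-in for the usual potential-scale-reduction statistic).

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """Crude ESS: chain length deflated by summed positive autocorrelations."""
    x = chain - chain.mean()
    acf_sum = 0.0
    for lag in range(1, max_lag):
        r = np.corrcoef(x[:-lag], x[lag:])[0, 1]
        if r < 0.0:
            break                     # truncate at the first negative lag
        acf_sum += r
    return len(chain) / (1.0 + 2.0 * acf_sum)

def chains_disagreement(chains):
    """Spread of the chains' means relative to their within-chain spread.
    Values near 0 suggest the chains overlap; large values suggest
    one or more chains are stuck in a subset of the posterior."""
    means = [c.mean() for c in chains]
    sds = [c.std() for c in chains]
    return np.std(means) / np.mean(sds)

rng = np.random.default_rng(seed=3)
demo_chains = [rng.normal(size=10_000) for _ in range(3)]
print(effective_sample_size(demo_chains[0]))
print(chains_disagreement(demo_chains))
```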

It is important to understand that the MCMC "sample" or "chain" is a huge representative sample of parameter values from the posterior distribution. The MCMC sample is not to be confused with the sample of data. For any particular analysis, there is a single fixed sample of data, and there is a single underlying mathematical posterior distribution

that is inferred from the sample of data. The MCMC chain typically uses tens of thousands of representative parameter values from the posterior distribution to represent the posterior distribution. Box 1 provides more details about assessing when an MCMC chain is a good representation of the underlying posterior distribution.

Contemporary MCMC software works seamlessly for complex hierarchical models involving nonlinear relationships between variables and non-normal distributions at multiple levels. Model-specification languages such as BUGS (Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2013; Lunn, Thomas, Best, & Spiegelhalter, 2000), JAGS (Plummer, 2003), and Stan (Stan, 2013) allow the user to specify descriptive models to satisfy theoretical and empirical demands.

Example: Shrinkage and Multiple Comparisons of Baseball Batting Abilities

American baseball is a sport in which one person, called a pitcher, throws a small ball as quickly as possible over a small patch of earth, called home plate, next to which is standing another person holding a stick, called a bat, who tries to hit the ball with the bat. If the ball is hit appropriately into the field, the batter attempts to run to other marked patches of earth arranged in a diamond shape. The batter tries to arrive at the first patch of earth, called first base, before the other players, called fielders, can retrieve the ball and throw it to a teammate attending first base.

One of the crucial abilities of baseball players is, therefore, the ability to hit a very fast ball (sometimes thrown more than 90 miles [145 kilometers] per hour) with the bat. An important goal for enthusiasts of baseball is estimating each player's ability to bat the ball. Ability cannot be assessed directly but can only be estimated by observing how many times a player was able to hit the ball in all his opportunities at bat, or by observing hits and at-bats from other similar players.

There are nine players in the field at once, who specialize in different positions. These include the pitcher, the catcher, the first base man, the second base man, the third base man, the shortstop, the left fielder, the center fielder, and the right fielder. When one team is in the field, the other team is at bat. The teams alternate being at bat and being in the field. Under some rules, the pitcher does not have to bat when his team is at bat.

Because different positions emphasize different skills while on the field, not all players are prized


for their batting ability alone. In particular, pitchers and catchers have specialized skills that are crucial for team success. Therefore, based on the structure of the game, we know that players with different primary positions are likely to have different batting abilities.

The Data

The data consist of records from 948 players in

the 2012 regular season of Major League Baseball who had at least one at-bat.^2 For player i, we have his number of opportunities at bat, ABi, his number of hits, Hi, and his primary position when in the field, pp(i). In the data, there were 324 pitchers with a median of 4.0 at-bats, 103 catchers with a median of 170.0 at-bats, and 60 right fielders with a median of 340.5 at-bats, along with 461 players in six other positions.

The Descriptive Model with Its Meaningful Parameters

We want to estimate, for each player, his underlying probability θi of hitting the ball when at bat. The primary data to inform our estimate of θi are the player's number of hits, Hi, and his number of opportunities at bat, ABi. But the estimate will also be informed by our knowledge of the player's primary position, pp(i), and by the data from all the other players (i.e., their hits, at-bats, and positions). For example, if we know that player i is a pitcher, and we know that pitchers tend to have θ values around 0.13 (because of all the other data), then our estimate of θi should be anchored near 0.13 and adjusted by the specific hits and at-bats of the individual player. We will construct a hierarchical model that rationally shares information across players within positions, and across positions within all major league players.^3

We denote the ith player's underlying probability of getting a hit as θi. (See Box 2 for discussion of assumptions in modeling.) Then the number of hits Hi out of ABi at-bats is a random draw from a binomial distribution that has success rate θi, as illustrated at the bottom of Figure 13.1. The arrow pointing to Hi is labeled with a "∼" symbol to indicate that the number of hits is a random variable distributed as a binomial distribution.

To formally express our prior belief that different primary positions emphasize different skills and hence have different batting abilities, we assume that the player abilities θi come from distributions specific to each position. Thus, the θi's for the 324

Box 2 Model Assumptions

For the analysis of batting abilities, we assume that a player's batting ability, θi, is constant for all at-bats, and that the outcome of any at-bat is independent of other at-bats. These assumptions may be false, but the notion of a constant underlying batting ability is a meaningful construct for our present purposes. Assumptions must be made for any statistical analysis, whether Bayesian or not, and the conclusions from any statistical analysis are conditional on its assumptions. An advantage of Bayesian analysis is that, relative to 20th century frequentist techniques, there is greater flexibility to make assumptions that are appropriate to the situation. For example, if we wanted to build a more elaborate analysis, we could incorporate data about when in the season the at-bats occurred, and estimate temporal trends in ability due to practice or fatigue. Or, we could incorporate data about which pitcher was being faced in each at-bat, and we could estimate pitcher difficulties simultaneously with batter abilities. But these elaborations, although possible in the Bayesian framework, would go far beyond our purposes in this chapter.

pitchers are assumed to come from a distribution specific to pitchers, that might have a different central tendency and dispersion than the distribution of abilities for the 103 catchers, and so on for the other positions. We model the distribution of θi's for a position as a beta distribution, which is a natural distribution for describing values that fall between zero and one, and is often used in this sort of application (e.g., Kruschke, 2015). The mean of the beta distribution for primary position pp is denoted μpp, and the narrowness of the distribution is denoted κpp. The value of μpp represents the typical batting ability of players in primary position pp, and the value of κpp represents how tightly clustered the abilities are across players in primary position pp. The κ parameter is sometimes called the concentration or precision of the beta distribution.^4 Thus, an individual player whose primary position is pp(i) is assumed to have a batting ability θi that comes from a beta distribution with mean μpp(i) and precision κpp(i). The values of μpp and κpp are estimated simultaneously with all the θi. Figure 13.1 illustrates this aspect of the model by showing an arrow pointing to θi


from a beta distribution. The arrow is labeled with "∼ ... i" to indicate that the θi have credibilities distributed as a beta distribution for each of the individuals. The diagram shows beta distributions as they are conventionally parameterized by two shape parameters, denoted app and bpp, that can be algebraically redescribed in terms of the mean μpp and precision κpp of the distribution: app = μpp κpp and bpp = (1 − μpp) κpp.
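In code, the reparameterization is a one-line conversion. The brief sketch below (with made-up values for μ and κ, merely for illustration) shows how a (mean, precision) pair maps to the conventional shape parameters.

```python
from scipy import stats

def beta_from_mean_precision(mu, kappa):
    """Convert (mean, precision) to the beta shape parameters (a, b)."""
    return mu * kappa, (1.0 - mu) * kappa

# E.g., a position whose typical ability is 0.13 with precision 100:
a, b = beta_from_mean_precision(0.13, 100.0)
print(a, b)                           # 13.0, 87.0
print(stats.beta(a, b).mean())        # recovers the mean of 0.13
```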

To formally express our prior knowledge that all players, from all positions, are professionals in major league baseball, and, therefore, should mutually inform each other's estimates, we assume that the nine position abilities μpp come from an overarching beta distribution with mean μμpp

and precision κμpp. This structure is illustrated in the upper part of Figure 13.1 by the split arrow, labeled with "∼ ... pp", pointing to μpp from a beta distribution. The value of μμpp in the overarching distribution represents our estimate of the batting ability of major league players generally, and the value of κμpp represents how tightly clustered the abilities are across the nine positions. These across-position parameters are

estimated from the data, along with all the other parameters.

The precisions of the nine distributions are also estimated from the data. The precisions of the position distributions, κpp, are assumed to come from an overarching gamma distribution, as illustrated in Figure 13.1 by the split arrow, labeled with "∼ ... pp", pointing to κpp from a gamma distribution. A gamma distribution is a generic and natural distribution for describing non-negative values such as precisions (e.g., Kruschke, 2015). A gamma distribution is conventionally parameterized by shape and rate values, denoted in Figure 13.1 as sκpp and rκpp. We assume that the precisions of each position can mutually inform each other; that is, if the batting abilities of catchers are tightly clustered, then the batting abilities of shortstops should probably also be tightly clustered, and so forth. Therefore the shape and rate parameters of the gamma distribution are themselves estimated.

At the top level in Figure 13.1 we incorporate any prior knowledge we might have about general properties of batting abilities for players in the

Fig. 13.1 The hierarchical descriptive model for baseball batting ability. The diagram should be scanned from the bottom up. At the bottom, the number of hits by the ith player, Hi, are assumed to come from a binomial distribution with maximum value being the at-bats, ABi, and probability of getting a hit being θi. See text for further details.


major leagues, such as evidence from previous seasons of play. Baseball aficionados may have extensive prior knowledge that could be usefully implemented in a Bayesian model. Unlike baseball experts, we have no additional background knowledge, and, therefore, we will use very vague and noncommittal top-level prior distributions. Thus, the top-level beta distribution on the overall batting ability is given parameter values A = 1 and B = 1, which make it uniform over all possible batting abilities from zero to one. The top-level gamma distributions (on precision, shape, and rate) are given parameter values that make them extremely broad and noncommittal such that the data dominate the estimates, with minimal influence from the top-level prior.

There are 970 parameters in the model altogether: 948 individual θi, plus μpp and κpp for each of the nine primary positions, plus μμpp and κμpp across positions, plus sκpp and rκpp. The Bayesian analysis yields credible combinations of the parameters in the 970-dimensional joint parameter space.
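To summarize that structure compactly, the following Python sketch generates fake data from the Figure 13.1 hierarchy, top down. All numeric constants are illustrative stand-ins; in the actual analysis these quantities are parameters estimated from the data, not fixed.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Top level: overarching beta on position means, gamma on position
# precisions. These constants are illustrative stand-ins only.
mu_mu, kappa_mu = 0.23, 80.0          # overarching mean and precision
shape_k, rate_k = 2.0, 0.05           # gamma shape and rate for kappa_pp

# Position level: a mean and a precision for each of nine positions.
n_positions = 9
mu_pp = rng.beta(mu_mu * kappa_mu, (1 - mu_mu) * kappa_mu, size=n_positions)
kappa_pp = rng.gamma(shape_k, 1.0 / rate_k, size=n_positions)  # scale = 1/rate

# Player level: abilities theta_i within each position, then observed hits.
players_per_pos, at_bats = 10, 200
for p in range(n_positions):
    a, b = mu_pp[p] * kappa_pp[p], (1 - mu_pp[p]) * kappa_pp[p]
    theta = rng.beta(a, b, size=players_per_pos)
    hits = rng.binomial(at_bats, theta)
    print(f"position {p + 1}: mu_pp = {mu_pp[p]:.3f}, first hits = {hits[:3]}")
```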

We care about the parameter values because they are meaningful. Our primary interest is in the estimates of individual batting abilities, θi, and in the position-specific batting abilities, μpp. We are also able to examine the relative precisions of abilities across positions to address questions such as, Are batting abilities of catchers as variable as batting abilities of shortstops? We will not do so here, however.

Results: Interpreting the Posterior Distribution

We used MCMC chains with total saved length of 15,000 after adaptation of 1,000 steps and burn-in of 1,000 steps, using 3 parallel chains called from the runjags package (Denwood, 2013), thinned by 30 merely to keep a modest file size for the saved chain. The diagnostics (see Box 1) assured us that the chains were adequate to provide an accurate and high-resolution representation of the posterior distribution. The effective sample size (ESS) for all the reported parameters and differences exceeded 6,000, with nearly all exceeding 10,000.

check of robustness against changes in top-level prior constants

Because we wanted the top-level prior distribution to be noncommittal and have minimal influence on the posterior distribution, we checked whether the choice of prior had any notable effect on the posterior. We conducted the analysis with

different constants in the top-level gamma distributions, to check whether they had any notable influence on the resulting posterior distribution. Whether all gamma distributions used shape and rate constants of 0.1 and 0.1, or 0.001 and 0.001, the results were essentially identical. The results reported here are for gamma constants of 0.001 and 0.001.

comparisons of positions

We first consider the estimates of hitting ability

for different positions. Figure 13.2, left side, shows the marginal posterior distributions for the μpp parameters for the positions of catcher and pitcher. The distributions show the credible values of the parameters generated by the MCMC chain. These marginal distributions collapse across all other parameters in the high-dimensional joint parameter space. The lower-left panel in Figure 13.2 shows the distribution of differences between catchers and pitchers. At every step in the MCMC chain, the difference between the credible values of μcatcher and μpitcher was computed, to produce a credible

value for the difference. The result is 15,000 credible differences (one for each step in the MCMC chain).

For each marginal posterior distribution, we provide two summaries: its approximate mode, displayed on top, and its 95% highest density interval (HDI), shown as a black horizontal bar. A parameter value inside the HDI has higher probability density (i.e., higher credibility) than a parameter value outside the HDI. The total probability of parameter values within the 95% HDI is 95%. The 95% HDI indicates the 95% most credible parameter values.
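Computing an HDI from an MCMC sample is straightforward: sort the draws and find the narrowest window containing 95% of them. The sketch below (our illustration, using fake normal draws in place of the real chain) also shows how a difference such as μcatcher − μpitcher is computed step by step from the chain, as described above.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the sampled values."""
    x = np.sort(np.asarray(samples))
    n_in = int(np.ceil(mass * len(x)))
    widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
    left = np.argmin(widths)          # left edge of the narrowest window
    return x[left], x[left + n_in - 1]

# Fake stand-ins for the MCMC draws of two position means:
rng = np.random.default_rng(seed=5)
mu_catcher = rng.normal(0.241, 0.004, size=15_000)
mu_pitcher = rng.normal(0.130, 0.005, size=15_000)

diff = mu_catcher - mu_pitcher        # one credible difference per MCMC step
print("95% HDI of the difference:", hdi(diff))
```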

The posterior distribution can be used to make discrete decisions about specific parameter values (as explained in Box 3). For comparing catchers and pitchers, the distribution of credible differences falls far from zero, so we can say with high credibility that catchers hit better than pitchers. (The difference is so big that it excludes any reasonable ROPE around zero that would be used in the decision rule described in Box 3.)

The right side of Figure 13.2 shows the marginal posterior distributions of the μpp parameters for the positions of right fielder and catcher. The lower-right panel shows the distribution of differences between right fielders and catchers. The 95% HDI of differences excludes a difference of zero, with 99.8% of the distribution falling above zero. Whether or not we reject zero as a credible


[Figure 13.2 appears here. Left column: Catcher (μpp=2), mode = 0.241, 95% HDI [0.233, 0.250]; Pitcher (μpp=1), mode = 0.130, 95% HDI [0.120, 0.141]; Difference (μpp=2 − μpp=1), mode = 0.111, 95% HDI [0.0976, 0.125], 0% < 0 < 100%, 0% in ROPE. Right column: Right Field (μpp=9), mode = 0.260, 95% HDI [0.251, 0.269]; Catcher (μpp=2), mode = 0.241, 95% HDI [0.233, 0.250]; Difference (μpp=9 − μpp=2), mode = 0.0183, 95% HDI [0.0056, 0.031], 0.2% < 0 < 99.8%, 10% in ROPE.]

Fig. 13.2 Comparison of estimated batting abilities of different positions. In the data, there were 324 pitchers with a median of 4.0 at-bats, 103 catchers with a median of 170.0 at-bats, and 60 right fielders with a median of 340.5 at-bats, along with 461 players in six other positions. The modes and HDI limits are all indicated to three significant digits, with a trailing zero truncated from the display. In the lowest row, a difference of 0 is marked by a vertical dotted line annotated with the amount of the posterior distribution that falls below or above 0. The limits of the ROPE are marked with vertical dotted lines and annotated with the amount of the posterior distribution that falls inside it. The subscripts such as "pp=2" indicate arbitrary indexical values for the primary positions, such as 1 for pitcher, 2 for catcher, and so forth.

difference depends on our decision rule. If we use a ROPE from −0.01 to +0.01, as shown in Figure 13.2, then we would not reject a difference of zero because the 95% HDI overlaps the ROPE. The choice of ROPE depends on what is practically equivalent to zero as judged by aficionados of baseball. Our choice of ROPE shown here is merely for illustration.

In Figure 13.2, the triangle on the x-axis indicates the ratio in the data of total hits divided by total at-bats for all players in that position. Notice that the modes of the posterior are not centered exactly on the triangles. Instead, the modal estimates are shrunken toward the middle

between the pitchers (who tend to have the lowest batting averages) and the catchers (who tend to have higher batting averages). Thus, the modes of the posterior marginal distributions are not as extreme as the proportions in the data (marked by the triangles). This shrinkage is produced by the mutual influence of data from all the other players, because they influence the higher-level distributions, which in turn influence the lower-level estimates. For example, the modal estimate for catchers is 0.241, which is less than the ratio of total hits to total at-bats for catchers. This shrinkage in the estimate for catchers is caused by the fact that there are 324 pitchers who, as a group, have relatively low batting


Box 3 Decision Rules for Bayesian Posterior Distribution

The posterior distribution can be used for making decisions about the viability of specific parameter values. In particular, people might be interested in a landmark value of a parameter, or a difference of parameters. For example, we might want to know whether a particular position's batting ability exceeds 0.20, say. Or we might want to know whether two positions' batting abilities have a non-zero difference.

The decision rule involves using a region of practical equivalence (ROPE) around the null or landmark value. Values within the ROPE are equivalent to the landmark value for practical purposes. For example, we might declare that for batting abilities, a difference less than 0.04 is practically equivalent to zero. To decide that two positions have credibly different batting abilities, we check that the 95% HDI excludes the entire ROPE around zero. Using a ROPE also allows accepting a difference of zero: If the entire 95% HDI falls within the ROPE, it means that all the most credible values are practically equivalent to zero (i.e., the null value), and we decide to accept the null value for practical purposes. If the 95% HDI overlaps the ROPE, we withhold decision. Note that it is only the landmark value that is being rejected or accepted, not all the values inside the ROPE. Furthermore, the estimate of the parameter value is given by the posterior distribution, whereas the decision rule merely declares whether the parameter value is practically equivalent to the landmark value. We will illustrate use of the decision rule in the results from the actual analyses. In some cases we will not explicitly specify a ROPE, leaving some nonzero width ROPE implicit. In general, this allows flexibility in decision-making when limits of practical equivalence may change as competing theories and instrumentation change (Serlin & Lapsley, 1993). In some cases, the posterior distribution falls so far away from any reasonable ROPE that it is superfluous to specify a specific ROPE. For more information about the application of a ROPE, under somewhat different terms of "range of equivalence," "indifference zone," and "good-enough belt," see e.g., Carlin and Louis (2009); Freedman, Lowe, and Macaskill (1984); Hobbs and Carlin (2008); Serlin and

Lapsley (1985, 1993); Spiegelhalter, Freedman, and Parmar (1994).

Notice that the decision rule is distinct from the Bayesian estimation itself, which produces the complete posterior distribution. We are using a decision rule only in case we demand a discrete decision from the continuous posterior distribution. There is another Bayesian approach to making decisions about null values that is based on comparing a "spike" prior on the landmark value against a diffuse prior, which we discuss in the final section on model comparison, but for the purposes of this chapter we focus on using the HDI with ROPE.
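The HDI-with-ROPE rule reduces to a few comparisons in code. The sketch below encodes the three outcomes described in this box; the ROPE limits of ±0.04 and the HDI values in the usage lines are illustrative, taken from the batting-ability comparisons reported in this chapter.

```python
def hdi_rope_decision(hdi_low, hdi_high, rope=(-0.04, 0.04)):
    """HDI-with-ROPE rule from Box 3 (ROPE limits here are illustrative)."""
    if hdi_low > rope[1] or hdi_high < rope[0]:
        return "reject the null value"    # HDI excludes the entire ROPE
    if rope[0] <= hdi_low and hdi_high <= rope[1]:
        return "accept the null value for practical purposes"
    return "withhold decision"            # HDI and ROPE partially overlap

# HDI limits taken from comparisons reported in this chapter:
print(hdi_rope_decision(0.0976, 0.125))   # catcher vs. pitcher -> reject
print(hdi_rope_decision(-0.0398, 0.039))  # Choo vs. Suzuki -> accept
```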

ability, and pull down the overarching estimate of batting ability for major-league players (even with the other seven positions taken into account). The overarching estimate in turn affects the estimate of all positions, and, in particular, pulls down the estimate of batting ability for catchers. We see in the upper right of Figure 13.2 that the estimate of batting ability for right fielders is also shrunken, but not as much as for catchers. This is because the right fielders tend to be at bat much more often than the catchers, and, therefore, the estimate of ability for right fielders more closely matches their data proportions. In the next section we examine results for individual players, and the concepts of shrinkage will become more dramatic and more clear.

comparisons of individual players

In this section we consider estimates of the

batting abilities of individual players. The left side of Figure 13.3 shows a comparison of two individual players with the same record, 1 hit in 3 at-bats, but who play different positions, namely catcher and pitcher. Notice that the triangles are at the same place on the x-axes for the two players, but there are radically different estimates of their probability of getting a hit because of the different positions they play. The data from all the other catchers inform the model that catchers tend to have values of θ around 0.241. Because this particular catcher has so few data to inform his estimate, the estimate from the higher-level distribution dominates. The same is true for the pitcher, but the higher-level distribution says that pitchers tend to have values of θ around 0.130. The resulting distribution of differences, in the lowest panel, suggests that these two players have


[Figure 13.3 appears here. Left column: Tim Federowicz (Catcher), 1 hit/3 at-bats (θ263), mode = 0.241, 95% HDI [0.191, 0.297]; Casey Coleman (Pitcher), 1 hit/3 at-bats (θ169), mode = 0.132, 95% HDI [0.0905, 0.176]; Difference (θ263 − θ169), mode = 0.111, 95% HDI [0.0419, 0.178], 0.1% < 0 < 99.9%, 2% in ROPE. Right column: Mike Leake (Pitcher), 18 hits/61 at-bats (θ494), mode = 0.157, 95% HDI [0.119, 0.209]; Wandy Rodriguez (Pitcher), 4 hits/61 at-bats (θ754), mode = 0.112, 95% HDI [0.0825, 0.156]; Difference (θ494 − θ754), mode = 0.0365, 95% HDI [−0.0151, 0.105], 5.6% < 0 < 94.4%, 46% in ROPE.]

Fig. 13.3 Comparison of estimated batting abilities of different individual players. The left column shows two players with the same actual records of 1 hit in 3 at-bats, but very different estimates of batting ability because they play different positions. The right column shows two players with rather different actual records (18/61 and 4/61) but similar estimates of batting ability because they play the same position. Triangles show actual ratios of hits/at-bats. Bottom histograms display an arbitrary ROPE from −0.04 to +0.04; different decision makers might use a different ROPE. The subscripts on θ indicate arbitrary identification numbers of different players, such as 263 for Tim Federowicz.

credibly different hitting abilities, even though their actual hits and at-bats are identical. In other words, because we know the players play these particular different positions, we can infer that they probably have different hitting abilities.

The right side of Figure 13.3 shows another comparison of two individual players, both of whom are pitchers, with seemingly quite different batting averages of 18/61 and 4/61, as marked by the triangles on the x-axis. Despite the players' different hitting records, the posterior estimates of their hitting probabilities are not very different. Notice the dramatic shrinkage of the estimates toward the mode of players who are pitchers.

Indeed, in the lower panel, we see that a difference of zero is credible, as it falls within the 95% HDI of the differences. The shrinkage is produced because there is a huge amount of data, from 324 pitchers, informing the position-level distribution about the hitting ability of pitchers. Therefore, the estimates of two individual pitchers with only modest numbers of at-bats are strongly shrunken toward the group-level mode. In other words, because we know that the players are both pitchers, we can infer that they probably have similar hitting abilities.
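The arithmetic of this shrinkage can be seen in a deliberately simplified conditional calculation: if the pitcher-level beta distribution were held fixed at some mean μ and precision κ, each pitcher's conditional posterior would be beta(H + μκ, AB − H + (1 − μ)κ), so both estimates are pulled toward μ. The sketch below uses the illustrative values μ = 0.13 and κ = 100; the full model estimates these quantities rather than fixing them, which is why the chapter's modal estimates of 0.157 and 0.112 differ somewhat from the output here.

```python
from scipy import stats

# Simplified illustration of shrinkage: hold the pitcher-level beta fixed
# (mu and kappa below are illustrative, not the chapter's estimates). Each
# pitcher's conditional posterior is beta(H + mu*kappa, AB - H + (1-mu)*kappa).
mu, kappa = 0.13, 100.0

for label, H, AB in [("pitcher with 18/61", 18, 61),
                     ("pitcher with 4/61", 4, 61)]:
    post = stats.beta(H + mu * kappa, AB - H + (1 - mu) * kappa)
    print(f"{label}: raw ratio = {H / AB:.3f}, shrunken mean = {post.mean():.3f}")
```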

The amount of shrinkage depends on the amount of data. This is illustrated in Figure 13.4,


[Figure 13.4 appears here. Left column: Andrew McCutchen (Center Field), 194 hits/593 at-bats (θ573), mode = 0.304, 95% HDI [0.274, 0.335]; Brett Jackson (Center Field), 21 hits/120 at-bats (θ428), mode = 0.233, 95% HDI [0.194, 0.278]; Difference (θ573 − θ428), mode = 0.0643, 95% HDI [0.0171, 0.122], 0.5% < 0 < 99.5%, 14% in ROPE. Right column: Shin-Soo Choo (Right Field), 169 hits/598 at-bats (θ159), mode = 0.276, 95% HDI [0.246, 0.302]; Ichiro Suzuki (Right Field), 178 hits/629 at-bats (θ844), mode = 0.275, 95% HDI [0.248, 0.304]; Difference (θ159 − θ844), mode = −0.00212, 95% HDI [−0.0398, 0.039], 50.8% < 0 < 49.2%, 95% in ROPE.]

Fig. 13.4 The left column shows two individuals with rather different actual batting ratios (194/593 and 21/120) who both play center field. Although there is notable shrinkage produced by playing the same position, the quantity of data is sufficient to exclude a difference of zero from the 95% HDI on the difference (lower histogram); although the HDI overlaps the arbitrary ROPE shown here, different decision makers might use a different ROPE. The right column shows two right fielders with very high and nearly identical actual batting ratios. The 95% HDI of their difference falls within the ROPE in the lower right histogram. Note: Triangles show actual batting ratio of hits/at-bats.

which shows comparisons of players from the same position, but for whom there are much more personal data from more at-bats. In these cases, although there is some shrinkage caused by position-level information, the amount of shrinkage is not as strong because the additional individual data keep the estimates anchored closer to the data.

The left side of Figure 13.4 shows a comparison of two center fielders with 593 and 120 at-bats, respectively. Notice that the shrinkage of the estimate for the player with 593 at-bats is not as extreme as for the player with 120 at-bats. Notice also that the width of the 95% HDI for the player with 593 at-bats is narrower than for the player with 120

at-bats. This again illustrates the concept that the estimate is informed by both the data from the individual player and by the data from all the other players, especially those who play the same position. The lower left panel of Figure 13.4 shows that the estimated difference excludes zero (but still overlaps the particular ROPE used here).

The right side of Figure 13.4 shows right fielders with huge numbers of at-bats and nearly the same batting average. The 95% HDI of the difference falls almost entirely within the ROPE, so we might decide to declare that the players have identical probability of getting a hit for practical purposes, that is, we might decide to accept the null value of zero difference.


Shrinkage and Multiple Comparisons

In hierarchical models with multiple levels, there

is shrinkage of estimates within each level. In the model of this section (Figure 13.1), there was shrinkage of the player-position parameters toward the overall central tendency, as illustrated by the pitcher and catcher distributions in Figure 13.2, and there was shrinkage of the individual-player parameters within each position toward the position central tendency, as shown by various examples in Figures 13.3 and 13.4. The model also provided some strong inferences about player abilities based on position alone, as illustrated by the estimates for individual players with few at-bats in the left column of Figure 13.3.

There were no corrections for multiple comparisons. We conducted all the comparisons without computing p values, and without worrying whether we might intend to make additional comparisons in the future, which is quite likely given that there are 9 positions and 948 players in whom we might be interested.

It is important to be clear that Bayesian methods do not prevent false alarms. False alarms are caused by accidental conspiracies of rogue data that happen to be unrepresentative of the true population, and no analysis method can fully mitigate false conclusions from unrepresentative data. There are two main points to be made with regard to false alarms in multiple comparisons from a Bayesian perspective.

First, the Bayesian method produces a posterior distribution that is fixed, given the data. The posterior distribution does not depend on which comparisons are intended by the analyst, unlike traditional frequentist methods. Our decision rule, using the HDI and ROPE, is based on the posterior distribution, not on a false alarm rate inferred from a null hypothesis and an intended sampling/testing procedure.

Second, false alarms are mitigated by shrinkage in hierarchical models (as exemplified in the right column of Figure 13.3). Because of shrinkage, it takes more data to produce a credible difference between parameter values. Shrinkage is a rational, mathematical consequence of the hierarchical model structure (which expresses our prior knowledge of how parameters are related) and the actually observed data. Shrinkage is not related in any way to corrections for multiple comparisons, which do not depend on the observed data but do depend on the intended comparisons. Hierarchical modeling is possible with non-Bayesian estimation,

but frequentist decisions are based on auxiliary sampling distributions instead of the posterior distribution.

Example: Clinical Individual Differences in Attention Allocation

Hierarchical Bayesian estimation can be applied straightforwardly to more elaborate models, such as information processing models typically used in cognitive science. Generally, such models formally describe the processes underlying behavior in tasks such as thinking, remembering, perceiving, deciding, learning, and so on. Cognitive models are increasingly finding practical uses in a wide variety of areas outside cognitive science. One of the most promising uses of cognitive process models is the field of cognitive psychometrics (Batchelder, 1998; Riefer, Knapp, Batchelder, Bamber, & Manifold, 2002; Vanpaemel, 2009), where cognitive process models are used as psychometric measurement models. These models have become important tools for quantitative clinical cognitive science (see Neufeld, chapter 16, this volume).

In our second example of hierarchical Bayesian estimation, we use data from a classification task and a corresponding cognitive model to assess young women's attention to other women's body size and facial affect, following the research of Viken, Treat, Nosofsky, McFall, and Palmeri (2002). Rather than relying on self-reports, Viken et al. (2002) collected performance data in a prototype-classification task involving photographs of women varying in body size and facial affect. Furthermore, rather than using generic statistical models for data analysis, the researchers applied a computational model of category learning designed to describe underlying psychological properties. The model, known as the multiplicative prototype model (MPM; Nosofsky, 1987; Reed, 1972), has parameters that describe how much perceptual attention is allocated to body size or facial affect. The modeling made it possible to assess how participants in the task allocated their attention.

To measure attention allocation, Viken et al. (2002) tapped into women's perceived similarities of photographs of other women. The women in the photographs varied in their facial expressions of affect (happy to sad) and in their body size (light to heavy). We focus here on a particular categorization task in which the observer had to classify a target photo as belonging with reference photo X or with reference photo Y. In one version



Fig. 13.5 The perceptual space for photographs of women who vary on body size (horizontal axis) and affect (vertical axis). Photo X shows a prototypical light, happy woman and photo Y shows a prototypical heavy, sad woman. The test photo, t, is categorized with X or Y according to its relative perceptual proximity to those prototypes. In the left panel, attention to body size (denoted w in the text) is low, resulting in compression of the body size axis, and, therefore, test photo t tends to be classified with prototype X. In the right panel, attention to body size is high, resulting in expansion of the body size axis, and, therefore, test photo t tends to be classified with prototype Y.

of the experiment, reference photo X was of a light, happy woman and reference photo Y was of a heavy, sad woman. In another version, not discussed here, the features of the reference photos were reversed. Suppose the target photo t showed a heavy, happy woman. If the observer was paying attention mostly to affect, then photo t should tend to be classified with reference photo X, which matched on affect. If the observer was paying attention mostly to body size, then photo t should tend to be classified with reference photo Y, which matched on body size. A schematic representation of the perceptual space for photographs is shown in Figure 13.5. In the actual experiment, there were many different target photos from throughout the perceptual space. By recording how each target photo was categorized by the observer, the observer's attention allocation can be inferred.

Viken et al. (2002) were interested in whether women suffering from the eating disorder bulimia allocated their attention differently than normal women. Bulimia is characterized by bouts of overconsumption of food with a feeling of loss of control, followed by self-induced vomiting or abuse of laxatives to prevent weight gain. The researchers were specifically interested in how bulimics allocated their attention to other women's facial affect and body size, because perception of body size has been the focus of past research into eating disorders, and facial affect is relevant to social perception but is not specifically implicated in eating disorders. An understanding of how bulimics allocate attention

could have implications for both the etiology and treatment of the disease.

Viken et al. (2002) collected data from a group of women who were high in bulimic symptoms, and from a group that was low. Viken et al. then used likelihood-ratio tests to compare a model that used separate attention weights in each group to a model that used a single attention weight for both groups. Their model-comparison approach revealed that high-symptom women, relative to low-symptom women, display enhanced attention to body size and decreased attention to facial affect.

In contrast to their non-Bayesian, nonhierarchical, nonestimation approach, we use a Bayesian hierarchical estimation approach to investigate the same issue. The hierarchical nature of our approach means that we do not assume that all subjects within a symptom group have the same attention to body size. Bayesian inference and decision making imply that we do not require the assumptions about sampling intentions and multiple tests that are required for computing p values. Moreover, our use of estimation instead of only model comparison ensures that we will know how much the groups differ.

The Data

Viken et al. (2002) obtained classification judgments from 38 women on 22 pictures of other women, varying in body size (light to heavy) and facial affect (happy to sad). Symptoms of bulimia were also measured for all of the women. Eighteen of these women had BULIT scores exceeding 88, which is considered to be high in bulimic symptoms (Smith & Thelen, 1984). The remaining 20 women had BULIT scores lower than 45, which is considered to be low in bulimic symptoms. Each woman performed the classification task described earlier, in which she was instructed to freely classify each target photo t as one of two types of women exemplified by reference photo X and reference photo Y. No corrective feedback was provided. Each target photo was presented twice; hence, for each woman i, the data include the frequency of classifying stimulus t as a type X, ranging between 0 and 2. Our goal is to use these data to infer a meaningful measure of attention allocation for each individual observer, and simultaneously to infer an overall measure of attention allocation for women high in bulimic symptoms and for women low in bulimic symptoms.


We will rely on a hierarchical extension of the MPM, as described next.

The Descriptive Model with Its Meaningful Parameters

Models of categorization take perceptual stimuli as input and generate precise probabilities of category assignments as output. The input stimuli must be represented formally, and many leading categorization models assume that stimuli can be represented as points in a multidimensional space, as was suggested in Figure 13.5. Importantly, the models assume that attention plays a key role in categorization, and formalize the attention allocated to perceptual dimensions as free parameters (for a review see, e.g., Kruschke, 2008). In particular, the MPM (Nosofsky, 1987) determines the similarity between a target item and a reference item by multiplicatively weighting the separation of the items on each dimension by the corresponding attention allocated to each dimension. The higher the similarity of a stimulus to a reference category prototype, relative to other category prototypes, the higher the probability of assigning the stimulus to the reference category.

For each trial in which a target photo t is presented with reference photos X and Y, the MPM produces the probability, $p_i(X|t)$, that the $i$th observer classifies stimulus t as category X. This probability depends on two free parameters. One parameter is denoted $w_i$, which indicates the attention that the $i$th observer pays to body size. The value of $w_i$ can range from 0 to 1. Attention to affect is simply $1-w_i$. The second parameter is denoted $c_i$ and called the "sensitivity" of observer i. The sensitivity can be thought of as the observer's decisiveness, which is how strongly the observer converts a small similarity advantage for X into a large choice advantage for X. Note that attention and sensitivity parameters can differ across observers, but not across stimuli, which are assumed to have fixed locations in an underlying perceptual space.

Formally, the MPM posits that the probability that photo t will be classified with reference photo X instead of reference photo Y is determined by the similarity of t to X relative to the total similarity:
$$p_i(X|t) = s_{tX}\,/\,(s_{tX} + s_{tY}). \tag{3}$$
The similarity between target and reference is, in turn, determined as a nonlinearly decreasing function of the distance between t and X, $d_{tX}$, in the psychological space:
$$s_{tX} = \exp(-c_i\, d_{tX}) \tag{4}$$
where $c_i > 0$ is the sensitivity parameter for observer i. The psychological distance between target t and reference X is given by the weighted distance between the corresponding points in the 2-dimensional psychological space:
$$d_{tX} = \left[\, w_i\, |x_{tb} - x_{Xb}|^2 + (1-w_i)\, |x_{ta} - x_{Xa}|^2 \,\right]^{1/2}, \tag{5}$$
where $x_{ta}$ denotes the position of the target on the affect dimension, and $x_{tb}$ denotes the position of the target on the body-size dimension. These positions are normative average ratings of the photographs on two 10-point scales: body size (1 = underweight, 10 = overweight), and affect (1 = unhappy, 10 = happy), as provided by a separate sample of young women. The free parameter $0 < w_i < 1$ corresponds to the attention weight on the body size dimension for observer i. It reflects the key assumption of the MPM that the structure of the psychological space is systematically modified by selective attention (see Figure 13.5).

hierarchical structure

We construct a hierarchical model that has parameters to describe each individual, and parameters to describe the overall tendencies of the bulimic and normal groups. The hierarchy is analogous to the baseball example discussed earlier: Just as individual players were nested within fielding positions, here individual observers are nested within bulimic-symptom groups. (One difference, however, is that we do not build an overarching distribution across bulimic-symptom groups because there are only two groups.) With this hierarchy, we express our prior expectation that bulimic women are similar but not identical to each other, and nonbulimic women are similar but not identical to each other, but the two groups may be different.

The hierarchical model allows the parameter estimates for an individual observer to be rationally influenced by the data from other individuals within their symptom group. In our model, the individual attention weights are assumed to come from an overarching distribution that is characterized by a measure of central tendency and of dispersion. The overarching distributions for the high-symptom and low-symptom groups are estimated separately.


As the attention weights $w_i$ are constrained to range between 0 and 1, we assume the parent distribution for the $w_i$'s is a beta distribution, parameterized by mean $\mu_w^{[g]}$ and precision $\kappa_w^{[g]}$, where $[g]$ indexes the group membership (i.e., high symptom or low symptom). The individual sensitivities, $c_i$, are also assumed to come from an overarching distribution. Since the sensitivities are non-negative, a gamma distribution is a convenient parent distribution, parameterized by mode $mo_c^{[g]}$ and standard deviation $\sigma_c^{[g]}$, where $[g]$ again indicates the group membership. The group-level parameters (i.e., $\mu_w^{[g]}$, $mo_c^{[g]}$, $\kappa_w^{[g]}$, and $\sigma_c^{[g]}$) are assumed to come from vague, noncommittal uniform distributions. There are 84 parameters altogether, including $w_i$ and $c_i$ for 38 observers and the 8 group-level parameters. Figure 13.6 summarizes the hierarchical model in an integrated diagram. The caption provides details.
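When implementing this model, the mean-precision and mode-standard-deviation parameterizations must be converted to the shape parameters that standard beta and gamma functions expect. A small sketch of the conversions (the beta conversion is the one given in the caption of Figure 13.6; the gamma conversion follows Kruschke, 2015, Section 9.2.2):

```python
import numpy as np

def beta_shapes_from_mean_precision(mu, kappa):
    """Beta shape parameters from mean mu (0 < mu < 1) and precision kappa > 0."""
    a = mu * kappa
    b = (1.0 - mu) * kappa
    return a, b

def gamma_shape_rate_from_mode_sd(mode, sd):
    """Gamma shape and rate from mode > 0 and standard deviation sd > 0."""
    rate = (mode + np.sqrt(mode**2 + 4.0 * sd**2)) / (2.0 * sd**2)
    shape = 1.0 + mode * rate
    return shape, rate

# Sanity check: for a gamma, mode = (shape - 1)/rate and sd = sqrt(shape)/rate.
shape, rate = gamma_shape_rate_from_mode_sd(1.5, 0.5)
print((shape - 1.0) / rate, np.sqrt(shape) / rate)  # ~1.5, ~0.5
```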

The parameters of most interest are the group-level attention to body size, $\mu_w^{[g]}$, for $g \in \{\text{low}, \text{high}\}$. Other meaningful questions could focus on the relative variability among groups in attention, which would be addressed by considering the $\kappa_w^{[g]}$ parameters, but we will not pursue these here.

Results: Interpreting the Posterior Distribution

The Bayesian hierarchical approach to estimation yields attention weights for each observer, informed by all the other observers in the group. At the same time, it provides an estimate of the attention weight at the group level. Further, for every individual estimate and the group-level estimates, a measure of uncertainty is provided, in the form of a credible interval (95% HDI), which can be used as part of a decision rule to decide whether or not there are credible differences between individuals or between groups.

The MCMC process used 3 chains with a total of 100,000 steps after a burn-in of 4,000 steps. It produced a smooth (converged) representation of the 84-dimensional posterior distribution. We use the MCMC sample as an accurate and high-resolution representation of the posterior distribution.
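As an aside on mechanics, a 95% HDI can be computed directly from an MCMC sample of a unimodal marginal by finding the shortest interval that contains 95% of the sampled values. A minimal helper, assuming nothing beyond NumPy (our sketch, not a particular package's function):

```python
import numpy as np

def hdi_from_sample(samples, cred_mass=0.95):
    """Shortest interval containing cred_mass of the sample values.

    Assumes a unimodal distribution; scans all candidate intervals of
    the required mass and returns the narrowest one.
    """
    sorted_vals = np.sort(np.asarray(samples))
    n = len(sorted_vals)
    n_in = int(np.ceil(cred_mass * n))           # points inside the interval
    widths = sorted_vals[n_in - 1:] - sorted_vals[:n - n_in + 1]
    i = np.argmin(widths)                        # narrowest candidate
    return sorted_vals[i], sorted_vals[i + n_in - 1]

# Example with a skewed sample: the HDI is shorter than the
# equal-tailed 95% interval.
rng = np.random.default_rng(1)
draws = rng.gamma(shape=2.0, scale=1.0, size=100_000)
print(hdi_from_sample(draws))
```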

check of robustness against changes in top-level prior constants

We conducted a sensitivity analysis by using different constants in the top-level uniform distributions, to check whether they had any notable influence on the resulting posterior distribution. Whether all uniform distributions assumed an upper bound of 10 or 50, the results were essentially identical. The results reported here are for an upper bound of 10.

Fig. 13.6 The hierarchical model for attention allocation. At the bottom of the diagram, the classification data are denoted as $X_{t|i} = 1$ if observer i says "X" to target t, and $X_{t|i} = 0$ otherwise. The responses come from a Bernoulli distribution that has its success parameter determined by the MPM, as defined in Eqs. 3, 4, and 5 in the main text. The ellipsis on the arrow pointing to the response indicates that this relation holds for all targets within every individual. Scanning up the diagram, the individual attention parameters, $w_i$, come from an overarching group-level beta distribution that has mean $\mu_w$ and concentration $\kappa_w$ (hence shape parameters of $a_w = \mu_w \kappa_w$ and $b_w = (1-\mu_w)\kappa_w$, as was indicated explicitly for the beta distributions in Figure 13.1). The individual sensitivity parameters $c_i$ come from an overarching group-level gamma distribution that has mode $mo_c$ and standard deviation $\sigma_c$ (with shape and rate parameters that are algebraic combinations of $mo_c$ and $\sigma_c$; see Kruschke, 2015, Section 9.2.2). The group-level parameters all come from noncommittal, broad uniform distributions. This model is applied separately to the high-symptom and low-symptom observers.

comparison across groups of attention to body size

Figure 13.7 shows the marginal posterior distribution for the group-level parameters of most interest. The left side shows the distribution of the central tendency of attention to body size for each group as well as the distribution of their difference. In particular, the bottom-left histogram shows that the low-symptom group has an attention weight on body size about 0.36 lower than the high-symptom group, and this difference is credibly nonzero. The right side shows that the most credible difference of sensitivities is near zero.


Fig. 13.7 Marginal posterior distribution of group-level parameters for the prototype classification task. The left column shows the group-level central tendency of the attention weight on body size, $\mu_w^{[g]}$. The bottom-left histogram reveals a credibly nonzero difference between groups, with low-symptom observers allocating about 0.36 less attention to body size than high-symptom observers. The 95% HDI is so far away from a difference of zero that any reasonable ROPE would be excluded; therefore, we do not specify a particular ROPE. The right column shows the group-level central tendency of the sensitivity parameter, $mo_c^{[g]}$. The bottom-right histogram shows that zero difference is squarely among the most credible differences.


The conclusions from our hierarchical Bayesian estimation agree with those of Viken et al. (2002), who took a non-Bayesian, nonhierarchical, model-comparison approach. We also find that high-symptom women, relative to low-symptom women, show enhanced attention to body size and decreased attention to facial affect, but no differences in their sensitivities. However, our hierarchical Bayesian estimation approach has provided explicit distributions on the credible differences between the groups.

comparisons across individual women's attention to body size

Although the primary question of interest involves the group-level central tendencies, hierarchical Bayesian estimation also automatically provides estimates of the attention weights of individual women. Figure 13.8 shows the estimates of individual attention weights $w_i$ for three women, based on the hierarchical Bayesian estimation that shares information across all observers to inform the estimate of each individual observer. Figure 13.8 also shows the individual estimates from a nonhierarchical MLE, which derives each individual estimate from the data of a single observer only.


Fig. 13.8 Posterior of attention weights $w_i$ of three individual observers. The vertical mark on the HDI indicates the MLE of the attention weight based on the individual's data only. Observer 33 is a high-symptom woman, whose estimate is shrunk upward (toward one). Observers 16 and 12 are both low-symptom women, whose estimates are shrunk in different directions (upward for 16, downward for 12).

Figure 13.8 illustrates that in hierarchical models, data from one individual influence inferences about the parameters of the other individuals. Technically, this happens because each woman's data influence the group-level parameters, which affect all the individual-level parameter estimates. For example, the hierarchical Bayesian modal estimate of the attention weight for observer 33, a high-symptom woman, is 1, which is larger than the nonhierarchical MLE of 0.89. This shrinkage in the hierarchical estimate is caused by the fact that most other high-symptom women tend to have relatively high attention weights, thereby pulling up the group-level estimate of the attention weight and the estimates for each individual high-symptom woman. Shrinkage also occurs for the low-symptom women, shown in the other panels of Figure 13.8. The second panel shows that for observer 16, the hierarchical Bayesian modal estimate of the attention weight is 1, which is higher than the nonhierarchical estimate of 0.93. For observer 12, however, shrinkage is in the opposite direction: the hierarchical Bayesian modal estimate of the attention weight is smaller than the MLE based on individual data (0 vs. 0.07). These opposite directions in shrinkage of the estimate are caused by the fact that the overarching beta distribution for low-symptom women is bimodal (i.e., has shape parameters less than 1.0), with one mode near 0 and a second mode near 1, indicating that low-symptom women tend to have either a low attention weight or a high attention weight. This bimodality is evident in the data and is not merely an artifact of the model, insofar as many women classify as if paying most attention to either one dimension or the other. Women with MLEs close to 0 have hierarchical Bayesian estimates even closer to 0, whereas women with MLEs close to 1 have hierarchical Bayesian estimates even closer to 1. Shrinkage for the low-symptom women thus highlights that shrinkage is not necessarily inward, toward the middle of the higher-level distribution; it can also be outward, and always toward the modes.
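This outward shrinkage can be illustrated with a toy grid approximation. The sketch below is not the chapter's model: it fixes a hypothetical U-shaped group-level distribution, Beta(0.5, 0.5), and uses a plain binomial likelihood in place of the MPM, but it shows the posterior mode for one observer being pulled past the MLE, toward the nearer mode of the overarching distribution:

```python
import numpy as np
from scipy.stats import beta, binom

# Hypothetical U-shaped group-level distribution (shape parameters < 1),
# standing in for the estimated overarching beta for low-symptom women.
w_grid = np.linspace(0.001, 0.999, 9999)
prior = beta.pdf(w_grid, 0.5, 0.5)               # modes at 0 and 1

# Made-up data for one observer; binomial likelihood used for simplicity
# (in the real model, w enters the likelihood through the MPM instead).
k, n = 7, 10                                     # MLE = k/n = 0.70
posterior = prior * binom.pmf(k, n, w_grid)
posterior /= posterior.sum()                     # normalize over the grid

print("MLE:           ", k / n)                           # 0.70
print("posterior mode:", w_grid[np.argmax(posterior)])    # ~0.72, pulled toward 1
```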

Model Comparison as a Case of Estimation in Hierarchical Models

In the examples discussed earlier, Bayesian estimation was the reallocation of credibility across the space of parameter values, for continuous parameters. We can think of each distinct parameter value (or joint combination of values in a multiparameter space) as a distinct model of the data. Because the parameter values are on a continuum, there is a continuum of models. Under this conceptualization, Bayesian parameter estimation is model comparison for an infinity of models.

Often, however, people may think of different models as being distinct, discrete descriptions, not on a continuum. This conceptualization of models makes little difference from a Bayesian perspective. When models are discrete, there is still a parameter that relates them to each other, namely an indexical parameter that has value 1 for the first model, 2 for the second model, and so on. Bayesian model comparison is then Bayesian estimation, as the reallocation of credibility across the values of the indexical parameter. The posterior probabilities of the models are simply the posterior probabilities of the indexical parameter values. Bayesian inference operates, mathematically, the same way regardless of whether the parameter that relates models is continuous or discrete.
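In symbols (our notation, consistent with Figure 13.9 below), Bayes' rule applied to the indexical parameter reallocates credibility across models according to each model's marginal likelihood:

$$
P(m \mid D) = \frac{p(D \mid m)\, P(m)}{\sum_{m'} p(D \mid m')\, P(m')},
\qquad
p(D \mid m{=}1) = \int d\theta \; p(D \mid \theta, m{=}1)\, p(\theta \mid m{=}1),
$$

and analogously for $m = 2$ with parameter $\phi$. The integral shows why the prior within each model matters so much in model comparison: the marginal likelihood averages each model's likelihood over that model's prior.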

Figure 13.9 shows a hierarchical diagram for comparing two models.


Fig. 13.9 Model comparison as hierarchical modeling. Each dashed box encloses a discrete model of the data, and the models depend on a higher-level indexical parameter at the top of the diagram. See text for further details.

At the bottom of the diagram are the data, D. Scanning up the diagram, the data are distributed according to the likelihood function $p(D|\theta, m{=}1)$ when the model index m is 1. The likelihood function for model 1 involves a parameter $\theta$, which has a prior distribution specified by $p(\theta|m{=}1)$. All the terms involving the parameter $\theta$ are enclosed in a dashed box, which indicates the part of the overall hierarchy that depends on the higher-level indexical parameter, m, having value m = 1. Notice, in particular, that the prior on the parameter $\theta$ is an essential component of the model; that is, the model is not only the likelihood function but also the prior. When m = 2, the data are distributed according to the model on the right of the diagram, involving a likelihood function and prior with parameter $\phi$. At the top of the hierarchy is a categorical distribution that specifies the prior probability of each indexical value of m, that is, the prior probability of each model as a discrete entity. This hierarchical diagram is analogous to the previous hierarchical diagrams in Figures 13.1 and 13.6, but the top-level distribution is discrete, and lower-level parameters and structures can change discretely instead of continuously when the top-level parameter value changes.

The sort of hierarchical structure diagrammed in Figure 13.9 can be implemented in the same MCMC sampling software we used for the baseball and categorization examples earlier. The MCMC algorithm generates representative values of the indexical parameter m, together with representative values of the parameter $\theta$ (when m = 1) and the parameter $\phi$ (when m = 2). The posterior probability of each model is approximated accurately by the proportion of steps that the MCMC chain visited each value of m. For a hands-on introduction to MCMC methods for Bayesian model comparison, see Chapter 10 of Kruschke (2015) and Lodewyckx et al. (2011). Examples of Bayesian model comparison are also provided by Vandekerckhove, Matzke, and Wagenmakers in chapter 14, this volume.
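The bookkeeping for the final step is trivial. Given a chain of sampled model indices (fabricated below purely for illustration; a real chain would come from a transdimensional sampler in software such as JAGS), the posterior model probabilities are just visit proportions:

```python
import numpy as np

# Stand-in for a chain of model-index values from a transdimensional
# MCMC run; here we simply fabricate one with known visit proportions.
rng = np.random.default_rng(0)
m_chain = rng.choice([1, 2], size=50_000, p=[0.8, 0.2])

# Posterior probability of each model = proportion of MCMC visits.
for m in (1, 2):
    print(f"p(m = {m} | D)  ~=  {np.mean(m_chain == m):.3f}")
```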

When comparing models, it is crucially important to set the prior distributions within each model appropriately, because the estimation of the model index can be very sensitive to the choice of prior. In the context of Figure 13.9, we mean that it is crucial to set the prior distributions, $p(\theta|m{=}1)$ and $p(\phi|m{=}2)$, so that they accurately express the priors intended for each model. Otherwise it is trivially easy to favor one model over the other, perhaps inadvertently, by setting one prior to values that accommodate the data well while setting the other prior to values that do not accommodate the data well. If each model comes with a theory or previous research that specifically informs the model, then that theory or research should be used to set the prior for the model. Otherwise, the use of generic default priors can unwittingly favor one model over the other. When there are not strong theories or previous research to set the priors for each model, a useful approach for setting priors is as follows: Start each model with vague default priors. Then, using some modest amount of data that represent consensually accepted previous findings, update all models with those data. The resulting posterior distributions in each model are then used as the priors for the model comparison, using the new data. The priors, by being modestly informed, have mitigated the arbitrary influence of inappropriate default priors, and have put the models on a more equal playing field by informing them with the same prior data. These and other issues are discussed in the context of cognitive model comparison by Vanpaemel (2010) and Vanpaemel and Lee (2012).

A specific case of the hierarchical structure in Figure 13.9 occurs when the two models have the same likelihood function, and hence the same parameters, but different prior distributions. In this case, the model comparison is really a comparison of two competing choices of prior distribution for the parameters.


A common application of this specific case is null-hypothesis testing. The null hypothesis is expressed as a prior distribution with all its mass at a single value of the parameter, namely the "null" value, such as $\theta = 0$. If drawn graphically, the prior distribution would look like a spike-shaped distribution. The alternative hypothesis is expressed as a prior distribution that spreads credibility over a broad range of the parameter space. If drawn graphically, the alternative prior might resemble a thin (i.e., short) slab-shaped distribution. Model comparison then amounts to the posterior probabilities of the spike-shaped (null) prior and the slab-shaped (alternative) prior. This approach to null-hypothesis assessment depends crucially on the meaningfulness of the chosen alternative-hypothesis prior, because the posterior probability of the null-hypothesis prior is not absolute but merely relative to the chosen alternative-hypothesis prior. The relative probability of the null-hypothesis prior can change dramatically for different choices of the alternative-hypothesis prior. Because of this sensitivity to the alternative-hypothesis prior, we recommend that this approach to null-hypothesis assessment be used with caution, and only when it is clearly meaningful to entertain the possibility that the null value could be true and a justifiable alternative-hypothesis prior is available. In such cases, the prior-comparison approach can be very useful. However, in the absence of such meaningful priors, null-value assessment most safely proceeds by explicit estimation of parameter values within a single model, with decisions about null values made according to the HDI and ROPE as exemplified earlier. For discussion of Bayesian approaches to null-value assessment, see, for example, Kruschke (2011), Kruschke (2013, Appendix D), Morey and Rouder (2011), Wagenmakers (2007), and Wetzels, Raaijmakers, Jakab, and Wagenmakers (2009).
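The sensitivity to the alternative-hypothesis prior is easy to demonstrate in a conjugate toy example (our illustration, with made-up data). For a binomial rate $\theta$, the null "spike" puts all mass at $\theta = 0.5$, the alternative "slab" is a beta distribution, and both marginal likelihoods have closed forms; the posterior odds shift substantially as the slab is made broader or narrower:

```python
import numpy as np
from scipy.special import comb, betaln

k, n = 33, 50                 # made-up data: 33 "successes" in 50 trials
theta0 = 0.5                  # spike-shaped null prior: all mass at 0.5

def marg_lik_null():
    # p(D | null) = Binomial(k | n, theta0)
    return comb(n, k) * theta0**k * (1 - theta0)**(n - k)

def marg_lik_alt(a, b):
    # p(D | alt) = integral of Binomial(k | n, theta) * Beta(theta | a, b)
    #            = C(n, k) * B(k + a, n - k + b) / B(a, b)
    return comb(n, k) * np.exp(betaln(k + a, n - k + b) - betaln(a, b))

# With 50/50 prior model probabilities, the posterior odds equal the
# Bayes factor; the same null fares differently against each slab.
for a, b in [(1, 1), (10, 10), (100, 100)]:
    bf = marg_lik_null() / marg_lik_alt(a, b)
    print(f"Beta({a},{b}) alternative: p(null|D)/p(alt|D) = {bf:.2f}")
```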

Conclusion

In this chapter we discussed two examples of hierarchical Bayesian estimation. The baseball example (Figure 13.1) illustrated multiple levels with shrinkage of estimates within each level. We chose this example because it clearly illustrates the effects of hierarchical structure in rational inference of individual and group parameters. The categorization example (Figure 13.6) illustrated the use of hierarchical Bayesian estimation for psychometric assessment via a cognitive model. The parameters are meaningful in the context of the cognitive model, and Bayesian estimation provides a complete posterior distribution of credible parameter values for individuals and groups. Other examples of hierarchical Bayesian estimation can be found, for instance, in articles by Bartlema, Lee, Wetzels, and Vanpaemel (2014), Lee (2011), Rouder and Lu (2005), Rouder, Lu, Speckman, Sun, and Jiang (2005), and Shiffrin, Lee, Kim, and Wagenmakers (2008).

The hierarchical Bayesian method is very attractive because it allows the analyst to define meaningfully structured models that are appropriate for the data. For example, there is no artificial dilemma of deciding between doing separate individual analyses or collapsing across all individuals, which both have serious shortcomings (Cohen, Sanborn, & Shiffrin, 2008). When collapsing the data across participants in each group, it is implicitly assumed that all participants within a group behave identically. Such an assumption is often untenable. The other extreme of analyzing every individual separately with no pooling across individuals can be highly error prone, especially when each participant contributed only small amounts of data. A hierarchical analysis provides a middle ground between these two strategies, by acknowledging that people are different, without ignoring the fact that they represent a common group or condition. The hierarchical structure allows information provided by one participant to flow rationally to the estimates of other participants. This sharing of information across participants via hierarchical structure occurs in both the classification and baseball examples of this chapter.

A second key attraction of hierarchical Bayesian estimation is that software for expressing complex, nonlinear hierarchical models (e.g., Lunn et al., 2000; Plummer, 2003; Stan Development Team, 2012) produces a complete posterior distribution for direct inferences about credible parameter values, without need for p values or corrections for multiple comparisons. The combination of ease of defining specifically appropriate models and ease of direct inference from the posterior distribution makes hierarchical Bayesian estimation an extremely useful approach to modeling and data analysis.

Acknowledgments

The authors gratefully acknowledge Rick Viken and Teresa Treat for providing data from Viken, Treat, Nosofsky, McFall, and Palmeri (2002).


Appreciation is also extended to E.-J. Wagenmakers and two anonymous reviewers who provided helpful comments that improved the presentation. Correspondence can be addressed to John K. Kruschke, Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington IN 47405-7007, or via electronic mail to [email protected]. Supplementary information about Bayesian data analysis can be found at http://www.indiana.edu/∼kruschke/

Notes

1. The most general definition of a confidence interval is the range of parameter values that would not be rejected according to a criterion p value, such as p < 0.05. These limits depend on the arbitrary settings of other parameters, and can be difficult to compute.

2. Data retrieved December 22, 2012 from http://www.baseball-reference.com/leagues/MLB/2012-standard-batting.shtml

3. This analysis was summarized at http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multi-level-hierarchical.html

4. In the context of a normal distribution, instead of a beta distribution, the "precision" is the reciprocal of variance. Intuitively, it refers to the narrowness of the distribution, for either the normal or beta distributions.

Glossary

Hierarchical model: A formal model that can be expressed such that one parameter is dependent on another parameter. Many models can be meaningfully factored this way, for example when there are parameters that describe data from individuals, and the individual-level parameters depend on group-level parameters.

Highest density interval (HDI): An interval under a probability distribution such that the probability densities inside the interval are higher than the probability densities outside the interval. A 95% HDI includes the 95% of the distribution with the highest probability density.

Markov chain Monte Carlo (MCMC): A class of stochastic algorithms for obtaining samples from a probability distribution. The algorithms take a random walk through parameter space, favoring values that have higher probability. With a sufficient number of steps, the values of the parameter are visited in proportion to their probabilities, and therefore the samples can be used to approximate the distribution. Widely used examples of MCMC are the Gibbs sampler and the Metropolis-Hastings algorithm.

Posterior distribution: A probability distribution over parameters derived via Bayes' rule from the prior distribution by taking into account the targeted data.

Prior distribution: A probability distribution over parameters representing the beliefs, knowledge, or assumptions about the parameters without reference to the targeted data. The prior distribution and the likelihood function together define a model.

Region of practical equivalence (ROPE): An interval around a parameter value that is considered to be equivalent to that value for practical purposes. The ROPE is used as part of a decision rule for accepting or rejecting particular parameter values.

References

Bartlema, A., Lee, M. D., Wetzels, R., & Vanpaemel, W. (2014). A Bayesian hierarchical mixture approach to individual differences: Case studies in selective attention and representation in category learning. Journal of Mathematical Psychology, 59, 132–150.

Batchelder, W. H. (1998). Multinomial processing tree models and psychological assessment. Psychological Assessment, 10, 331–344.

Bayes, T., & Price, R. (1763). An essay towards solving a problem in the doctrine of chances. By the Late Rev. Mr. Bayes, F.R.S. Communicated by Mr. Price, in a Letter to John Canton, A.M.F.R.S. Philosophical Transactions, 53, 370–418. doi: 10.1098/rstl.1763.0053

Carlin, B. P., & Louis, T. A. (2009). Bayesian methods for data analysis (3rd ed.). Boca Raton, FL: CRC Press.

Cohen, A. L., Sanborn, A. N., & Shiffrin, R. M. (2008). Model evaluation using grouped or individual data. Psychonomic Bulletin & Review, 15, 692–712.

Denwood, M. J. (2013). runjags: An R package providing interface utilities, parallel computing methods and additional distributions for MCMC models in JAGS. Journal of Statistical Software, (in review). http://cran.r-project.org/web/packages/runjags/

Doyle, A. C. (1890). The sign of four. London, England: Spencer Blackett.

Freedman, L. S., Lowe, D., & Macaskill, P. (1984). Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.

Hobbs, B. P., & Carlin, B. P. (2008). Practical Bayesian design and analysis for drug and device clinical trials. Journal of Biopharmaceutical Statistics, 18(1), 54–80.

Kruschke, J. K. (2008). Models of categorization. In R. Sun (Ed.), The Cambridge handbook of computational psychology (pp. 267–301). New York, NY: Cambridge University Press.

Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. doi: 10.1037/a0029146

Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Waltham, MA: Academic Press/Elsevier.

Lee, M. D. (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55, 1–7.

Lodewyckx, T., Kim, W., Lee, M. D., Tuerlinckx, F., Kuppens, P., & Wagenmakers, E. J. (2011). A tutorial on Bayes factor estimation with the product space method. Journal of Mathematical Psychology, 55(5), 331–347.


Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2013). The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: CRC Press.

Lunn, D., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS — A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4), 325–337.

Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4), 406–419.

Nosofsky, R. M. (1987). Attention and learning processes in the identification and categorization of integral stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 87–108.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria.

Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 382–407.

Riefer, D. M., Knapp, B. R., Batchelder, W. H., Bamber, D., & Manifold, V. (2002). Cognitive psychometrics: Assessing storage and retrieval deficits in special populations with multinomial processing tree models. Psychological Assessment, 14, 184–201.

Rouder, J. N., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12(4), 573–604.

Rouder, J. N., Lu, J., Speckman, P., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12(2), 195–223.

Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40(1), 73–83.

Serlin, R. C., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 199–228). Hillsdale, NJ: Erlbaum.

Shiffrin, R. M., Lee, M. D., Kim, W., & Wagenmakers, E. J. (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32(8), 1248–1284.

Smith, M. C., & Thelen, M. H. (1984). Development and validation of a test for bulimia. Journal of Consulting and Clinical Psychology, 52, 863–872.

Spiegelhalter, D. J., Freedman, L. S., & Parmar, M. K. B. (1994). Bayesian approaches to randomized trials. Journal of the Royal Statistical Society. Series A, 157, 357–416.

Stan Development Team. (2012). Stan: A C++ library for probability and sampling, version 1.1. Retrieved from http://mc-stan.org/citations.html

Vanpaemel, W. (2009). BayesGCM: Software for Bayesian inference with the generalized context model. Behavior Research Methods, 41(4), 1111–1120.

Vanpaemel, W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54, 491–498.

Vanpaemel, W., & Lee, M. D. (2012). Using priors to formalize theory: Optimal attention and the generalized context model. Psychonomic Bulletin & Review, 19, 1047–1056.

Viken, R. J., Treat, T. A., Nosofsky, R. M., McFall, R. M., & Palmeri, T. J. (2002). Modeling individual differences in perceptual and attentional processes related to bulimic symptoms. Journal of Abnormal Psychology, 111, 598–609.

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E. J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review, 16(4), 752–760.
