bayesian analysis of fmri data with spatial priors · · 2005-12-20bayesian analysis of fmri data...

Bayesian Analysis of fMRI data with Spatial Priors

Will Penny and Guillaume FlandinWellcome Department of Imaging Neuroscience,

University College, London WC1N 3BG.

KEY WORDS: Bayesian, fMRI, spatial prior,GLM, variational

1. Introduction

Functional Magnetic Resonance Imaging (fMRI) us-ing Blood Oxygen Level Dependent (BOLD) con-trast is an established method for making inferencesabout regionally specific activations in the humanbrain [7]. From measurements of changes in bloodoxygenation one can use various statistical models,such as the General Linear Model (GLM) [8], tomake inferences about task-specific changes in un-derlying neuronal activity.

In previous work [21, 23, 22] we have developed aspatially regularised General Linear Model (GLM)for the analysis of fMRI data which allows for thecharacterisation of regionally specific effects usingPosterior Probability Maps (PPMs). This spatialregularisation has been shown [23] to increase thesensitivity of inferences one can make.

This paper reviews our body of work on spatiallyregularised GLMs and describes two new develop-ments. These are (i) an approach for assessing mul-tivariate contrasts and (ii) a method for choosingthe thresholds that generate PPMs. The paper isstructured as follows. Section 2 reviews the theo-retical development of the algorithm. This includesa description of a Variational Bayesian algorithmin which inference is based on an approximationto the posterior distribution that has minimal KL-divergence from the true posterior. Sections 3 and 4describe the new approaches for assessing multivari-ate contrasts and PPM thresholding. In section 5we present results on null fMRI data, synthetic dataand fMRI from functional activation studies of au-ditory and face processing. The paper finishes witha discussion in section 6.

2. Theory

We write an fMRI data set consisting of T timepoints at N voxels as the T ×N matrix Y . In mass-univariate models [8], these data are explained interms of a T × K design matrix X, containing thevalues of K regressors at T time points, and a K×N

matrix of regression coefficients W , containing K re-gression coefficients at each of the N voxels. Themodel is written

Y = XW + E (1)

where E is a T ×N error matrix.It is well known that fMRI data is contaminated

with artifacts. These stem primarily from low-frequency drifts due to hardware instabilities, aliasedcardiac pulsation and respiratory sources, unmod-elled neuronal activity and residual motion artifactsnot accounted for by rigid body registration meth-ods [25]. This results in the residuals of an fMRIanalysis being temporally autocorrelated.

In previous work we have shown that, after re-moval of low-frequency drifts using Discrete CosineTransform (DCT) basis sets, low-order voxel-wiseautoregressive (AR) models are sufficient for mod-elling this autocorrelation [21]. It is important tomodel these noise processes as parameter estima-tion becomes less biased [11] and more accurate[21]. Together, DCT and AR modelling can accountfor long-memory noise processes. Alternative proce-dures for removing low-frequency drifts include theuse of running-line smoothers or polynomial expan-sions [17].

2.1 Model likelihood

We now describe the approach taken in our previouswork. For a P th-order AR model, the likelihood ofthe data is given by

p(Y |W,A, λ) =T∏

t=P+1

N∏n=1

N(ytn − xtwn; (2)

(dtn −Xtwn)T an, λ−1n )

where N(x;m,C) is a uni/multivariate Normal den-sity with mean m and variance/covariance C, n in-dexes the nth voxel, an is a P × 1 vector of au-toregressive coefficients, wn is a K × 1 vector of re-gression coefficients and λn is the observation noiseprecision. The vector xt is the tth row of the designmatrix and Xt is a P ×K matrix containing the pre-vious P rows of X prior to time point t. The scalarytn is the fMRI scan at the tth time point and nth

1

Figure 1: The figure shows the probabilistic depen-dencies underlying our generative model for fMRIdata. The quantities in square brackets are con-stants and those in circles are random variables. Thespatial regularisation coefficients α constrain the re-gression coefficients W . The spatial regularisationcoefficients β constrain the AR coefficients A. Theparameters λ and A define the autoregressive errorprocesses which contribute to the measurements.

voxel and dtn = [yt−1,n, yt−2,n, ..., yt−P,n]T . Becausedtn depends on data P time steps before, the likeli-hood is evaluated starting at time point P + 1, thusignoring the GLM fit at the first P time points.

Equation 2 shows that higher model likelihoodsare obtained when the prediction error ytn−xtwn iscloser to what is expected from the AR estimate ofprediction error.

The voxel wise parameters wn and an are con-tained in the matrices W and A and the voxel-wiseprecisions λn are contained in λ. The next sectiondescribes the prior distributions over these parame-ters. Together, the likelihood and priors define theprobabilistic generative model, which is portrayedgraphically in Figure 1.

2.2 Priors

The graph in Figure 1 shows that the joint proba-bility of parameters and data can be written

p(Y,W, A, λ, α, β) = p(Y |W,A, λ)p(W |α) (3)p(A|β)p(λ|u1, u2)p(α|q1, q2)p(β|r1, r2)

where the first term is the likelihood and the otherterms are the priors. The likelihood is given in equa-tion 2 and the priors are described below.

2.2.1 Regression coefficients

For the regressions coefficients we have

p(W |α) =K∏

k=1

p(wTk |αk) (4)

p(wTk |αk) = N(wT

k ; 0, α−1k D−1

w )

where Dw is a spatial precision matrix. This can beset to correspond to eg. a Low Resolution Tomog-raphy (LORETA) prior, a Gaussian Markov Ran-dom Field (GMRF) prior or a Minimum Norm (MN)prior (Dw = I) [9] as described in earlier work [23].These priors are specified separately for each sliceof data. Specification of 3-dimensional spatial pri-ors (ie. over multiple slices) is desirable from amodelling perspective but is computationally too de-manding for current computer technology.

We can also write wv = vec(W ), wr = vec(WT ),wv = Hwwr where Hw is a permutation matrix.This leads to

p(W |α) = p(wv|α) (5)= N(wv; 0, B−1)

where B is an augmented spatial precision matrixgiven by

B = Hw(diag(α)⊗Dw)HTw (6)

This form of the prior is useful as our specificationof approximate posteriors is based on similar quan-tities.

The above Gaussian priors underly GMRFs andLORETA and have been used previously in fMRI[26] and EEG [18]. They are by no means, how-ever, the optimal choice for imaging data. In EEG,for example, much interest has focussed on the use ofLp-norm priors [3] instead of the L2-norm implicit inthe Gaussian assumption. Additionally, we are cur-rently investigating the use of wavelet priors. Thisis an active area of research and will be the topic offuture publications.

2.2.2 AR coefficients

We also define a spatial prior for the AR coefficientsso that they too can be spatially regularised. Wehave

p(A|β) =P∏

p=1

p(ap|βp) (7)

p(ap|βp) = N(ap; 0, β−1p D−1

a )

2

Again, Da is a user-defined spatial precision matrix,av = vec(A), ar = vec(AT ) and av = Haar whereHa is a permutation matrix. We can write

p(A|β) = p(av|β) (8)= N(av; 0, J−1)

where J is an augmented spatial precision matrix

J = Ha (diag(β)⊗Da) HTa (9)

This form of the prior is useful as our specificationof approximate posteriors is based on similar quan-tities.

We have also investigated ‘Tissue-type’ priorswhich constrain AR estimates to be similar for vox-els in the same tissue-type eg. gray matter, whitematter or cerebro-spinal fluid. Bayesian model se-lection [22], however, favours the smoothly varyingpriors defined in equation 7.

2.2.3 Precisions

We use Gamma priors on the precisions α, β and λ

p(λ|u1, u2) =N∏

n=1

Ga(λn;u1, u2) (10)

p(α|q1, q2) =K∏

k=1

Ga(αk; q1, q2)

p(β|r1, r2) =P∏

p=1

Ga(βp; r1, r2)

where the Gamma density is defined as

Ga(x; b, c) =1

Γ(c)xc−1

bcexp

(−x

b

)(11)

Gamma priors were chosen as they are the conjugatepriors for Gaussian error models. The parametersare set to q1 = r1 = u1 = 10 and q2 = r2 = u2 = 0.1.These parameters produce Gamma densities with amean of 1 and a variance of 10. The robustness of,for example, model selection to the choice of theseparameters is discussed in [22].

2.3 Variational Bayes

The central quantity of interest in Bayesian learningis the posterior distribution p(θ|Y ). This implies es-timation both of the parameters θ and the uncertain-ties associated with their estimation. This can beachieved using standard Markov Chain Monte Carlo(MCMC) [12] procedures to produce samples fromthe posterior. Brain imaging data sets are, however,

prohibitively large (typically N = 50, 000, T = 200)making MCMC impractical for routine use. We havetherefore adopted an approximate inference proce-dure which is computationally efficient. This allowsinferences to be made within minutes rather thanhours. The approach is called Variational Bayes(VB), a full tutorial on which is given in [16]. Inwhat follows we describe the key features.

Given a probabilistic model of the data, the log ofthe ‘evidence’ or ‘marginal likelihood’ can be writtenas

log p(Y ) =∫

q(θ|Y ) log p(Y )dθ

=∫

q(θ|Y ) logp(Y, θ)p(θ|Y )

dθ

=∫

q(θ|Y ) log[q(θ|Y )p(Y, θ)p(θ|Y )q(θ|Y )

]dθ

= F + KL. (12)

Here, q(θ|Y ) is to be considered, for the moment, asan arbitrary density. We have

F =∫

q(θ|Y ) logp(Y, θ)q(θ|Y )

dθ, (13)

which is known (to physicists) as the negative vari-ational free energy and

KL =∫

q(θ|Y ) logq(θ|Y )p(θ|Y )

dθ (14)

is the KL-divergence [6] between the density q(θ|Y )and the true posterior p(θ|Y ).

Equation 12 is the fundamental equation of theVB-framework. Importantly, because the KL-divergence is always positive [6], F provides a lowerbound on the model evidence. Moreover, becausethe KL-divergence is zero when the two densitiesare the same, F will become equal to the model evi-dence when q(θ|Y ) is equal to the true posterior. Forthis reason q(θ|Y ) can be viewed as an approximateposterior.

The aim of VB-learning is to maximise F and somake the approximate posterior as close as possibleto the true posterior. To obtain a practical learningalgorithm we must also ensure that the integrals inF are tractable. One generic procedure for attainingthis goal is to assume that the approximating densityfactorizes over groups of parameters (in physics thisis known as the mean-field approximation). Thus,we consider:

q(θ|Y ) =∏

i

q(θi|Y ) (15)

= q(θi)q(θ\i)

3

where θi is the ith group of parameters and θ\i de-notes all parameters not in the ith group. The distri-butions which maximise F can then, via the calculusof variations, be shown to be [16]

q(θi|Y ) =exp[I(θi)]∫exp[I(θi)]dθi

(16)

where

I(θi) =∫

q(θ\i|Y ) log p(Y, θ)dθ\i (17)

Note that, importantly, this means we are ableto determine the optimal analytic form of the com-ponent posteriors (using equation 16). This is tobe contrasted with Laplace approximations wherewe fix the form of the component posteriors tobe Gaussian about the maximum posterior solution[20].

The above principles lead to a set of coupled up-date rules for the parameters of the component pos-teriors, iterated application of which leads to thedesired maximisation. Further, by computing F fordifferent models, we can perform model selection.This provides a mechanism for fine-tuning models.For example, the choice of hemodynamic basis set[22] or the order of the autoregressive models [21].

It is also possible to approximate the model evi-dence using sampling methods [12, 4]. In the verygeneral context of probabilistic graphical models,Beal and Ghahramani [4] have shown that the VBapproximation of model evidence is considerablymore accurate than the Bayesian Information Cri-terion (BIC) whilst incurring little extra computa-tional cost. Moreover, model selection using VB isof comparable accuracy to a much more computa-tionally demanding method based on Annealed Im-portance Sampling (AIS) [4].

2.4 Approximate Posteriors

This paper uses the Variational Bayes framework[19] for estimation and inference. We describe the al-gorithm developed in previous work [23] in which weassumed that the approximate posterior factorisesover voxels and subsets of parameters.

Because of the spatial priors, the regression coef-ficients in the true posterior p(W |Y ) will clearly becorrelated. Our perspective, however, is that thisis too computationally burdensome for current per-sonal computers to take account of. Moreover, aswe shall see in section 2.4.1, updates for our ap-proximate factorised densities q(wn) do encouragethe approximate posterior means to be similar at

nearby voxels, thereby achieving the desired effectof the prior.

Our approximate posterior is given by

q(W,A, λ, α, β) =∏n

q(wn)q(an)q(λn) (18)∏k

q(αk)∏p

q(βp)

and each component of the approximate posterior isdescribed below. These update equations are self-contained except for a number of quantities that aremarked out using the ‘tilde’ notation. These areAn, bn, Cn, dn and Gn which are all defined in Ap-pendix B of [21].

2.4.1 Regression coefficients

We have

q(wn) = N(wn; wn, Σn) (19)

Σn =(λnAn + Bnn

)−1

wn = Σn

(λnbT

n + rn

)rn = −

N∑i=1,i 6=n

Bniwi

where B is defined as in equation 6 but uses α in-stead of α. The quantities An and bn are expec-tations related to autoregressive processes and aredefined in Appendix B of [21]. In the absence oftemporal autocorrelation we have An = XT X andbTn = XT yn.

2.4.2 AR coefficients

We have

q(an) = N(an;mn, Vn)

where

Vn =(λnCn + Jnn

)−1

(20)

mn = Vn(λndn + jn)

jn = −N∑

i=1,i 6=n

Jnimi

and J is defined as in equation 9 but β is used insteadof β. The subscripts in Jni denote that part of Jrelevant to the nth and ith voxels. The quantities Cn

and dn are expectations that are defined in equation50 of [21].

4

2.4.3 Precisions

The approximate posteriors over the precision vari-ables are Gamma densities. For the precisions onthe observation noise we have

q(λn) = Ga(λn; bn, cn) (21)

1bn

=Gn

2+

1u1

cn =T

2+ u2

λn = bncn

where Gn is the expected prediction error definedin Appendix B of [21]. In the abscence of temporalautocorrelation we have

Gn = (yn −Xwn)T (yn −Xwn) + Tr(ΣXT X) (22)

For the precisions of the regression coefficients wehave

q(αk) = Ga(αk; gk, hk) (23)1gk

=12

(Tr(ΣkDw) + wT

k Dwwk

)+

1q1

hk =N

2+ q2

αk = gkhk

For the precisions of the AR coefficients we have

q(βp) = Ga(βp; r1p, r2p) (24)1

r1p=

12

(Tr(VpDa) + mT

p Damp

)+

1r1

r2p =N

2+ r2

βp = r1pr2p

2.5 Practicalities

For the empirical work in this paper we set up thespatial precision matrices Da and Dw, defined in sec-tion 2.2, to produce GMRF priors. We also used ARmodels of order P = 3.

The VB algorithm is initialised using OrdinaryLeast Square (OLS) estimates for regression and au-toregressive parameters as described in [21]. Quanti-ties are then updated using equations 19,20,21,23,24.These equations can then be iterated until conver-gence, which is defined as less than a 1% increase inF , the objective function. For the empirical work inthis paper, however, we fixed the number of itera-tions to 4.

Expressions for computing F are given in [22].This is an important quantity as it can also be used

for model comparison. This is described at lengthin [22].

The algorithm we have described is implementedin SPM version 5 and can be downloaded from [1].Computation of a number of quantites (eg. Cn, dn

and Gn) is now much more efficient than in previousversions [23]. These improvements are described in aseparate document [27]. To analyse a single sessionof data (eg. the face fMRI data) takes about 30minutes on a typical modern PC.

2.6 Spatio-temporal deconvolution

The central quantity of interest in fMRI analysis isour estimate of effect sizes, embodied in contrastsof regression coefficients. A key update equation inour VB scheme is, therefore, the approximate pos-terior for the regression coefficients. This is givenby equation 19. For the special case of temporallyuncorrelated data we have

Σn =(λnXT X + Bnn

)−1(25)

wn = Σn

(λnXT yn + rn

)where B is a spatial precision matrix and rn is theweighted sum of neighboring regression coefficientestimates.

This update indicates that the regression coeffi-cient estimate at a given voxel regresses towardsthose at nearby voxels. This is the desired effectof the spatial prior and it is preserved despite thefactoristion over voxels in the approximate posterior(see equation 18). Equation 25 can be thought ofas the combination of a temporal prediction XT yn

and a spatial prediction from rn. Each predictionis weighted by its relative precision to produce theoptimal estimate wn. In this sense, the VB up-date rules provide a spatio-temporal deconvolutionof fMRI data. Moreover, the parameters control-ling the relative precisions, λn and α are estimatedfrom the data. This means that our effect sizeestimates derive from an automatically regularisedspatio-temporal deconvolution.

2.7 Global scaling

Before statistical analysis, brain imaging data areusually scaled in some way. In the SPM softwarepackage [1], for example, imaging data are, by de-fault, scaled according to the following procedure.Firstly, the global mean value is computed

g =1

TN

T∑t=1

N∑n=1

unt (26)

5

where unt are fMRI values at voxel n and time pointt. Scaled images are then computed as

ynt =100g

unt (27)

If we express an activated voxel as

ua = ub + ∆u (28)

where ub is the baseline value and ∆u is the absoluteamount of activation then the difference in scaledimage data is then given by

ya − yb =100(ub + ∆u)

g− 100ub

g(29)

=100∆u

g

So, differences in scaled image values correspond tochanges in the original data that are expressed aspercentages of the global mean. Because estimatedregression coefficients are just estimates of differ-ences in scaled image values, eg. the difference be-tween conditions, then they too have this interpreta-tion. The regression coefficients in GLMs (and con-trasts of them, see next section) reflect effect sizes asa percentage of g. This is important as these effectsizes can be plotted in Posterior Probability Maps(see section 5).

3. Contrasts

After having estimated a model, we will be inter-ested in characterising a particular effect, c, whichcan usually be expressed as a linear function or ‘con-trast’ of parameters, w. That is,

c = CT w (30)

where C is a contrast vector or matrix. For exam-ple, the contrast vector cT = [1 − 1] will allow usto look at the difference between two experimentalconditions.

Our statistical inferences will be based on the ap-proximate distribution. Because this factorises overvoxels we can write

q(c) =N∏

n=1

q(cn) (31)

where cn is the effect size at voxel n. Given a con-trast matrix C we have

q(cn) = N(cn;mn, Vn) (32)

with mean and covariance

mn = Cwn (33)Vn = CT ΣnC

Bayesian inference based on this posterior can thentake place using confidence intervals [5]. For univari-ate contrasts we have suggested the use of PosteriorProbability Maps (PPMs). Before discussing thisat length in section 4, we describe a new approachthat allows us to make inferences about multivariatecontrasts. That is, where cn is a vector.

3.1 Multivariate contrasts

The probability α that the zero vector lies on the1− α confidence region of the posterior distributionat each voxel is computed as follows. We first notethat this probability is the same as the probabilitythat the vector mn lies on the edge of the 1 − αconfidence region for the distribution N(mn; 0, Vn).This latter probability can be computed by formingthe test statistic

dn = mTnV −1

n mn (34)

which will be the sum of vn = rank(Vn) indepen-dent, squared Gaussian variables. As such it has aχ2 distribution

p(dn) = χ2(vn) (35)

This procedure is identical to that used for makinginferences in Bayesian Multivariate AutoregressiveModels [13]. We can also use this procedure to testfor two-sided effects, that is, activations or deacti-vations. Although these contrasts are univariate wewill use the term ‘multivariate contrasts’ to also in-clude the assessment of these two-sided effects.

4. Thresholding

In previous work [9] we have suggested deriving Pos-terior Probability Maps (PPMs) by applying twothresholds to the posterior distributions (i) an ef-fect size threshold, γ, and (ii) a probability thresh-old pT . Voxel n is then included in the PPM ifq(cn > γ) > pT .

If voxel n is to be included then the posterior ex-ceedance probability q(cn > γ) is plotted. In thispaper, we instead propose plotting the effect size it-self, cn.

We also propose the following procedure for ex-ploring the posterior distribution of effect sizes.Firstly, plot a map of effect sizes using the thresholds

6

γ = 0 and pT = 1− 1/N where N is the number ofvoxels. We refer to these values as the default thresh-olds. Then, after visual inspection of the resultingmap use a non-zero γ, the value of which reflects ef-fect sizes in areas of interest. It will then be possibleto reduce pT to a value such as 0.95.

Of course, if previous imaging analyses have in-dicated what effect sizes are physiologically relevantthen this exploratory procedure is unnecessary. Al-ternatively, one could stick with the default thresh-olds.

4.1 False positive rates

If we partition effect-size values into two hypothesisspaces H0 : c ≤ γ and H1 : c > γ then we cancharacterise the sensitivity and specificity of our al-gorithm. This is different to classical inference whichuses H0 : c = 0. A False Positive (FP) occurs if weaccept H1 when H0 is true.

If we use the default threshold and the approxi-mate posterior were exact then the distribution ofFPs is binomial with rate p = 1/N . The mean andvariance of the binomial distribution are Np andNp(1−p). The expected number of false positives ineach PPM is therefore N × 1/N = 1. The varianceis N × 1/N × (1 − 1/N) which is approximately 1.We would therefore expect 0, 1 or 2 false positivesper PPM. We suggest that this is sufficiently highspecificity for brain imaging analyses.

Of course, the above result only holds if the ap-proximate posterior is equal to the true posterior.But given that all of our computational effort (seesection 2.4) is aimed at this goal it would not betoo surprising if the above analysis were indicativeof actual FP rates. This issue will be investigatedusing Null fMRI data in the next section.

5. Results

5.1 Null data

This section describes the analysis of a Null data setto find out how many false positives are obtainedusing PPMs with default thresholds.

Images were acquired from a 1.5TSonata(Siemens, Erlangen Germany) which pro-duced T2*-weighted transverse Echo-Planar Images(EPIs) with BOLD contrast. Whole brain EPIsconsisting of 48 transverse slices were acquired everyTR=4.32s resulting in a total of T = 98 scans. Thevoxel size is 3× 3× 3mm. All images were realignedto the first image using a six-parameter rigid-bodytransformation to account for subject movement.

Figure 2: Left: Design matrix for null fMRI data.The first column models a boxcar activation and thesecond column models the mean. There are n = 1..98rows corresponding to the 98 scans. Right: Imageof regression coefficients corresponding to syntheticactivation added to null data. In this image black is0 and white is 1.

We then implemented a standard whole volumeanalysis on images comprising N = 59, 945 voxels.We used the design matrix shown in the left panelof Figure 2. This design is used in the followingsection. Use of the default thresholds resulted in nospurious activations in the PPM. We then repeatedthe above analysis but with a number of differentdesign matrices.

These were based on the epoch design in Figure 2but each epoch onset was jittered by a number be-tween plus or minus 9 scans, sampled from a uniformdistribution, and the epoch durations were drawnfrom a uniform distribution between 4 and 10 scans.Ten such designs were created and VB models fittedto the null data. For designs 1 to 10, the numbersof false positives were 0,0,1,2,4,0,0,1,0 and 0. This isclose to what we’d expect from the binomial analysisdescribed in section 4.1.

5.2 Synthetic data

We then added three synthetic activations to a sliceof null data (z = −13mm). These were created usingthe design matrix and regression coefficient imageshown in Figure 2 (the two regression coefficient im-ages, ie. for the activation and the mean, were iden-tical). These images were formed by placing deltafunctions at three locations and then smoothing withGaussian kernels having FWHMs of 2, 3 and 4 pix-els (going clockwise from the top-left blob). Imageswere then rescaled to make the peaks unity.

In principle, smoothing with a Gaussian kernelrenders the true effect size greater than zero every-where because a Gaussian has infinite spatial sup-port. In practice, however, when implemented ona digital computer with finite numerical precisionmost voxels will be numerically zero. Indeed, oursimulated data contained 299 ‘activated’ voxels ie.

7

Figure 3: Left: Effect as estimated using VB Right:Effect as estimated using OLS. The true effect isshown in the right plot in Figure 2. In these im-ages, black denotes 0 and white 1.

Figure 4: Plots of exceedance probabilities p(cn > γ)for two thresholds. Left: γ = 0. Right: γ = 0.3. Inthese images, black denotes 0 and white 1.

voxels with effect sizes numerically greater than zero.This slice of data was then analysed using VB.

The contrast c = [1, 0]T was then used to look atthe estimated activation effect. This is shown inthe left panel of Figure 3. For comparison, we alsoshow the effect as estimated using OLS. Clearly, OLSestimates are much noisier than VB estimates.

Figure 4 shows plots of the exceedance probabil-ities for two different effect-size thresholds, γ = 0and γ = 0.3. Figure 5 shows thresholded versionsof these images. These are PPMs. Neither of thesePPMs contain any false positives. That is, the trueeffect size is greater than zero wherever a white voxeloccurs. This shows, informally, that use of the de-

Figure 5: PPMs for two thresholds. Left: The de-fault thresholds (γ = 0, pT = 1 − 1/N) Right: Thethresholds γ = 0.3, pT = 0.95. In these images,black denotes 0 and white 1.

Figure 6: Design matrix for analysis of the audi-tory data. The first column models epochs of au-ditory stimulation and the second models the meanresponse.

fault thresholds provides good specificity whilst re-taining reasonable sensitivity. Also, a combinationof non-zero effect-size thresholds and more liberalprobability thresholds can do the same.

5.3 Auditory data

This section describes the use of multivariate con-trasts for an auditory fMRI data set. This data setcomprises whole brain BOLD/EPI images acquiredon a modified 2T Siemens Vision system. Each ac-quisition consisted of 64 contiguous slices (64x64x643mm x 3mm x 3mm voxels). A time series of 96images was acquired with TR=7s from a single sub-ject.

This was an epoch fMRI experiment in whichthe condition for successive epochs alternated be-tween rest and auditory stimulation, starting withrest. Auditory stimulation was bi-syllabic words pre-sented binaurally at a rate of 60 per minute.

These data were analysed using VB with the de-sign matrix shown in Figure 6. To look for voxelsthat increase activity in response to auditory stimu-lation we used the univariate contrast c = [1, 0]T .Figure 7 shows a PPM that maps effect-sizes ofabove threshold voxels.

To look for either increases or decreases in activitywe use the multivariate contrast c = [1, 0]T . This in-ference uses the χ2 approach described earlier. Fig-ures 8 shows the PPM obtained using default thresh-olds.

8

Figure 7: PPM for positive auditory activation.Overlay of effect-size, in units of percentage of globalmean, on subjects MRI for above threshold voxels.The default thresholds were used, that is, we plot cn

for voxels which satisfy p(cn > 0) > 1− 1/N .

Figure 8: PPM for positive or negative auditory ac-tivation. Overlay of χ2 statistic on subjects MRIfor above threshold voxels. The default thresholdswere used, that is, we plot χ2

n for voxels which sat-isfy p(cn > 0) > 1− 1/N .

5.4 Face data

This is an event-related fMRI data set acquired byHenson et al. [15]. The data were acquired duringan experiment concerned with the processing of im-ages of faces [15]. This was an event-related studyin which greyscale images of faces were presented for500ms, replacing a baseline of an oval chequerboardwhich was present throughout the interstimulus in-terval. Some faces were of famous people and weretherefore familiar to the subject and others were not.Each face in the database was presented twice. Thisparadigm is a two-by-two factorial design where thefactors are familiarity and repetition. The four ex-perimental conditions are ‘U1’, ‘U2’, ‘F1’ and ‘F2’which are the first or second (1/2) presentations ofimages of familiar ‘F’ or unfamiliar ‘U’ faces.

Images were acquired from a 2T VISION sys-tem (Siemens, Erlangen, Germany) which producedT2*-weighted transverse Echo-Planar Images (EPIs)with BOLD contrast. Whole brain EPIs consistingof 24 transverse slices were acquired every two sec-onds resulting in a total of T=351 scans. All func-tional images were realigned to the first functionalimage using a six-parameter rigid-body transforma-tion. To correct for the fact that different slices wereacquired at different times, time series were interpo-lated to the acquisition time of the reference slice.Images were then spatially normalized to a standardEPI template using a nonlinear warping method [2].Each time series was then high-pass filtered usinga set of discrete cosine basis functions with a filtercut-off of 128 seconds.

The data were then analysed using the design ma-trix shown in Figure 9. The first 8 columns containstimulus related regressors. These correspond to thefour experimental conditions, where each stimulustrain has been convolved with two different hemo-dynamic bases (i) the canonical Hemodynamic Re-sponse Function (HRF) and (ii) the time derivativeof the canonical [14]. The next 6 regressors in thedesign matrix describe movement of the subject inthe scanner and the final column models the meanresponse.

The model was then fitted using the VB algo-rithm. Figure 10 plots a map of the first autoregres-sive component as estimated using VB. This shows agood deal of heterogeneity and justifies our assump-tion that that AR coefficients are spatially varying.The estimated spatial variation is smooth, however,due to the spatial prior.

Figure 11 shows a PPM for ‘Any effect of faces’.This was obtained using the multivariate contrastmatrix shown in Figure 9.

9

Figure 9: Lower part: Design matrix for analysis offace data, Upper part: Multivariate contrast used totest for any effect of faces.

Figure 10: Image of the first autoregressive coeffi-cient estimated from the face fMRI data. In theseimages, black denotes 0 and white 1.

Figure 11: PPM showing above threshold χ2 statis-tics for any effect of faces

10

6. Discussion

In previous work[10], we have compared the sensitiv-ity and specificity of classical inference to Bayesianinference with Minimum Norm (MN) priors. Thiscomparison was made possible by looking at the ex-pected properties of the Bayesian estimators acrossan ensemble of data sets, and showed that theBayesian inference was no more sensitive than theclassical inference.

Bayesian inferences can be used, however, to lookfor effects greater than a physiologically relevant sizeeg. 0.5% of the global mean. These inferences can bepresented visually using Posterior Probability Maps(PPMs) and have high intrinsic specificity. Also, onecan use Bayesian inference to assess the absence ofan experimental effect [10].

We have since developed a Bayesian inferenceframework based on spatial priors which has beenreviewed in this paper. This prior embodies ourknowledge that evoked responses are spatially homo-geneous and locally contiguous. The approach maybe viewed as an automatically regularized spatio-temporal deconvolution algorithm.

As compared to standard approaches based onspatially smoothing the imaging data itself, the spa-tial regularisation procedure has been shown to re-sult in inferences with higher sensitivity [23].

In this paper we have also described a new PPMprocedure for making inferences about multivariatecontrasts. This allows us to make inferences about(i) hemodynamic responses that are characterised bymultiple basis functions (ii) main effects and inter-actions in factorial fMRI designs and (iii) two-sidedeffects. The procedure uses the same χ2 approachthat has previously been used in the context of Mul-tivariate Autoregressive (MAR) models.

PPMs provide a visual representation of the pos-terior distribution of effect sizes across the brain.They are generated using an effect size thresholdand a probability threshold. These two thresholdsconvert a Gaussian posterior distribution at eachvoxel, specified by two quantities - the mean andvariance, into a single quantity that can be mappedeg. the exceedance probability, effect size or statisticvalue. One can create PPMs using various effect-sizethresholds, which can be chosen on the basis of priorknowledge about what is physiologically revelant ina given experimental context. One can also vary theprobability thresholds used, although 0.95 would bea typical value. The use of these different thresh-olds allows for a visual exploration of the posteriordistribution.

However, it is also useful to categorize responses

into regions which are or are not activated. Suchcategorization is usually unavoidable when report-ing neuroimaging results because one has to choosewhich regions to report in tables and, indeed, dis-cuss. In this paper we have therefore described asimple new procedure for setting the thresholds thatgenerate PPMs. These ‘default thresholds’ com-prise an effect size threshold of zero and a proba-bility threshold of 1 − 1/N where N is the numberof voxels in the volume. Use of PPMs with ‘defaultthresholds’ resulted in low false positive rates for nullfMRI data, and physiologically plausible activationsfor auditory and face fMRI data sets.

A comprehensive Bayesian thresholding approachhas been implemented by Woolrich et al. [24]. Thiswork uses explicit models of the null and alternativehypotheses based on Gaussian and Gamma variates.This requires a further computationally expensivestage of model-fitting, based on spatially regulariseddiscrete Markov Random Fields, but has the bene-fit that false-positive and true-positive rates can becontrolled explicitly.

Acknowledgements

W.D. Penny is funded by the Wellcome Trust. Wewould also like to thank Chloe Hutton for providingthe Null fMRI data and Karl Friston, Tom Nicholsand John Aston for commenting on earlier drafts ofthis manuscript.

References

[1] SPM, Wellcome Department of Imag-ing Neuroscience. Available fromhttp://www.fil.ion.ucl.ac.uk/spm/software/,2002.

[2] J. Ashburner and K.J. Friston. Spatial normal-ization using basis functions. In R.S.J. Frack-owiak, K.J. Friston, C. Frith, R. Dolan, K.J.Friston, C.J. Price, S. Zeki, J. Ashburner, andW.D. Penny, editors, Human Brain Function.Academic Press, 2nd edition, 2003.

[3] T. Auranen, A. Nummenmaa, M. Ham-malainen, I. Jaaskelainen, J. Lampinen, A. Ve-htari, and M. Sams. Bayesian analysis of theneuromagnetic inverse problem with lp normpriors. Neuroimage, 2005. In Press.

[4] M. Beal and Z. Ghahramani. The VariationalBayesian EM algorithm for incomplete data:with application to scoring graphical modelstructures. In J.M. Bernardo, M.J. Bayarri,

11

J.O. Berger, A.P. Dawid, D. Heckerman, A.F.Smith, and M. West, editors, Bayesian Statis-tics, volume 7. Oxford University Press, 2003.

[5] G.E.P Box and G.C. Tiao. Bayesian Inferencein Statistical Analysis. John Wiley, 1992.

[6] T.M. Cover and J.A. Thomas. Elements of In-formation Theory. John Wiley, 1991.

[7] R.S.J. Frackowiak, K.J. Friston, C. Frith,R. Dolan, C.J. Price, S. Zeki, J. Ashburner, andW.D. Penny. Human Brain Function. AcademicPress, 2nd edition, 2003.

[8] K.J. Friston, A.P. Holmes, K.J. Worsley, J.B.Poline, C. Frith, and R.S.J. Frackowiak. Sta-tistical parametric maps in functional imaging:A general linear approach. Human Brain Map-ping, 2:189–210, 1995.

[9] K.J. Friston and W.D. Penny. Posteriorprobability maps and SPMs. NeuroImage,19(3):1240–1249, 2003.

[10] K.J. Friston, W.D. Penny, C. Phillips, S.J.Kiebel, G. Hinton, and J. Ashburner. Classicaland Bayesian inference in neuroimaging: The-ory. NeuroImage, 16:465–483, 2002.

[11] T. Gautama and M.M. Van Hulle. Optimal spa-tial regularisation of autocorrelation estimatesin fMRI analysis. NeuroImage, (23):1203–1216,2004.

[12] A. Gelman, J.B. Carlin, H.S. Stern, and D.B.Rubin. Bayesian Data Analysis. Chapman andHall, Boca Raton, 1995.

[13] L. Harrison, W.D. Penny, and K.J. Friston.Multivariate autoregressive modelling of fMRItime series. NeuroImage, 19(4):1477–1491,2003.

[14] R.N.A. Henson. Analysis of fMRI time series.In R.S.J. Frackowiak, K.J. Friston, C. Frith,R. Dolan, K.J. Friston, C.J. Price, S. Zeki,J. Ashburner, and W.D. Penny, editors, HumanBrain Function. Academic Press, 2nd edition,2003.

[15] R.N.A. Henson, T. Shallice, M.L. Gorno-Tempini, and R.J. Dolan. Face repetition ef-fects in implicit and explicit memory tests asmeasured by fMRI. Cerebral Cortex, 12:178–186, 2002.

[16] H. Lappalainen and J.W. Miskin. EnsembleLearning. In M.Girolami, editor, Advancesin Independent Component Analysis. Springer-Verlag, 2000.

[17] J. Marchini and B. Ripley. A new statisticalapproach to detecting significant activation infunctional MRI. Neuroimage, 12:168–193, 2000.

[18] R. Pascual Marqui, C. Michel, and D. Lehman.Low resolution electromagnetic tomography: anew method for localizing electrical activity ofthe brain. International Journal of Psychophys-iology, pages 49–65, 1994.

[19] M.Beal. Variational Algorithms for Approxi-mate Bayesian Inference. PhD thesis, GatsbyComputational Neuroscience Unit, UniversityCollege London, 2003.

[20] J.J.K. O’Ruaniaidh and W.J. Fitzgerald. Nu-merical Bayesian Methods Applied to SignalProcessing. Springer, 1996.

[21] W.D. Penny, S.J. Kiebel, and K.J. Friston.Variational Bayesian Inference for fMRI timeseries. NeuroImage, 19(3):727–741, 2003.

[22] W.D. Penny and N. Trujillo-Bareto. Bayesiancomparison of spatially regularised general lin-ear models. Human Brain Mapping, 2005. Ac-cepted.

[23] W.D. Penny, N. Trujillo-Barreto, and K.J. Fris-ton. Bayesian fMRI time series analysis withspatial priors. NeuroImage, 24(2):350–362,2005.

[24] M. W. Woolrich, T.E.J. Behrens, C.J. Beck-man, and S. M. Smith. Mixture models withadaptive spatial regularisation for segmentationwith an application to fMRI data. IEEE Trans-actions on Medical Imaging, 14(6):1370–1386,December 2004.

[25] M. W. Woolrich, B. D. Ripley, M. Brady, andS. M. Smith. Temporal autocorrelation in uni-variate linear modelling of fMRI data. Neu-roImage, 14(6):1370–1386, December 2001.

[26] M.W. Woolrich, T.E. Behrens, and S.M. Smith.Constrained linear basis sets for HRF modellingusing Variational Bayes. NeuroImage, 21:1748–1761, 2004.

[27] W.Penny and G. Flandin. Bayesian analy-sis of single-subject fMRI: SPM implementa-tion. Technical report, Wellcome Departmentof Imaging Neuroscience, 2005.

12

bayesian analysis of fmri data with spatial priors · · 2005-12-20bayesian analysis of fmri data...

Documents