
Post on 28-Mar-2015


REALCOM

Multilevel models for realistically complex data

Methodology and examples for:

• Measurement errors

• Multilevel structural equations

• Multivariate responses at several levels and of different types

An ESRC research project at Bristol University

General Format

• MATLAB software
– Free standing executable programs
– ASCII and worksheet input and output
– Graphical menu based input specification
– Model equation display
– Monitoring of MCMC chains

• A training manual containing:
– Outline of methodology
– Worked through examples

Markov Chain Monte Carlo: a quick introduction

• A Bayesian simulation-based method that, given starting values, samples a new set of parameters at each cycle of a ‘Markov chain’

• This yields a final chain (after discarding a burn-in set) of, say, 5000 sets of values from the (joint) posterior distribution of the parameters

• The posterior is formed by combining the likelihood based on the data with a prior distribution, typically diffuse

• These chains are used for inference: e.g. the chain mean for a parameter is analogous to the point estimate from a likelihood analysis, and chain quantiles give interval estimates

Consider the simple 2-level model:

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}, \quad u_j \sim N(0, \sigma^2_u), \quad e_{ij} \sim N(0, \sigma^2_e)$$

The parameters in this model are the fixed coefficients, the two variances and the level 2 residuals.

From suitable starting values the chain eventually ‘settles down’, so that sampling is from the true posterior distribution. We then need to sample enough values to provide stable estimates, using suitable ‘convergence’ criteria.

All the MATLAB routines use MCMC sampling.
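To make the sampling cycle concrete, here is a minimal Python sketch of a Gibbs sampler for a 2-level random intercept model of the form y_ij = b0 + b1*x_ij + u_j + e_ij. It is not the REALCOM MATLAB code. The function name `gibbs_two_level`, the flat prior on the coefficients, the inverse-gamma(0.001, 0.001) priors on the variances and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def gibbs_two_level(y, x, group, n_iter=2000, burn_in=500, seed=0):
    """Illustrative Gibbs sampler for y_ij = b0 + b1*x_ij + u_j + e_ij."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(y)), x])     # fixed-part design matrix
    J = int(group.max()) + 1                      # groups coded 0..J-1
    n_j = np.bincount(group, minlength=J)         # level 1 units per group
    XtX_inv = np.linalg.inv(X.T @ X)
    a = b = 0.001                                 # diffuse inverse-gamma prior (assumed)
    beta, u, s2u, s2e = np.zeros(2), np.zeros(J), 1.0, 1.0   # starting values
    chain = []
    for it in range(n_iter):
        # 1. fixed coefficients given residuals and variances (flat prior)
        beta_mean = XtX_inv @ X.T @ (y - u[group])
        beta = rng.multivariate_normal(beta_mean, s2e * XtX_inv)
        # 2. level 2 residuals given everything else
        resid = y - X @ beta
        prec = n_j / s2e + 1.0 / s2u
        sums = np.bincount(group, weights=resid, minlength=J)
        u = rng.normal(sums / s2e / prec, np.sqrt(1.0 / prec))
        # 3. the two variances: inverse-gamma draws as reciprocals of gamma draws
        s2u = 1.0 / rng.gamma(a + J / 2, 1.0 / (b + 0.5 * np.sum(u ** 2)))
        e = resid - u[group]
        s2e = 1.0 / rng.gamma(a + len(y) / 2, 1.0 / (b + 0.5 * np.sum(e ** 2)))
        if it >= burn_in:                         # discard the burn-in cycles
            chain.append([beta[0], beta[1], s2u, s2e])
    return np.array(chain)

# Simulated data with known (hypothetical) parameter values
rng = np.random.default_rng(42)
J, n_per = 50, 20
group = np.repeat(np.arange(J), n_per)
x = rng.normal(size=J * n_per)
u_true = rng.normal(0.0, 1.0, J)
y = 1.0 + 2.0 * x + u_true[group] + rng.normal(0.0, np.sqrt(0.5), J * n_per)

chain = gibbs_two_level(y, x, group)
post_mean = chain.mean(axis=0)                    # analogue of point estimates
lower, upper = np.percentile(chain, [2.5, 97.5], axis=0)   # 95% intervals
```

The columns of `chain` are the intercept, the slope, the level 2 variance and the level 1 variance. In practice one would also inspect trace plots of each column before trusting the summaries.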

Measurement errors

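1. Continuous variables: a simple example:

• Basic model is:

$$x^0 = x + m$$

where $x^0$ is the observed value, $x$ the true value and $m$ the measurement error.

• With a model of interest e.g.

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}, \quad u_j \sim N(0, \sigma^2_u), \quad e_{ij} \sim N(0, \sigma^2_e)$$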

Some assumptions we need to make

• Measurement error variance $\sigma^2_m$ assumed known, with $m \sim N(0, \sigma^2_m)$, or alternatively

• Reliability: $R = \sigma^2_x / (\sigma^2_x + \sigma^2_m) = \sigma^2_x / \sigma^2_{x^0}$

• We also need a distribution for the true value: $x \sim N(\mu_x, \sigma^2_x)$

• An important issue is the value chosen for $\sigma^2_m$, so a sensitivity analysis is useful; we can also give it a prior.
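A short simulation shows why the measurement error variance matters. This is a single-level sketch with hypothetical parameter values, and it uses a simple moment correction (dividing by the reliability) rather than the MCMC approach used by the software. With a predictor observed with error, the naive regression slope is shrunk towards zero by roughly the reliability R.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
sigma2_x, sigma2_m = 1.0, 0.25       # true-value and measurement error variances (assumed)
beta0, beta1 = 0.5, 2.0              # hypothetical model of interest

x_true = rng.normal(0.0, np.sqrt(sigma2_x), n)
x_obs = x_true + rng.normal(0.0, np.sqrt(sigma2_m), n)   # observed = true + error
y = beta0 + beta1 * x_true + rng.normal(0.0, 1.0, n)

R = sigma2_x / (sigma2_x + sigma2_m)                     # reliability, here 0.8
naive_slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
corrected_slope = naive_slope / R                        # moment-based correction

print(f"reliability      {R:.2f}")
print(f"naive slope      {naive_slope:.3f}")
print(f"corrected slope  {corrected_slope:.3f}")
```

Rerunning with different values of `sigma2_m` is a crude form of the sensitivity analysis mentioned above.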

2. Misclassification errors

• Assume a binary (0,1) variable, for example whether or not a school pupil is eligible for free school meals (yes=1)

• The probability of observing a zero (no eligibility), given that the true value is zero, is $P_{obs}(0|0)$, and the probability of observing a one given that the true value is zero is $P_{obs}(1|0)$; likewise we have $P_{obs}(0|1)$ and $P_{obs}(1|1)$

• We now assume we know these misclassification probabilities; the target model is similar to before, with a binary predictor.

Modelling considerations

• We can model multivariate continuous measurement errors, but only independent binary misclassifications.

• We can allow different measurement error variances and covariances for different groups, e.g. gender.

• In the multivariate case we typically need non-zero correlations between measurement errors. For two variables with common reliability $R$, observed correlation $\rho_o$ and measurement error correlation $\rho_m$:

$$\frac{\rho_o - R}{1 - R} < \rho_m < \frac{\rho_o + R}{1 - R}$$

• Thus, say, if R = 0.7 and the observed correlation = 0.8 then we require a measurement error correlation > 0.33
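The numerical example can be checked with a few lines of Python. The sketch assumes two standardised variables with the same reliability R, so that the observed correlation equals R times the true correlation plus (1 - R) times the measurement error correlation. Requiring the true correlation to lie in [-1, 1] then bounds the measurement error correlation. The function name is hypothetical.

```python
def error_correlation_bounds(rho_obs, reliability):
    """Range of measurement error correlations compatible with a valid true correlation.

    Assumes two standardised variables sharing the same reliability, so that
    rho_obs = reliability * rho_true + (1 - reliability) * rho_error.
    """
    lower = (rho_obs - reliability) / (1.0 - reliability)   # from rho_true <= 1
    upper = (rho_obs + reliability) / (1.0 - reliability)   # from rho_true >= -1
    # a correlation cannot itself fall outside [-1, 1]
    return max(lower, -1.0), min(upper, 1.0)

lower, upper = error_correlation_bounds(rho_obs=0.8, reliability=0.7)
print(f"measurement error correlation must lie between {lower:.2f} and {upper:.2f}")
```

With R = 0.7 and an observed correlation of 0.8 the lower bound is 0.1 / 0.3, which is about 0.33, while the upper bound is not binding.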

An educational example

• Maths test score related to prior test scores and FSM eligibility.

• We will look at continuous, correlated and binary measurement errors.

Open measurement-error.exe and read file ‘classsize’

Summary table for analyses:

Factor analysis and structural equation models

Consider a single level factor model where we have several responses on each member of a sample:

$$y_{ri} = \beta_{0r} + \lambda_r \nu_i + e_{ri}, \quad \nu_i \sim N(0, \sigma^2_\nu), \quad e_{ri} \sim N(0, \sigma^2_{er})$$

where r indexes the response variable and i the person, and $\lambda_r$ is the loading of the r-th response on the single factor $\nu_i$.

This is a special kind of multivariate model where we assume the residuals are independent, and the covariance between two responses $r_1$ and $r_2$ is thus given by

$$\lambda_{r_1} \lambda_{r_2} \sigma^2_\nu$$

A constraint is needed for identifiability and the default is to choose $\sigma^2_\nu = 1$.
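The covariance structure implied by a single-factor model can be checked by simulation. In the sketch below the loadings, intercepts and residual variances are hypothetical, and the factor variance is fixed at 1 as in the default constraint. Under these assumptions the covariance between two different responses is the product of their loadings, and each variance is the squared loading plus the residual variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
loadings = np.array([1.0, 0.8, 0.5])       # hypothetical loadings
intercepts = np.array([0.0, 1.0, -1.0])    # hypothetical intercepts
resid_var = np.array([0.5, 0.4, 0.3])      # hypothetical residual variances

factor = rng.normal(0.0, 1.0, n)           # factor variance constrained to 1
residuals = rng.normal(0.0, np.sqrt(resid_var), (n, 3))   # independent residuals
y = intercepts + np.outer(factor, loadings) + residuals

implied_cov = np.outer(loadings, loadings) + np.diag(resid_var)
sample_cov = np.cov(y, rowvar=False)
max_abs_diff = np.abs(sample_cov - implied_cov).max()

print(np.round(implied_cov, 3))
print(np.round(sample_cov, 3))
```

The off-diagonal entries of `implied_cov` contain only products of loadings, which is why the residual independence assumption is what makes this a factor model rather than a general multivariate model.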

Extensions- further factors

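We can add explanatory variables in addition to the $\beta_{0r}$ (see later) or we can add further factors:

$$y_{ri} = \beta_{0r} + \lambda_{1r} \nu_{1i} + \lambda_{2r} \nu_{2i} + e_{ri}$$

$$\begin{pmatrix} \nu_{1i} \\ \nu_{2i} \end{pmatrix} \sim N(0, \Omega_\nu), \quad \Omega_\nu = \begin{pmatrix} \sigma^2_{\nu 1} & \\ \sigma_{\nu 12} & \sigma^2_{\nu 2} \end{pmatrix}, \quad e_{ri} \sim N(0, \sigma^2_{er})$$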

As number of factors increases, we require further constraints, typically on loading values. A popular choice is ‘simple structure’ with each response loading on only 1 factor and non-zero correlations between factors.

Extensions – structural variables

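We can allow the factors themselves to depend on further variables e.g.

$$y_{ri} = \beta_{0r} + \lambda_r \nu_i + e_{ri}$$

$$\nu_i = \beta_1 x_{1i} + \nu^*_i$$

$$\nu^*_i \sim N(0, \sigma^2_{\nu^*}), \quad e_{ri} \sim N(0, \sigma^2_{er})$$

Or alternatively, but less commonly

$$y_{ri} = \beta_{0r} + \beta_{1r} x_{1i} + \lambda_r \nu_i + e_{ri}$$

$$\nu_i \sim N(0, \sigma^2_\nu), \quad e_{ri} \sim N(0, \sigma^2_{er})$$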

Two level factor models

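Standard formulation

$$y_{rij} = \beta_{0r} + \lambda^{(1)}_r \nu^{(1)}_{ij} + \lambda^{(2)}_r \nu^{(2)}_j + u_{rj} + e_{rij}$$

$$\nu^{(1)}_{ij} \sim N(0, \sigma^2_{\nu(1)}), \quad \nu^{(2)}_j \sim N(0, \sigma^2_{\nu(2)}), \quad e_{rij} \sim N(0, \sigma^2_{er}), \quad u_{rj} \sim N(0, \sigma^2_{ur})$$

Alternatively

$$y_{rij} = \beta_{0r} + \lambda^{(1)}_r \nu^{(1)}_{ij} + e_{rij}$$

$$\nu^{(1)}_{ij} = \nu^{(1)*}_{ij} + u^*_j$$

$$\nu^{(1)*}_{ij} \sim N(0, \sigma^2_{\nu(1)*}), \quad e_{rij} \sim N(0, \sigma^2_{er}), \quad u^*_j \sim N(0, \sigma^2_{u^*})$$

But we shall not consider this case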

Example – PISA data

A survey of the reading performance of 15 year olds in 32 countries, carried out by the OECD in 2000.

We use one subscale of 35 items, ‘retrieving information’, and look at France and England.

First we shall fit one and two level models assuming responses are Normal. In fact they are binary and ordered, but we come to that later.

Open structural-equation.exe and load pisadata

Binary and ordered responses

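Assume a binary response z. We will use the idea of a latent Normal distribution. Consider the (factor) model for a single response:

$$y_{ri} = \beta_{0r} + \lambda_r \nu_i + e_{ri}, \quad \nu_i \sim N(0, \sigma^2_\nu), \quad e_{ri} \sim N(0, 1)$$

where we observe a positive (=1) response for our binary variable z if y is positive, that is

$$y_{ri} = \beta_{0r} + \lambda_r \nu_i + e_{ri} > 0 \quad \text{or} \quad e_{ri} > -(\beta_{0r} + \lambda_r \nu_i)$$

So that we obtain the probit model

$$\text{Prob}(z_{ri} = 1) = \text{Prob}(e_{ri} > -(\beta_{0r} + \lambda_r \nu_i)) = \int_{-(\beta_{0r} + \lambda_r \nu_i)}^{\infty} \phi(t)\,dt = \int_{-\infty}^{\beta_{0r} + \lambda_r \nu_i} \phi(t)\,dt$$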
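As a small numerical illustration of the probit link, the probability of a positive response is the standard Normal distribution function evaluated at the linear predictor (intercept plus loading times factor value). The sketch uses only the Python standard library. The function names and the parameter values are hypothetical.

```python
from math import erf, sqrt

def std_normal_cdf(t):
    """Standard Normal distribution function, written via the error function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def prob_positive(intercept, loading, factor_value):
    """Probit probability that the latent Normal response is positive."""
    return std_normal_cdf(intercept + loading * factor_value)

# Hypothetical item: intercept 0.2, loading 0.9, evaluated at three factor values
for factor_value in (-1.0, 0.0, 1.0):
    p = prob_positive(0.2, 0.9, factor_value)
    print(f"factor value {factor_value:+.1f}: Prob(z = 1) = {p:.3f}")
```

A positive loading means the probability of a positive response rises with the factor value, which is how the binary items carry information about the factor.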

Ordered data

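Consider the cumulative probability of being in one of the lowest s+1 categories of a p category variable, with categories numbered from 0 upwards (s = 0, …, p-2):

$$\gamma^{(s)}_{ri} = \sum_{f=0}^{s} \pi_{fri}$$

We extend the binary response model as:

$$y_{ri} = \beta_{0r} + \alpha_{sr} + \lambda_r \nu_i + e_{ri}, \quad e_{ri} \sim N(0, 1), \quad \alpha_{0r} = 0$$

where the $\beta_{0r} + \alpha_{sr}$ define a set of ‘thresholds’ for the categories.

So suppose we have a 3-category variable, then for observed responses

$$z_{ri} = \begin{cases} 0 & \text{if } y_{ri} \le \beta_{0r} + \lambda_r \nu_i \\ 1 & \text{if } \beta_{0r} + \lambda_r \nu_i < y_{ri} \le \beta_{0r} + \alpha_{1r} + \lambda_r \nu_i \\ 2 & \text{if } y_{ri} > \beta_{0r} + \alpha_{1r} + \lambda_r \nu_i \end{cases}$$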

PISA data with binary/ordered responses

• In fact all the responses are binary except for 4 with 3 ordered categories: C9, C14, C20, and C26

• Change these responses and rerun models.

• Finally fit explanatory variables Country and Gender in the structural part of the model.

Multivariate models with responses at 2 levels

• Consider first 2 Normal responses:

$$y^{(1)}_{ij} = X_{1ij} \beta^{(1)} + u^{(1)}_j + e^{(1)}_{ij}$$

$$y^{(2)}_j = X_{2j} \beta^{(2)} + u^{(2)}_j$$

$$e^{(1)}_{ij} \sim MVN(0, \Omega_1), \quad u_j = (u^{(1)}_j, u^{(2)}_j)^T, \quad u_j \sim MVN(0, \Omega_2)$$

Superscript indicates level

• Models are linked via the level 2 covariance matrix

• The MCMC algorithm handles missing response data and categorical (binary, ordered and unordered) as well as Normal data.

• First example is a repeated measures growth curve model

Child heights + adult height

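$$y^{(2)}_j = \beta^{(2)}_0 + u^{(2)}_{0j}$$

$$y^{(1)}_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t^2_{ij} + \beta_3 t^3_{ij} + u^{(1)}_{0j} + u^{(1)}_{1j} t_{ij} + e_{ij}$$

$$u_j = \begin{pmatrix} u^{(1)}_{0j} \\ u^{(1)}_{1j} \\ u^{(2)}_{0j} \end{pmatrix} \sim MVN(0, \Omega_u), \quad \Omega_u = \begin{pmatrix} \sigma^{(1)2}_{u0} & & \\ \sigma^{(1,1)}_{u01} & \sigma^{(1)2}_{u1} & \\ \sigma^{(1,2)}_{u00} & \sigma^{(1,2)}_{u10} & \sigma^{(2)2}_{u0} \end{pmatrix}, \quad e_{ij} \sim N(0, \sigma^2_e)$$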

Child height as a cubic polynomial with intercept + slope random at level 2

Load growthdata.txt and fit the model

Results:

Two level growth model.

Coefficient                Estimate   S.E.
Level 1 model
  Intercept                153.05     0.69
  Age (about age 13.0)     7.07       0.16
  Age-squared              0.294      0.054
  Age-cubed                -0.208     0.029
Level 2 model
  Intercept                174.70     0.80

Level 2 covariance matrix
  55.77    1.29   50.01
   1.30    0.53    1.24
  50.01    1.24   69.42

Level 1 variance           3.21

Adult height prediction

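Suppose we have 2 growth measures: we want a regression prediction of the form

$$y^{(2)}_j = \alpha_0 + \alpha_1 \tilde{y}_{1j} + \alpha_2 \tilde{y}_{2j} + w_j, \qquad \tilde{y}_{ij} = y^{(1)}_{ij} - (\beta_0 + \beta_1 t_{ij} + \beta_2 t^2_{ij} + \beta_3 t^3_{ij}), \quad i = 1, 2$$

This leads to:

$$\begin{pmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{pmatrix} = \begin{pmatrix} \sigma^{(1)2}_{u0} + 2\sigma^{(1,1)}_{u01} t_{1j} + \sigma^{(1)2}_{u1} t^2_{1j} + \sigma^2_e & \\ \sigma^{(1)2}_{u0} + \sigma^{(1,1)}_{u01} (t_{1j} + t_{2j}) + \sigma^{(1)2}_{u1} t_{1j} t_{2j} & \sigma^{(1)2}_{u0} + 2\sigma^{(1,1)}_{u01} t_{2j} + \sigma^{(1)2}_{u1} t^2_{2j} + \sigma^2_e \end{pmatrix}^{-1} \begin{pmatrix} \sigma^{(1,2)}_{u00} + \sigma^{(1,2)}_{u10} t_{1j} \\ \sigma^{(1,2)}_{u00} + \sigma^{(1,2)}_{u10} t_{2j} \end{pmatrix}$$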
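The prediction can be sketched numerically by plugging in the estimates from the growth model results table. This is a sketch under stated assumptions, not the software's own routine. It uses the standard conditional mean of a multivariate Normal, and it includes the level 1 variance on the diagonal because the predictors are observed heights. The function names and the example ages and heights are hypothetical. The table prints the intercept/slope covariance as both 1.29 and 1.30, and 1.30 is used here.

```python
import numpy as np

# Estimates copied from the results table (age is centred at 13.0)
beta_child = np.array([153.05, 7.07, 0.294, -0.208])   # cubic fixed part, level 1
beta_adult = 174.70                                    # level 2 intercept
var_u0, cov_u01, var_u1 = 55.77, 1.30, 0.53            # child intercept and slope
cov_u0_adult, cov_u1_adult, var_adult = 50.01, 1.24, 69.42
var_e = 3.21                                           # level 1 variance

def child_mean(ages):
    """Fixed-part (population average) child height at the given ages."""
    t = np.asarray(ages, dtype=float) - 13.0
    return beta_child[0] + beta_child[1] * t + beta_child[2] * t**2 + beta_child[3] * t**3

def predict_adult_height(ages, heights):
    """Conditional mean and variance of adult height given child measurements."""
    t = np.asarray(ages, dtype=float) - 13.0
    resid = np.asarray(heights, dtype=float) - child_mean(ages)
    # covariance matrix of the child residuals at the measured ages
    V = (var_u0 + cov_u01 * (t[:, None] + t[None, :])
         + var_u1 * np.outer(t, t) + var_e * np.eye(t.size))
    # covariances between each child residual and the adult residual
    c = cov_u0_adult + cov_u1_adult * t
    alpha = np.linalg.solve(V, c)                      # regression coefficients
    return beta_adult + alpha @ resid, var_adult - c @ alpha

# Hypothetical child measured at ages 11 and 12
prediction, prediction_var = predict_adult_height([11.0, 12.0], [142.0, 148.0])
print(f"predicted adult height {prediction:.1f}, prediction variance {prediction_var:.1f}")
```

A child exactly on the average growth curve is predicted to reach the average adult height, and the prediction variance is always smaller than the unconditional adult height variance.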

Mixed response types and missing data

• Normal and ordered data already considered in structural equation models

• We now introduce unordered categorical responses

• We can also have general Normalising transformations

• Missing data via imputation is an important application for these models

Unordered categorical responses

Assume p categories where an individual responds to just one.

We have

$$y_{hi} = 1 \text{ if response is in category } h \text{ for individual } i, \; 0 \text{ otherwise}$$

where h indexes the response. For each $y_{hi}$ we assume an underlying latent variable $v_{hi}$ exists and that we have the following model:

$$v_{hi} = X_{hi} \beta_h + e_{hi}, \quad e_i \sim MVN(0, I_{p-1})$$

$I_{p-1}$ is a $(p-1) \times (p-1)$ correlation matrix and the $e_i$ are mutually independent vectors. $X_{hi}$ is $(1 \times s)$, $\beta_h$ is $(s \times 1)$, $e_i$ is $((p-1) \times 1)$, $\beta^T = \{\beta_1^T, \ldots, \beta_{p-1}^T\}$, and $\beta$ is $(s(p-1) \times 1)$.

For identifiability we model p-1 categories and assume $v_{pi} = 0$.

The maximum indicant model: we observe category h for individual i iff $v_{hi} > v_{h^*i} \;\; \forall\, h^* \ne h$, and observe category p if $v_{hi} < 0 \;\; \forall\, h$,

so that

$$pr[y_{hi} = 1] = pr[X_{hi} \beta_h + e_{hi} > X_{h^*i} \beta_{h^*} + e_{h^*i} \;\; \forall\, h^* \ne h]$$
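The maximum indicant idea can be illustrated by simulation. The sketch assumes latent variables for p-1 categories with independent standard Normal errors and a reference category p whose latent value is fixed at 0. The observed category is the one with the largest latent value. The function name and the linear predictor values are hypothetical.

```python
import numpy as np

def category_probabilities(linear_predictors, n_draws=200_000, seed=0):
    """Monte Carlo category probabilities under a maximum indicant model.

    linear_predictors holds the fixed part for categories 1..p-1; the last
    (reference) category p has its latent variable set to 0 (assumption).
    """
    rng = np.random.default_rng(seed)
    xb = np.asarray(linear_predictors, dtype=float)
    v = xb + rng.standard_normal((n_draws, xb.size))     # latent variables
    v_all = np.column_stack([v, np.zeros(n_draws)])      # append the reference category
    chosen = v_all.argmax(axis=1)                        # category with the largest latent value
    return np.bincount(chosen, minlength=xb.size + 1) / n_draws

# Three categories with zero linear predictors for the two modelled categories
probs = category_probabilities([0.0, 0.0])
print(np.round(probs, 3))
```

With both linear predictors at zero, the reference category is chosen only when both latent values are negative, so its probability is close to 0.25 and the other two categories share the remainder equally.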

Handling missing data

Multiple imputation – briefly and simply

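Consider the model of interest (MOI)

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

We turn this into a multivariate response model

$$y_i = \beta_1 + e_{1i}$$

$$x_i = \beta_2 + e_{2i}$$

$$\begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix} \sim N(0, \Omega), \quad \Omega = \begin{pmatrix} \sigma^2_1 & \\ \sigma_{12} & \sigma^2_2 \end{pmatrix}$$

and obtain residual estimates $\hat{e}_{1i}, \hat{e}_{2i}$ (from an MCMC chain) for the values which are missing. Use these to ‘fill in’ the missing values and produce a complete data set. Do this (independently) n times (e.g. n = 20). Fit the MOI to each data set and combine according to rules to get estimates and standard errors.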
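The combination step is usually done with Rubin's rules. The sketch below assumes that is what "rules" refers to. The pooled estimate is the average over the completed data sets, and its variance is the average within-imputation variance plus (1 + 1/n) times the between-imputation variance. The function name and the input numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Combine one coefficient across n completed data sets with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    n = q.size
    pooled = q.mean()                       # pooled point estimate
    within = np.mean(se ** 2)               # average within-imputation variance
    between = q.var(ddof=1)                 # between-imputation variance
    total = within + (1.0 + 1.0 / n) * between
    return pooled, np.sqrt(total)

# Hypothetical estimates of one coefficient from 4 imputed data sets
pooled_est, pooled_se = pool_rubin([0.13, 0.15, 0.12, 0.14], [0.07, 0.07, 0.07, 0.07])
print(f"pooled estimate {pooled_est:.3f} (SE {pooled_se:.4f})")
```

The pooled standard error is larger than any single-data-set standard error whenever the estimates differ across imputations, which reflects the extra uncertainty due to the missing values.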

Class size example

Load classsize_impute

MOI is Normalised exam score as response regressed on pretest score, gender, FSM, class size. 50% of level 1 units have missing data. Multivariate model:

Table 6. Multivariate responses model fitted to data with 50% of records having missing data

Variable              Intercept (s.e.)
Post maths            0.1336 (0.0708)
Pre Maths             0.0321 (0.0713)
Gender                0.0734 (0.0474)
FSM                   -1.0898 (0.1293)
Class size (-30)      -4.0494 (0.5968)

Level 1 covariance matrix
   0.6918   0.4440  -0.0957  -0.1956
   0.4440   0.7836  -0.1205  -0.1742
  -0.0957  -0.1205   1.0000  -0.0119
  -0.1956  -0.1742  -0.0119   1.0000

Level 2 covariance matrix
   0.2147   0.1046  -0.0057  -0.0597  -0.1930
   0.1046   0.2141   0.0185  -0.1404   0.0965
  -0.0057   0.0185   0.0242  -0.0423   0.0151
  -0.0597  -0.1404  -0.0423   0.6005   0.0109
  -0.1930   0.0965   0.0151   0.0109  14.7433

MI estimates vs listwise deletion

Fixed effects in multivariate model: 50% records MCAR

Estimate              Listwise (SE)      MI (SE)            Complete (SE)
Post maths            0.102 (0.088)      0.134 (0.071)      0.134 (0.070)
Pre Maths             0.011 (0.088)      0.032 (0.071)      0.019 (0.071)
Gender                0.096 (0.074)      0.073 (0.047)      0.069 (0.047)
FSM                   -1.124 (0.159)     -1.090 (0.129)     -1.064 (0.129)
Class size (-30)      -4.030 (0.602)     -4.049 (0.597)     -4.267 (0.544)

Further extensions

• Box-Cox normalising transformations: $z = \lambda^{-1}(y^\lambda - 1)$

• Application to survival data treated as an ordered response when divided into discrete time intervals

• Combination of measurement errors, structural models and responses at >1 level into a single program

• Incorporation into MLwiN

General remarks

• Report back welcome (h.goldstein@bristol.ac.uk)

• A REALCOM discussion group is under consideration

Use with care!
