
CFA and More in R!

Menglin Xu([email protected])

Department of Educational Studies(QREM)


Outline

Overview of lavaan and PISA data

Data Screening in R, a brief overview

Confirmatory factor analysis (CFA)

• One-factor CFA, continuous vs ordinal data

• Two-factor CFA

• Measurement Invariance

Structural equation modeling (SEM)

• Structural model

• Mediation model

lavaan

An R package for latent variable analysis (Rosseel, 2012).

Regression

CFA

Path analysis

SEM

Freely downloadable (open source)

Yields results comparable to Mplus (mimic = "mplus")

lavaan: basic operators & utilities (please refer to Rosseel, 2017 for more details).

Syntax: Description

• y ~ x1 + x2 + x3 + x4: y is regressed on x1-x4 (works for both observed and latent variables).

• F1 =~ y1 + y2 + y3: F1 is measured by y1-y3.

• y1 ~~ y2: residual (error) covariance between y1 and y2.

• y1 ~ 1: intercept of y1.

• cfa(): fits measurement models, including multiple-group models.

• sem(): fits models with structural paths.
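As a minimal sketch of how these operators combine in one model string, the block below parses a made-up model with lavaanify() and lists the operators it contains. The names F1, F2, y1-y6, and x1 are hypothetical, and nothing is fitted, so no data are needed.

```r
library(lavaan)

# Hypothetical model: two factors, one regression, one residual covariance, one intercept
toy.model <- '
  F1 =~ y1 + y2 + y3   # F1 is measured by y1-y3
  F2 =~ y4 + y5 + y6   # F2 is measured by y4-y6
  F2 ~ F1 + x1         # F2 is regressed on F1 and x1
  y1 ~~ y2             # residual covariance between y1 and y2
  y1 ~ 1               # intercept of y1
'

# lavaanify() turns the syntax into a parameter table without fitting anything
pt <- lavaanify(toy.model)
pt[, c("lhs", "op", "rhs")]   # left-hand side, operator, right-hand side
table(pt$op)                  # how often each operator appears
```

Reading the lhs/op/rhs columns back is a quick way to confirm that a model string says what was intended before fitting it with cfa() or sem().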

PISA 2015 Student Data_U.S.

• The Programme for International Student Assessment (PISA)

• Data retrieved from http://www.oecd.org/pisa/data/2015database/

• Coordinated by the Organization for Economic Co-operation and Development (OECD)

• Spanning 35 OECD countries/regions, PISA assesses skills in reading, maths, and science with the focus rotating every three years.

• Targets 15-year-old secondary school students

• Also assesses multiple cognitive, social, and emotional well-being constructs (e.g., self-efficacy, belief, engagement).

• N = 5712 (2854 males) in science_15.dat

Selected Variables

• Male (1: male; 0: female)

• Parental support (pa_sup1 – pa_sup4), 4-point

• Motivation (mot1 – mot9; 5 items for enjoyment, 4 for instrumental motivation), 4-point

• Science efficacy (sci_eff1 – sci_eff8), 4-point

• Home educational resources (HEDRES), continuous

• Science performance (PV1SCIE), continuous (M = 496, SD = 97.5)

• All missing data are coded as 999.

Data Screening in R: A Brief Practice

Set-up

Load the R package lavaan

install.packages("lavaan") ## only need to install once

library(lavaan) ## load the package each time

Define a working directory where the data is stored

setwd("C:/Users/xu.1384/Documents/lavaan workshop")

Read in data

## header = TRUE: our data has variable names; na.strings: our missingness is coded as 999.

science <- read.table("science_15.dat", header = TRUE, na.strings = "999")

## to display the first six lines of data

head(science) ## everything looks OK?

Set-up

## in case the first column name is garbled on import

names(science)[1] <- "male"

Data Management

Basic summary

dim(science) ## to display the dimension

[1] 5712 24

summary(science) ## to produce summary information of each V

sapply(science, function(x) sum(is.na(x))/length(x)) ## to obtain missing rate
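Since science_15.dat is not bundled with these slides, here is a small sketch of the same screening steps on a made-up data frame. The object toy and its values are invented for illustration; only the 999 missing code and the variable names follow the workshop data.

```r
# Made-up data using the workshop's missing-data code (999)
toy <- data.frame(
  male    = c(1, 0, 1, 0, 999),
  pa_sup1 = c(4, 3, 999, 2, 4),
  pa_sup2 = c(3, 999, 999, 1, 4)
)

# recode 999 to NA (read.table(..., na.strings = "999") does this at import time)
toy[toy == 999] <- NA

# missing rate per variable: the share of NA values in each column
miss_rate <- colMeans(is.na(toy))
miss_rate

# number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))
n_complete
```

colMeans(is.na(...)) gives the same result as the sapply() call above in a single step.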


Data Management

To display variable distributions

par(mfrow=c(2,2)) ## tell R to display graphs in 2*2 format

pa <- science[, 2:5] ## to extract parental support items

sapply(pa, function(x) hist(x)) ## to exhibit the histogram of the 4 Vs


Data Management

How are the variables related with each other?

pairs(pa, panel=panel.smooth) ## to show a scatterplot matrix with smoothed trend lines

Data Management

To get the bivariate correlations

cor(pa, use="complete.obs", method="pearson") ## to show the correlations


Confirmatory Factor Analysis


One-factor CFA Example

Parental support (4 items)

Item descriptions

pa_sup1: My parents are interested in my school activities.

pa_sup2: My parents support my educational efforts and achievements.

pa_sup3: My parents support me when I am facing difficulties at school.

pa_sup4: My parents encourage me to be confident.


CFA: a useful tool for measurement purposes.

To test how well the 4 items represent “parental support”.


R code for model fitting

## specify the one-factor CFA model, naming the latent factor to be pa_sup

pa.model <- 'pa_sup =~ pa_sup1 + pa_sup2 + pa_sup3 + pa_sup4'

## fit the model, fill in the model, data, and estimator;

model.pa <- cfa(model = pa.model, data = science, estimator = "MLR", mimic="mplus")

##Note. the naming in the left hand side of “<-” is flexible.

## to obtain the output for the analysis

summary(model.pa, fit.measures =TRUE, standardized=TRUE, rsquare=T)


One-factor CFA Example

[Output: model fit. The "Robust" column refers to the MLR estimates; please use this one.]


Estimates

[Output: parameter estimates. Std.lv: scaling the latent factor to have a variance of 1. Std.all: scaling both the latent factor and the DVs to have a variance of 1.]

One-factor CFA Example

## to flexibly extract fit indices of interest

fitMeasures(model.pa, c("cfi.scaled", "tli.scaled","rmsea.scaled","srmr","aic","bic"))

## to extract unstandardized parameter estimates

parameterEstimates(model.pa)

## to obtain the standardized factor loadings only, use the following two steps:

std <- standardizedSolution(model.pa)

std[std$op=="=~","est.std"] ## std$op==“=~” means factor loadings

[1] 0.7167092 0.7942240 0.8349395 0.8230712

Note. XX.scaled refers to the MLR estimates.
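The extraction steps can be tried without the PISA file. The sketch below repeats them on HolzingerSwineford1939, a data set that ships with lavaan; the two-factor model and the object names (hs.model, fit_hs) are stand-ins, not part of the workshop analysis.

```r
library(lavaan)

# Stand-in model on lavaan's bundled data
hs.model <- 'visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6'
fit_hs <- cfa(hs.model, data = HolzingerSwineford1939, estimator = "MLR")

# selected fit indices, rounded for reporting
fit_idx <- round(fitMeasures(fit_hs, c("cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr")), 3)
fit_idx

# standardized loadings together with factor names, item names, and p values
std_hs <- standardizedSolution(fit_hs)
std_loadings <- std_hs[std_hs$op == "=~", c("lhs", "rhs", "est.std", "pvalue")]
std_loadings
```

Keeping the lhs and rhs columns makes it clear which loading belongs to which item, which the bare vector of est.std values does not show.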


When data is treated as ordinal


One-factor CFA: ordinal data

## check frequency distribution of each V first.

sapply(pa, function(x) table(x)) ## to apply table(x) to all the Vs in pa

One-factor CFA: ordinal data

## fit the same model while treating the variables as ordinal type

model.pa_cat <- cfa(model=pa.model, data=science, estimator = "WLSMV", mimic="mplus", ordered=c("pa_sup1","pa_sup2","pa_sup3","pa_sup4"))

## same procedures for summary and results extraction.

summary(model.pa_cat, fit.measures=TRUE, standardized=TRUE, rsquare=T)

fitMeasures(model.pa_cat, c("cfi.scaled", "tli.scaled","rmsea.scaled","srmr"))

std <- standardizedSolution(model.pa_cat) ## to obtain standardized estimates

std[std$op=="=~","est.std"] ## to extract standardized loadings

[1] 0.7945616 0.8855959 0.9049610 0.8892163
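To see the continuous vs ordinal contrast without the PISA file, here is a sketch on simulated data. Everything in it is an assumption made for illustration: the population loadings of 0.8, the cut-points used to create 4-point items, and the object names.

```r
library(lavaan)
set.seed(2015)

# simulate continuous responses from a hypothetical one-factor population model
pop.model <- 'f =~ 0.8*y1 + 0.8*y2 + 0.8*y3 + 0.8*y4'
cont <- simulateData(pop.model, sample.nobs = 1000)

# cut each variable into a 4-point item (arbitrary cut-points)
likert <- as.data.frame(lapply(cont, function(x)
  as.integer(cut(x, breaks = c(-Inf, -1, 0, 1, Inf)))))

fit.model <- 'f =~ y1 + y2 + y3 + y4'

# same model, items treated as continuous vs ordinal
fit_cont <- cfa(fit.model, data = likert, estimator = "MLR")
fit_ord  <- cfa(fit.model, data = likert, estimator = "WLSMV",
                ordered = names(likert))

# helper: standardized loadings only
std_load <- function(fit) {
  s <- standardizedSolution(fit)
  s[s$op == "=~", "est.std"]
}
load_cont <- std_load(fit_cont)
load_ord  <- std_load(fit_ord)
rbind(continuous = load_cont, ordinal = load_ord)
```

Printing the two rows side by side mirrors the comparison of standardized loadings made on the slides.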

One-factor CFA

• Questions so far?

Highlights

Basic data management

Specify model in lavaan

Specify the fitting function, if data is ordinal, use “ordered=…”

summary() for model fit and estimates

Ways to extract specific information from output, e.g., fit indices, loadings.


Practice I

Science efficacy (8 items, sci_eff1 – sci_eff8).

To create an "efficacy" data set

Hint: efficacy <- science[, paste0("sci_eff", 1:8)] ## select the 8 items by name (the data set has only 24 columns)

To make histogram plot for the 8 items

Hint: sapply(efficacy, function(x) hist(x))

Please fit a one-factor CFA model for sci_eff treating data as continuous

Does it fit well? All loadings significant?

Please fit the same model while treating data as ordinal.

Does it fit well? All loadings significant?

Please compare selected model fit indices (cfi.scaled, tli.scaled, rmsea.scaled, srmr), and standardized loadings for the two approaches. Which one is better?


Item Descriptions for Efficacy

To what extent I can…

sci_eff1 Recognise the science question that underlies a newspaper report on a health issue.

sci_eff2 Explain why earthquakes occur more frequently in some areas than in others.

sci_eff3 Describe the role of antibiotics in the treatment of disease.

sci_eff4 Identify the science question associated with the disposal of garbage.

sci_eff5 Predict how changes to an environment will affect the survival of certain species.

sci_eff6 Interpret the scientific information provided on the labelling of food items.

sci_eff7 Discuss how new evidence can lead you to change your understanding about the possibility of life on Mars.

sci_eff8 Identify the better of two explanations for the formation of acid rain.

Sample Output

Model fit comparison

Standardized loading comparison


Two-factor CFA Example

Science Motivation (9 items)

Item descriptions


mot1_enj I have fun when I am learning science

mot2_enj I like reading about science topics.

mot3_enj I am happy working on science topics.

mot4_enj I enjoy acquiring new knowledge in science.

mot5_enj I am interested in learning about science.

mot6_int Making an effort in my science subject(s) is worth it because this will help me in the work I want to do later on.

mot7_int What I learn in my science subject(s) is important for me because I need this for what I want to do later on.

mot8_int Studying my science subject(s) is worthwhile for me because what I learn will improve my career prospects.

mot9_int Many things I learn in my science subject(s) will help me to get a job.

Two-factor CFA Example

R code for model fitting

## specify the two-factor CFA model, naming the latent factor to be enjoy & instru

mot.2f <- 'enjoy =~ mot1_enj+mot2_enj+mot3_enj+mot4_enj+mot5_enj

instru =~ mot6_int+mot7_int+mot8_int+mot9_int'

## fit the model, fill in the model, data, and estimator;

model.mot_2f<-cfa(mot.2f, data=science, estimator = "MLR", mimic="mplus")

##Note. the naming in the left hand side of “<-” is flexible.

## to obtain the output for the analysis

summary(model.mot_2f, fit.measures=TRUE, standardized=TRUE, rsquare=T)


Two-factor CFA Example

[Output: model fit and estimates. What's the inter-factor correlation?]

Practice II

Assuming continuous data, fit a one-factor CFA model to the motivation data (9 items);

Compare the one-factor vs two-factor CFA in terms of model fit and parameter estimates.

Which one is better?

Note. If estimator = “ML”, chi-square difference test between two nested models can be made by anova(model1, model2). For the practice, simply look at the respective output.
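The anova() comparison mentioned in the note can be sketched on lavaan's bundled HolzingerSwineford1939 data. The one-factor and two-factor models below are stand-ins for the motivation models, and the object names are made up.

```r
library(lavaan)

# stand-in models: one general factor vs two correlated factors
one.f <- 'g =~ x1 + x2 + x3 + x4 + x5 + x6'
two.f <- 'visual  =~ x1 + x2 + x3
          textual =~ x4 + x5 + x6'

fit_1f <- cfa(one.f, data = HolzingerSwineford1939, estimator = "ML")
fit_2f <- cfa(two.f, data = HolzingerSwineford1939, estimator = "ML")

# chi-square difference test between the two nested models
cmp <- anova(fit_1f, fit_2f)
cmp
```

The one-factor model is the two-factor model with the inter-factor correlation fixed to 1, so the two differ by one degree of freedom; a significant difference favors the two-factor model.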


Sample Output

fit indices of one-factor vs two-factor CFA


Other Models Could be Considered

Bi-factor model: appropriate when all variables are posited to be explained by a single underlying construct, while groups of items also share some uniqueness.

[Figure: bi-factor diagram showing the general trait factor, the uniqueness for intrinsic motivation, and the uniqueness for extrinsic motivation.]
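A bi-factor specification for the nine motivation items might be written as follows. This is a sketch, not the workshop's own code: the factor names general, enjoy_s, and instru_s are hypothetical, and the model is only parsed here, not fitted.

```r
library(lavaan)

# every item loads on the general factor and on one specific factor
mot.bifactor <- '
  general  =~ mot1_enj + mot2_enj + mot3_enj + mot4_enj + mot5_enj + mot6_int + mot7_int + mot8_int + mot9_int
  enjoy_s  =~ mot1_enj + mot2_enj + mot3_enj + mot4_enj + mot5_enj
  instru_s =~ mot6_int + mot7_int + mot8_int + mot9_int
'

# parse the syntax and count the loadings per factor
pt <- lavaanify(mot.bifactor)
n_load <- table(pt$lhs[pt$op == "=~"])
n_load

# With the workshop data loaded, a fit could look like this (not run here).
# orthogonal = TRUE keeps the general and specific factors uncorrelated:
# model.mot_bf <- cfa(mot.bifactor, data = science, estimator = "MLR",
#                     orthogonal = TRUE, std.lv = TRUE)
```

Bi-factor models are usually fitted with uncorrelated factors, which is why the commented call sets orthogonal = TRUE.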

Measurement Invariance (MI)


Measurement Invariance (MI)

Group differences on cognitive, psychological, and social traits are of wide interest in the social sciences, e.g.,

• Gender, race, and SES differences in academic achievement.

• Depression levels in clinical vs non-clinical samples.

• Levels of achievement and psychological well-being across Grades 1, 2, 3, ….

Such comparisons assume that the scales measure the same construct across the groups of interest or across time.

How to test it? (Meredith, 1993)

• Configural invariance: same loading patterns

• Metric invariance: equal loadings

• Scalar invariance: equal loadings + intercepts

• Strict invariance: equal loadings + intercepts + error variances (optional)
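The four steps can be run in one loop. The sketch below uses lavaan's bundled HolzingerSwineford1939 data with school as the grouping variable, standing in for the motivation model and gender; the object names (steps, fits, mi_table) are made up.

```r
library(lavaan)

hs.model <- 'visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9'

# equality constraints added at each step ("" = none, i.e. configural)
steps <- list(
  configural = "",
  metric     = "loadings",
  scalar     = c("loadings", "intercepts"),
  strict     = c("loadings", "intercepts", "residuals")
)

# fit one multiple-group model per step
fits <- lapply(steps, function(eq)
  cfa(hs.model, data = HolzingerSwineford1939, group = "school",
      estimator = "MLR", group.equal = eq))

# collect the same fit indices for every step, one row per model
idx <- c("chisq.scaled", "df", "cfi.scaled", "rmsea.scaled", "srmr", "bic")
mi_table <- t(vapply(fits, function(f) as.numeric(fitMeasures(f, idx)),
                     numeric(length(idx))))
colnames(mi_table) <- idx
round(mi_table, 3)
```

Each step adds constraints, so the degrees of freedom grow from one row to the next; the changes in CFI and RMSEA between adjacent rows are what the evaluation criteria are applied to.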

Measurement Invariance (MI)

Research question: is motivation (enjoy & instrumental) invariantly measured across gender?

Configural Invariance: the two groups share the same loading patterns

[Figure: path diagrams for Male and Female.]

Configural Invariance

MI_1 <- cfa(mot.2f, data=science, group="male", estimator = "MLR", mimic="mplus")

fitMeasures(MI_1, c("chisq.scaled","df","cfi.scaled","tli.scaled","rmsea.scaled","srmr","bic"))

## The fit indices refer to how well the model fits the data, assuming the groups share the same factor structure.

summary(MI_1, fit.measures=TRUE, standardized=TRUE)

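Metric Invariance

Metric (Weak) Invariance: the two groups share the same structure + loadings

[Figure: path diagrams for Male and Female. In each group the first loading of each factor is fixed to 1.0; the remaining loadings (λ21, λ31, λ41, λ51 and λ72, λ82, λ92) are equal across groups.]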

Metric Invariance

MI_2 <- cfa(mot.2f, data=science, group="male", estimator = "MLR", mimic="mplus", group.equal=c("loadings"))

fitMeasures(MI_2, c("chisq.scaled","df","cfi.scaled", "tli.scaled","rmsea.scaled","srmr","bic"))

summary(MI_2,fit.measures=TRUE, standardized=TRUE)


Scalar Invariance

Scalar (Strong) Invariance: the two groups share the same structure + loadings + intercepts

[Figure: path diagrams with item intercepts τ1, τ2, …, τ9. All the intercepts are equal across groups.]

Scalar Invariance

MI_3 <- cfa(mot.2f, data=science, group="male", estimator = "MLR", mimic="mplus", group.equal=c("loadings", "intercepts"))

fitMeasures(MI_3, c("chisq.scaled","df","cfi.scaled","tli.scaled","rmsea.scaled","srmr","bic"))


Latent mean differences can be estimated at this step.

Strict Invariance

Strict invariance: equal error variances on top of strong invariance. Not mandatory.

MI_4 <- cfa(mot.2f, data=science, group="male", estimator = "MLR", mimic="mplus",

group.equal=c("loadings", "intercepts","residuals"))

fitMeasures(MI_4, c("chisq.scaled","df","cfi.scaled", "tli.scaled","rmsea.scaled","srmr","bic"))



Summary of MI Steps

Fit and changes in fit

Model        χ2       df   CFI     TLI     RMSEA   SRMR    BIC       Δχ2       Δdf   ΔCFI     ΔTLI     ΔRMSEA
Configural   380.761  52   0.986   0.980   0.048   0.013   78755.5
Metric       410.833  59   0.985   0.981   0.046   0.018   78714.7   -30.072   -7    0.001    -0.001   0.002
Scalar       441.294  66   0.984   0.982   0.045   0.018   78677     -30.461   -7    0.001    -0.001   0.001
Strict       415.363  75   0.985   0.986   0.040   0.020   78624.9   25.931    -9    -0.001   -0.004   0.005

Note. The Δ columns are differences in the fit indices between adjacent models: less constrained model – more constrained model. E.g., the Metric row is calculated as Configural – Metric.

Evaluation Criteria:

a) anova(model1, model2) can be used for the χ2 difference test if estimator = ML. When estimator = MLR, please refer to the Mplus website: https://www.statmodel.com/chidiff.shtml

χ2 difference test is sensitive to sample size, so tentatively skipped.

b) Changes of less than .01 in CFI and TLI and less than .015 in RMSEA indicate a negligible change in fit (Cheung and Rensvold, 2002; Chen, 2007)
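Criterion (b) can be checked with a few lines of base R. The sketch below recomputes the Δ columns from the CFI, TLI, and RMSEA values reported in the MI summary table and flags each step against the cut-offs; the object names (fit, delta) are made up.

```r
# fit indices copied from the MI summary table
fit <- data.frame(
  model = c("Configural", "Metric", "Scalar", "Strict"),
  cfi   = c(0.986, 0.985, 0.984, 0.985),
  tli   = c(0.980, 0.981, 0.982, 0.986),
  rmsea = c(0.048, 0.046, 0.045, 0.040)
)

# less constrained minus more constrained, for each adjacent pair of models
delta <- data.frame(
  step    = paste(fit$model[-nrow(fit)], "-", fit$model[-1]),
  d_cfi   = round(-diff(fit$cfi), 3),
  d_tli   = round(-diff(fit$tli), 3),
  d_rmsea = round(-diff(fit$rmsea), 3)
)

# flag the steps whose changes stay within the cut-offs
delta$negligible <- abs(delta$d_cfi) < 0.01 &
  abs(delta$d_tli) < 0.01 &
  abs(delta$d_rmsea) < 0.015
delta
```

All three steps stay within the cut-offs, which is what supports invariance up to the strict level in this example.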

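For criterion (a) under MLR, the scaled difference test described on the Mplus page can be computed by hand. The helper below is a sketch of that formula. The function name sb_scaled_diff is made up, and the scaling correction factors in the example call are hypothetical because the slides do not report them; in practice they are read off each model's output.

```r
# Scaled chi-square difference test for two nested models
# 0 = nested (more constrained) model, 1 = comparison (less constrained) model
# T = scaled chi-square, c = scaling correction factor, d = degrees of freedom
sb_scaled_diff <- function(T0, c0, d0, T1, c1, d1) {
  cd  <- (d0 * c0 - d1 * c1) / (d0 - d1)   # difference-test scaling correction
  TRd <- (T0 * c0 - T1 * c1) / cd          # scaled difference statistic
  df  <- d0 - d1
  c(TRd = TRd, df = df, p = pchisq(TRd, df, lower.tail = FALSE))
}

# Metric (nested) vs configural (comparison): chi-squares and df are from the
# MI summary table; c0 = 1.20 and c1 = 1.22 are HYPOTHETICAL correction factors
sb_scaled_diff(T0 = 410.833, c0 = 1.20, d0 = 59,
               T1 = 380.761, c1 = 1.22, d1 = 52)
```

When both correction factors equal 1, the statistic reduces to the plain difference between the two chi-square values.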

Summary of MI Steps

Implications of the MI results:

Motivation has the same meaning across gender.

Latent means of enjoy & instru are comparable across gender.

Observed means are also comparable across gender.

What if MI is not satisfied at a certain step? => partial invariance (for details, please refer to Hirschfeld & von Brachel, 2014).
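A partial invariance model frees selected parameters while keeping the rest constrained; in lavaan this is done with the group.partial argument. The sketch below uses the bundled HolzingerSwineford1939 data, and the two freed intercepts (x3 and x7) are chosen only for illustration.

```r
library(lavaan)

hs.model <- 'visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9'

# full scalar invariance: all loadings and intercepts equal across groups
fit_scalar <- cfa(hs.model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts"))

# partial scalar invariance: the intercepts of x3 and x7 may differ across groups
fit_partial <- cfa(hs.model, data = HolzingerSwineford1939, group = "school",
                   group.equal = c("loadings", "intercepts"),
                   group.partial = c("x3~1", "x7~1"))

# compare the fit of the two models
rbind(scalar  = fitMeasures(fit_scalar,  c("chisq", "df", "cfi", "rmsea")),
      partial = fitMeasures(fit_partial, c("chisq", "df", "cfi", "rmsea")))
```

Freeing two intercepts costs two degrees of freedom; which parameters to free is normally guided by modification indices and theory rather than picked in advance.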


Practice III

Please run a series of MI models for science efficacy (sci_eff1 –sci_eff8, the one-factor CFA model) across gender.

The χ2 difference test can be skipped for now.

Don't worry about partial invariance; just stop at the first step where MI fails.

Consider using the following table to help organize results.

Model CFI TLI RMSEA SRMR BIC ΔCFI ΔTLI ΔRMSEA

Configural

Metric

Scalar

Strict

Output for Reference


Structural Equation Modeling


SEM Example

Research question: what are the effects of gender (male), home educational resources (HEDRES), and motivation (enjoy & instru) on science performance (PV1SCIE)?

Observed variables: male, HEDRES, PV1SCIE.

Latent variables: enjoy, instru

[Figure: path diagram with the structural part highlighted.]

SEM Example

lavaan code

## specify the SEM model

sem1 <- 'enjoy =~ mot1_enj+mot2_enj+mot3_enj+mot4_enj+mot5_enj

instru =~ mot6_int+mot7_int+mot8_int+mot9_int

PV1SCIE ~ HEDRES + male + enjoy + instru'

## fit the model sem1

sem_1 <- sem(sem1, data=science, estimator = "MLR", mimic="mplus")

## get the selected fit indices

fitMeasures(sem_1, c("chisq.scaled","df","cfi.scaled", "tli.scaled","rmsea.scaled","srmr","bic"))

## to get the output for model fit and parameter estimates

summary(sem_1, fit.measures=TRUE, standardized=TRUE)

SEM Example

[Output: check the p values for the estimates. Anything nonsignificant?]

Practice IV

Please delete the nonsignificant predictor(s) for PV1SCIE and re-run the SEM model.

How does it fit?

SEM_Mediation

Research question: what is the effect of HEDRES on PV1SCIE mediated by motivation factors?

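[Figure: mediation path diagram. HEDRES → enjoy (a1) → PV1SCIE (b1); HEDRES → instru (a2) → PV1SCIE (b2); HEDRES → PV1SCIE (c).]

Direct effect: c

Indirect effects: a1*b1 and a2*b2

Total effect: c + a1*b1 + a2*b2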
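The decomposition can be verified numerically with observed variables and ordinary regression. The sketch below uses simulated data; the variable names (x, m1, m2, y) and the population coefficients are made up. For least-squares estimates, the total effect c + a1*b1 + a2*b2 equals the slope from regressing the outcome on the predictor alone.

```r
set.seed(1)
n  <- 500
x  <- rnorm(n)                                   # predictor (plays the role of HEDRES)
m1 <- 0.5 * x + rnorm(n)                         # mediator 1
m2 <- 0.3 * x + rnorm(n)                         # mediator 2
y  <- 0.2 * x + 0.4 * m1 + 0.6 * m2 + rnorm(n)   # outcome

# a paths: predictor -> mediators
a1 <- coef(lm(m1 ~ x))[["x"]]
a2 <- coef(lm(m2 ~ x))[["x"]]

# direct effect (c) and b paths: outcome on predictor and both mediators
full     <- coef(lm(y ~ x + m1 + m2))
c_direct <- full[["x"]]
b1       <- full[["m1"]]
b2       <- full[["m2"]]

# indirect and total effects, as defined above
indirect1 <- a1 * b1
indirect2 <- a2 * b2
total     <- c_direct + indirect1 + indirect2

# the total effect matches the slope of y on x alone
total_check <- coef(lm(y ~ x))[["x"]]
c(total = total, total_check = total_check)
```

The latent-variable model in lavaan follows the same logic, with the := operator defining the products and the sum.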

SEM_Mediation: lavaan code

mediation <- 'enjoy =~ mot1_enj+mot2_enj+mot3_enj+mot4_enj+mot5_enj

instru =~ mot6_int+mot7_int+mot8_int+mot9_int

# direct effect

PV1SCIE ~ c*HEDRES

# mediator

enjoy ~ a1*HEDRES

PV1SCIE ~ b1*enjoy

instru ~ a2*HEDRES

PV1SCIE ~ b2*instru

# indirect effect (a*b)

a1b1 := a1*b1

a2b2 := a2*b2

# total effect

total := c + (a1*b1)+ (a2*b2)

# allow the residuals of the two mediators to be correlated, not a default in sem()

enjoy ~~ instru'

med <- sem(mediation, data=science, estimator = "MLR", mimic="mplus")

fitMeasures(med, c("chisq.scaled","df","cfi.scaled", "tli.scaled","rmsea.scaled","srmr","bic"))

summary(med, fit.measures=TRUE, standardized=TRUE, rsquare=T)

For bootstrap standard errors, use the following in the fitting function:

sem(mediation, data=science, mimic="mplus", se = "boot", bootstrap = 10000)
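With bootstrap standard errors, percentile confidence intervals for the indirect effect can be requested from parameterEstimates(). The sketch below uses simulated observed variables and only 200 bootstrap draws to keep the run time short; the data, the object names, and the number of draws are assumptions for illustration.

```r
library(lavaan)
set.seed(123)

# simulated data for a single-mediator model (made-up coefficients)
n <- 300
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + 0.2 * x + rnorm(n)
toy <- data.frame(x, m, y)

toy.med <- '
  m ~ a*x
  y ~ b*m + c*x
  ab := a*b        # indirect effect
'

# bootstrap standard errors (200 draws only to keep the example fast)
fit_boot <- sem(toy.med, data = toy, se = "boot", bootstrap = 200)

# percentile bootstrap confidence intervals
est <- parameterEstimates(fit_boot, boot.ci.type = "perc")
ab_row <- est[est$op == ":=", c("lhs", "est", "se", "ci.lower", "ci.upper")]
ab_row
```

An interval for ab that excludes zero is the usual bootstrap evidence for an indirect effect.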

SEM_Mediation

Output

a) How does the model fit?

b) How much variance in the DV is explained?

c) Any interesting findings?

Practice V

Please fit a mediation model as shown in the following graph:

[Figure: mediation path diagram with paths a, b, and c.]

Exogenous: HEDRES

Mediator: science efficacy (latent)

Outcome: PV1SCIE

Some References

• Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255.

• Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504.

• Hirschfeld, G., & von Brachel, R. (2014). Multiple-group confirmatory factor analysis in R: A tutorial in measurement invariance with continuous and ordinal indicators. Practical Assessment, Research and Evaluation, 19(7), 1-12.

• Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525-543.

• Organization for Economic Co-Operation and Development (OECD). (2016). Country note: Key findings from PISA 2015 for the United States. Paris, France: OECD Publishing. Available online: https://www.oecd.org/pisa/pisa-2015-United-States.pdf

• Rosseel, Y. (2012). lavaan: an R package for structural equation modeling. Journal of Statistical Software, 48, 1-36.

• Rosseel, Y. (2017). The lavaan tutorial. Retrieved from http://lavaan.ugent.be/tutorial/tutorial.pdf


Thanks for joining us!

Questions & Feedback
