
Overview of Main Survey Data Analysis and Scaling

National Research Coordinators Meeting, Madrid, February 2010


Content of presentation

• Scaling and analysis of test items

• Scaling and analysis of questionnaire items

• Data analysis for the reporting of ICCS data


Steps in analysis

• Preliminary analysis of first data sets received
  – Review at JMC data analysis meeting in Hamburg in July 2009

• Analysis of cleaned and uncleaned data sets from almost all participating countries
  – Review at PAC meeting in Tallinn (Oct 2009) and JMC data analysis meeting in Hamburg in early December 2009

• Final scaling and analysis with clean data from all 38 countries


Test item analysis

• Review of missing data

• Analysis of item dimensionality

• Review of item statistics (international)

• Analysis of differential item functioning by gender

• Analysis of item-by-country interaction
  – Measurement equivalence

• Item adjudication


Scaling model

• Rasch one-parameter model

• P_i(θ_n) is the probability for person n to score 1 on item i; θ_n is the estimated ability of person n and δ_i the estimated difficulty of item i:

$$P_i(\theta_n) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}$$
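As an illustration, a minimal sketch of this probability function in Python (the function name and example values are illustrative, not from the presentation):

```python
import numpy as np

def rasch_p(theta, delta):
    """Probability of a correct response (score 1) under the Rasch model."""
    return np.exp(theta - delta) / (1.0 + np.exp(theta - delta))

# A student one logit above an item's difficulty answers correctly ~73% of the time.
print(rasch_p(theta=1.0, delta=0.0))  # ~0.731
```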


Probability curves

[Figure: probability curves, probability of a correct response (0 to 1) plotted against ability from -4 to +4]


Partial credit model

• For open-ended items (and questionnaire items) with more than two categories the Partial Credit model was used:

• Here, τ_ik denotes an additional step parameter:

$$P_{x_i}(\theta_n) = \frac{\exp\left(\sum_{k=0}^{x_i}(\theta_n - \delta_i + \tau_{ik})\right)}{\sum_{h=0}^{m_i}\exp\left(\sum_{k=0}^{h}(\theta_n - \delta_i + \tau_{ik})\right)}, \qquad x_i = 0, 1, \ldots, m_i$$
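A minimal sketch of the PCM category probabilities, under the usual convention that the sum for x = 0 is empty (function and variable names are illustrative):

```python
import numpy as np

def pcm_probs(theta, delta, taus):
    """Category probabilities P(X = x), x = 0..m, under the Partial Credit Model.

    theta: person ability; delta: item location;
    taus: step parameters tau_1..tau_m (the sum for x = 0 is empty).
    """
    # Exponents of the numerators: 0 for category 0, cumulative sums for 1..m
    exps = np.concatenate(([0.0], np.cumsum(theta - delta + np.asarray(taus))))
    numer = np.exp(exps - exps.max())  # subtract max for numerical stability
    return numer / numer.sum()

# Example: a three-category item (m = 2) with steps at -1 and +1
print(pcm_probs(theta=0.5, delta=0.0, taus=[-1.0, 1.0]))  # probabilities sum to 1
```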


Threshold curves

[Figure: threshold curves for a four-category item (Strongly agree, Agree, Disagree, Strongly disagree), probability (0.0 to 1.0) plotted against THETA from -4.00 to +4.00, with thresholds labelled 1, 2 and 3]


Response probabilities

[Figure: response probability curves for the same four-category item, probability (0.0 to 1.0) plotted against THETA from -4.00 to +4.00]


Missing data issues

• Different categories of missing data

• Omitted responses
  – Somewhat higher percentages for open response items

• Invalid responses
  – Generally very low percentages

• Not reached responses
  – Omitted items at end of test booklets
  – Generally low, in a few countries more considerable


Not reached % by region


Test characteristics

• Test items were generally a little easier than the average student abilities (pooled across countries)

• Test reliability was 0.84 (similar to CIVED assessment)

• Very high latent correlations between possible sub-dimensions
  – Decision not to pursue sub-scales


Mapping of test items to abilities

[Item-person map: columns of X's show the distribution of student abilities against the difficulty locations of the numbered test items on a common logit scale from about +2 down to -2]


Review of item scaling properties

• Most items had excellent scaling properties
  – Weighted mean square item fit
  – Item-total correlation
  – Item characteristic curves

• Only one test item (CI2HRM2) was omitted from scaling


Item statistics

Item 37 (CI2HRM2)                Cases for this item: 7574
Item-Rest Cor.: 0.09             Weighted MNSQ: 1.23
Item Threshold(s): -0.37         Item Delta(s): -0.37
------------------------------------------------------------------------------
 Label   Score   Count   % of tot   Pt Bis       t (p)      PV1Avg:1   PV1 SD:1
------------------------------------------------------------------------------
   1      0.00     222      2.93     -0.15   -13.31 (.000)    -0.78      0.85
   2      1.00    4401     58.11      0.09     7.44 (.000)     0.12      0.90
   3      0.00     449      5.93     -0.12   -10.71 (.000)    -0.43      0.83
   4      0.00    2392     31.58      0.05     4.23 (.000)     0.00      0.86
   7      0.00      45      0.59     -0.07    -5.92 (.000)    -0.97      1.08
   9      0.00      65      0.86     -0.06    -4.80 (.000)    -0.52      0.73
==============================================================================


Item characteristic curves


Scoring reliabilities - 1

• Open-ended items were scored according to international scoring guidelines

• Double-scoring of sub-samples

• On average, percentages of scorer agreement ranged between 84% and 92% across participating countries


Scoring reliabilities - 2

• Only items where scorer agreement was 70% or more were accepted

• Data for items where this criterion was not met were not included in scaling

• In two countries open-ended items were consistently easier than other items
  – Omitted from scaling and database
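A minimal sketch of the exact-agreement percentage behind these criteria (a hypothetical helper, not the ICCS scoring software):

```python
import numpy as np

def percent_agreement(scores_a, scores_b):
    """Exact percentage agreement between two scorers on the same responses."""
    a = np.asarray(scores_a)
    b = np.asarray(scores_b)
    return 100.0 * np.mean(a == b)

# Items scoring below the 70% criterion would be excluded from scaling.
print(percent_agreement([1, 0, 2, 1, 1], [1, 0, 2, 0, 1]))  # 80.0
```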


Gender DIF

• DIF estimates reflect the differences between item difficulties for males and females of equal ability
  – This may cause bias in favour of one group

• Generally, only a few items with gender DIF were found


Cross-national measurement equivalence

• Occurrence of item-by-country interaction
  – Items relatively much harder in some countries but much easier in others

• In ICCS, national item calibrations were compared with those for the international calibration sample

• Standard errors were adjusted for sample design effects and multiple comparisons


Example for CI2HRM2


Item-by-country interaction

• Generally, items tended to behave in a similar way

• A number of items showed parameter variance
  – Sometimes due to translation errors
  – Often due to other factors (national context, curricula)

• Occurrence of some parameter variation across countries
  – Similar results to those in other cross-national studies


Item adjudication

• Based on results from scaling analysis (item statistics, item curves, item-by-country interaction etc.)

• International item adjudication
  – Omission of CI2HRM2 from scaling

• National item adjudication
  – Re-verification for items with larger discrepancies in item difficulty
  – Omission of items with translation or scoring issues from national scaling


Calibration of items

• Based on an international calibration sample of 500 randomly selected students from each of the 36 participating countries that met sampling requirements

• ACER ConQuest was used for estimation

• Booklet effects adjusted by including booklet as a facet in the scaling model


Scaling methodology

• Plausible values were generated as student ability estimates
  – More information at the workshop!

• Dummy indicators for classroom and all student-level variables (international and regional) were included in the conditioning model

• Scale scores were set to an international metric with a mean of 500 and a standard deviation of 100 for equally weighted countries (see the sketch below)
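A sketch of how such a transformation onto the 500/100 metric could look, assuming a table with one score per row; the column names and the senate-style country weighting are illustrative assumptions, not the ICCS implementation:

```python
import numpy as np
import pandas as pd

def to_international_metric(df, pv_col="pv1", country_col="country", weight_col="weight"):
    """Rescale scores so that equally weighted countries have mean 500 and SD 100."""
    # Senate-style weights: each country's weights sum to the same total (1.0)
    w = df.groupby(country_col)[weight_col].transform(lambda s: s / s.sum())
    mean = np.average(df[pv_col], weights=w)
    sd = np.sqrt(np.average((df[pv_col] - mean) ** 2, weights=w))
    return 500.0 + 100.0 * (df[pv_col] - mean) / sd
```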


Estimation of changes in cognitive knowledge - 1

• 17 test items from CIVED were included as an intact cluster

• 17 countries with comparable data
  – Three countries with grade 9 in CIVED and additional grade 9 samples in ICCS

• A small number of items in some countries had to be discarded due to translation errors or differences between ICCS and CIVED


Estimation of changes in cognitive knowledge - 2

• Comparison of item parameters showed high similarity (correlation of 0.95)

• Slight positioning effect due to different test designs
  – CIVED: one booklet
  – ICCS: CIVED link cluster in each of the three positions
    • CIVED items at the beginning slightly easier, at the end slightly harder than in ICCS


Estimation of changes in cognitive knowledge - 3


Estimation of changes in cognitive knowledge - 4

• Framework broadened since CIVED
  – Re-scaling CIVED data to equate with ICCS not appropriate

• Selection of CIVED items not representative of the overall CIVED test
  – Equating link items with the CIVED scale (or sub-scale) also not appropriate

• Solution: establish a new comparison scale based only on the 17 link items


Estimation of changes in cognitive knowledge - 5

• Concurrent calibration of item parameters based on 34 calibration samples (CIVED and ICCS) from the 17 countries

• Establishing a metric with a mean of 500 and SD of 100 for the 17 equally weighted CIVED countries

• For results in tables, weighted likelihood estimates were used
  – Usually unbiased for country averages


Questionnaire item analysis

• Missing data issues

• Item dimensionality and scaling review

• Item/scale adjudication

• Scaling procedures


Missing data - 1

• On average about 3 percent of students have missing scale scores
  – Only two countries have higher percentages (18 and 12 percent)

• For the teacher survey data, relatively low percentages of missing data were found (about 2 percent)

• Very low percentages of missing data in the school questionnaire


Missing data - 2

• Concerns about missing data for socio-economic indicators
  – Highest parental occupation: 5%
  – Highest parental education: 3%
  – Books at home: 1%

• However, in a few countries higher percentages of missing data were found (up to 15% for parental education)


Analysis of item dimensionality

• Exploratory and confirmatory factor analyses showed generally very similar results to those from the field trial

• These analyses will be described in detail in the ICCS technical report


Scaling analysis

• Scale reliabilities (Cronbach's alpha)
  – Values over 0.7 indicate satisfactory internal consistency

• Item-total correlations
  – Useful for detecting translation errors

• Scaling with the IRT Partial Credit Model
  – Item fit
  – Category characteristic curves
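For reference, a minimal Cronbach's alpha computation (the standard formula, not the ICCS code):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_persons, n_items) matrix of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    sum_item_var = x.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = x.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1.0 - sum_item_var / total_var)
```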


Item and scale adjudication

• Only three scales with median scale reliabilities below 0.7
  – Democratic value beliefs, civic participation in the community and at school

• Adjudication for student, teacher, school and each regional questionnaire

• Some items were removed from scales

• In some cases, single-item reporting


Scaling procedures - 1

• IRT scaling with Partial Credit Model

• So-called weighted likelihood estimates as scale scores

• International metric with mean of 50 and a standard deviation of 10

• The weighted likelihood estimate for person n is the value of θ_n that solves:

$$r_n + \frac{J_n}{2 I_n} = \sum_{i} \sum_{x=1}^{m_i} x \, \frac{\exp\left(\sum_{k=0}^{x}(\theta_n - \delta_i + \tau_{ik})\right)}{\sum_{h=0}^{m_i}\exp\left(\sum_{k=0}^{h}(\theta_n - \delta_i + \tau_{ik})\right)}$$

where r_n is person n's raw score, I_n is the test information and J_n is its first derivative.
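A sketch of how such an estimate can be computed for the simpler dichotomous case, using Newton iteration (illustrative only; ICCS used the polytomous form above):

```python
import numpy as np

def wle_dichotomous(raw_score, deltas, tol=1e-6, max_iter=50):
    """Warm's weighted likelihood estimate of ability for dichotomous Rasch items.

    Solves  r + J/(2I) - sum_i p_i = 0,  where p_i is the Rasch probability,
    I = sum p_i q_i (test information) and J = sum p_i q_i (q_i - p_i) = dI/dtheta.
    """
    deltas = np.asarray(deltas, dtype=float)
    theta = 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - deltas)))
        q = 1.0 - p
        info = np.sum(p * q)                 # I(theta)
        j = np.sum(p * q * (q - p))          # J(theta)
        f = raw_score + j / (2.0 * info) - np.sum(p)
        theta_new = theta + f / info         # Newton step, using -I as the slope
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Example: score 3 out of 5 items centred at difficulty 0
print(wle_dichotomous(3, [-1.0, -0.5, 0.0, 0.5, 1.0]))
```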


Scaling procedures - 2

• Item parameter calibration with ACER ConQuest

• Calibration samples:
  – 500 students per country
  – 250 teachers per country
  – All school data with equal weights for each country

• Only data from countries that met sampling requirements (categories 1 or 2) were included in the calibration


Questionnaire scales

• Advantages of IRT scales
  – Inclusion of students with at least two item responses per scale
  – Possibility to describe scale scores in terms of item responses

• From the IRT Partial Credit Model it is possible to map scale scores to expected item responses

• Item maps will be provided in an appendix to the international report


Example of item map

[Figure: item-by-score map for Item #1, Item #2 and Item #3, with scale scores from 20 to 80 on the horizontal axis and the response categories Strongly disagree, Disagree, Agree and Strongly agree marked along each item's row]

Examples of how to interpret the item-by-score map:

#1: A respondent with score 60 has a more than 50% probability to strongly agree with item 1 and to at least agree with items 2 and 3

#2: A respondent with score 40 has a more than 50% probability to strongly agree with items 1, 2 and 3

#3: A respondent with score 30 has a more than 50% probability to strongly agree with all three items

#4: A respondent with score 40 has a more than 50% probability to at least disagree with items 1 and 2 but to disagree with item 3

#5: A respondent with score 50 has a more than 50% probability to at least agree with item 1 and to at least disagree with items 2 and 3
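A sketch of how such category boundaries can be located from PCM parameters, reusing the pcm_probs sketch from earlier; the parameters and the simple linear mapping onto the 50/10 metric are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq

def pcm_probs(theta, delta, taus):
    exps = np.concatenate(([0.0], np.cumsum(theta - delta + np.asarray(taus))))
    numer = np.exp(exps - exps.max())
    return numer / numer.sum()

def boundary_theta(delta, taus, category, lo=-6.0, hi=6.0):
    """Ability at which P(response >= category) crosses 50% under the PCM."""
    f = lambda th: pcm_probs(th, delta, taus)[category:].sum() - 0.5
    return brentq(f, lo, hi)

# Boundary above which "strongly agree" (category 3 of 0..3) is more likely than not,
# mapped onto the questionnaire metric with an assumed linear transformation
theta_star = boundary_theta(delta=0.2, taus=[-0.8, 0.0, 0.8], category=3)
print(50.0 + 10.0 * theta_star)
```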


Data analysis for reporting

• Estimation of sampling variance

• Estimation of measurement variance

• Reporting of differences


Estimation of sampling variance

• Data from cluster samples are not simple random samples
  – The standard formula for estimating sampling error is not appropriate

• Jackknife repeated replication technique used for ICCS

• IDB Analyser, WESVAR or SPSS/SAS macros may be used for applying this methodology
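A minimal sketch of the JRR idea, assuming a statistic function and precomputed replicate weight vectors (the exact ICCS replication variant is documented in the technical report):

```python
import numpy as np

def jrr_standard_error(statistic, y, full_weights, replicate_weights):
    """JRR standard error: re-estimate the statistic with each replicate
    weight vector and sum the squared deviations from the full-sample estimate."""
    full_est = statistic(y, full_weights)
    rep_ests = np.array([statistic(y, w) for w in replicate_weights])
    return np.sqrt(np.sum((rep_ests - full_est) ** 2))

# Example statistic: a weighted mean
weighted_mean = lambda y, w: np.average(y, weights=w)
```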


Estimation of measurement variance

• Using plausible values allows estimating the measurement error
  – The variation between the five PVs can be used for estimation

• IDB Analyser, WESVAR or SPSS macros (ACER replicates module) include features to do this

• More information will be provided at the training workshop on Wednesday
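A sketch of the standard combination rule for the five plausible values (total variance = average sampling variance plus (1 + 1/M) times the between-PV variance):

```python
import numpy as np

def pv_standard_error(pv_estimates, sampling_variances):
    """Combine results computed with each of the M plausible values."""
    pv_estimates = np.asarray(pv_estimates, dtype=float)
    m = len(pv_estimates)
    sampling_var = np.mean(sampling_variances)     # average sampling variance
    measurement_var = pv_estimates.var(ddof=1)     # variance between the M PVs
    return np.sqrt(sampling_var + (1.0 + 1.0 / m) * measurement_var)
```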



Reporting of differences - 1

• The following types of significance tests will be reported:
  – For differences in population estimates between countries
  – For differences between a country and the international average
  – For differences in population estimates between subgroups within countries
  – For differences between population estimates in ICCS and in CIVED (trend estimation)


Reporting of differences - 2

• Adjustment for multiple comparisons with the Dunn-Bonferroni method
  – Increasing the critical value (p < .05) from 1.96 to 3.189

• SE for differences between samples:

$$SE_{dif\_ij} = \sqrt{SE_i^2 + SE_j^2}$$

• Estimation of SE for sub-group differences with JRR


Reporting of differences - 3

• For the SE of trend differences it is important to take the equating error into account

• The SE for differences between CIVED and ICCS can be computed as:

$$SE_{ICCS-CIVED\,dif\_ij} = \sqrt{SE_i^2 + SE_j^2 + EqErr^2}$$

• The equating error in the international metric is 3.31
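A minimal sketch combining these two formulas with the Dunn-Bonferroni adjusted critical value from the previous slide (the numbers in the example are illustrative):

```python
import numpy as np

def se_difference(se_i, se_j):
    """SE of a difference between two independent sample estimates."""
    return np.sqrt(se_i ** 2 + se_j ** 2)

def se_trend(se_iccs, se_cived, equating_error=3.31):
    """SE of an ICCS-CIVED trend difference, including the equating error."""
    return np.sqrt(se_iccs ** 2 + se_cived ** 2 + equating_error ** 2)

# A comparison is flagged as significant only if the t-ratio exceeds
# the Dunn-Bonferroni adjusted critical value of 3.189.
diff, se = 12.0, se_difference(3.5, 4.1)
print(abs(diff / se) > 3.189)
```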


Multivariate analysis

• Multiple regression models were used for the tables in draft Chapter 7
  – Bivariate regression
  – Multiple regression

• Multi-level models were used for the analysis in draft Chapter 8
  – Students nested within classrooms
  – Classrooms mostly equivalent to schools


Questions or comments?
