

University of Michigan, English Language Institute • Michigan ELI Testing

VOLUME 8 2010

SPAAN FELLOW Working Papers in Second or Foreign Language Assessment


Spaan Fellow Working Papers in Second or Foreign Language Assessment

Volume 8

2010

Edited by Jeff S. Johnson, Eric Lagergren, and India Plough

In Memoriam

Jeffrey Stuart Johnson 1960–2010

It is with deep sadness that we dedicate this volume of the Spaan Fellow Working Papers to our colleague and friend Jeff Johnson. Jeff came to the University of Michigan’s Testing and Certification Division of the English Language Institute in 2001. He served as program manager of the MELAB and of Test Publications, as chair of the Spaan Fellowship Committee, and as editor of the Spaan Fellow Working Papers. Jeff was known by many in the field of Language Assessment for his attention to ethical testing practices and for his assistance in making archival ELI test data available for use by veteran researchers and young scholars alike. Our current (2010–2011) fellows had the good fortune of having their awards managed by Jeff. The four papers in this volume are written by our 2009–2010 fellows and mark the last Spaan Fellow papers that Jeff supervised. We miss him greatly.


First Printing, September 2010
© 2010 by the English Language Institute, University of Michigan
Spaan Committee Members: N. N. Chen, B. Dobson, J. S. Johnson, E. Lagergren, I. Plough
Production: E. Lagergren, B. Wood
The Regents of the University of Michigan: Julia Donovan Darlow, Laurence B. Deitch, Denise Ilitch, Olivia P. Maynard, Andrea Fischer Newman, Andrew C. Richner, S. Martin Taylor, Katherine E. White, Mary Sue Coleman (ex officio)


Table of Contents

Spaan Fellowship Information ..... iv
Previous Volume Article Index ..... v

Hyun Jung Kim
Investigating the Construct Validity of a Speaking Performance Test ..... 1

Christine Goh & S. Vahid Aryadoust
Investigating the Construct Validity of the MELAB Listening Test through the Rasch Analysis and Correlated Uniqueness Modeling ..... 31

Kornwipa Poonpon
Expanding a Second Language Speaking Rating Scale for Instructional and Assessment Purposes ..... 69

Gad S. Lim
Investigating Prompt Effects in Writing Performance Assessment ..... 95


The University of Michigan

SPAAN FELLOWSHIP FOR STUDIES IN SECOND OR FOREIGN LANGUAGE ASSESSMENT

In recognition of Mary Spaan’s contributions to the field of language assessment for more than three decades at the University of Michigan, the English Language Institute initiated the Spaan Fellowship Fund to provide financial support for those wishing to carry out research projects related to second or foreign language assessment. These fellowships were offered to cover the cost of data collection and analyses. Fellows submitted a paper at the end of the year to be published in the Spaan Fellow Working Papers. Some fellows made use of the English Language Institute’s resources to carry out a research project in second or foreign language assessment. These resources included the ELI Testing and Certification Division’s extensive archival test data (ECCE, ECPE, and MELAB) and the Michigan Corpus of Academic Spoken English (MICASE). For more information about previous Spaan fellows and volumes of the Spaan Fellow Working Papers, please visit our website.

www.lsa.umich.edu/eli/research/spaan


Previous Volume Article Index

Development of a Standardized Test for Young EFL Learners. Fleurquin, Fernando. 1, 1–23

A Construct Validation Study of Emphasis Type Questions in the Michigan English Language Assessment Battery. Shin, Sang-Keun. 1, 25–37

Investigating the Construct Validity of the Cloze Section in the Examination for the Certificate of Proficiency in English. Saito, Yoko. 1, 39–82

An Investigation into Answer-Changing Practices on Multiple-Choice Questions with Gulf Arab Learners in an EFL Context. Al-Hamly, Mashael, & Coombe, Christine. 1, 83–104

A Construct Validation Study of the Extended Listening Sections of the ECPE and MELAB. Wagner, Elvis. 2, 1–25

Evaluating the Dimensionality of the Michigan English Language Assessment Battery. Jiao, Hong. 2, 27–52

Effects of Language Errors and Importance Attributed to Language on Language and Rhetorical-Level Essay Scoring. Weltig, Matthew S. 2, 53–81

Investigating Language Performance on the Graph Description Task in a Semi-Direct Oral Test. Xi, Xiaoming. 2, 83–134

Switching Constructs: On the Selection of an Appropriate Blueprint for Academic Literacy Assessment. Van Dyk, Tobie. 2, 135–155

Language Learning Strategy Use and Language Performance on the MELAB. Song, Xiaomei. 3, 1–26

An Empirical Investigation into the Nature of and Factors Affecting Test Takers' Calibration within the Context of an English Placement Test (EPT). Phakiti, Aek. 3, 27–71

A Validation Study of the ECCE NNS and NS Examiners' Conversation Styles from a Discourse Analytic Perspective. Lu, Yang. 3, 73–99

An Investigation of Lexical Profiles in Performance on EAP Speaking Tasks. Iwashita, Noriko. 3, 101–111

A Summary of Construct Validation of an English for Academic Purposes Placement Test. Lee, Young-Ju. 3, 113–131

Toward a Cognitive Processing Model of the MELAB Reading Test Item Performance. Gao, Lingyun. 4, 1–39

Validation and Invariance of Factor Structure of the ECPE and MELAB across Gender. Wang, Shudong. 4, 41–56

Evaluating the Use of Rating Scales in a High-Stakes Japanese University Entrance Examination. Weaver, Christopher. 4, 57–79

Detecting DIF across Different Language and Gender Groups in the MELAB using the Logistic Regression Method. Park, Taejoon. 4, 81–96

Bias Revisited. Hamp-Lyons, Liz, & Davies, Alan. 4, 97–108

Do Empirically Developed Rating Scales Function Differently to Conventional Rating Scales for Academic Writing? Knoch, Ute. 5, 1–36


Investigating the Construct Validity of the Grammar and Vocabulary Section and the Listening Section of the ECCE: Lexico-Grammatical Ability as a Predictor of L2 Listening Ability. Liao, Yen-Fen. 5, 37–78

Lexical Diversity in MELAB Writing and Speaking Task Performances. Yu, Guoxing. 5, 79–116

An Investigation of the Item Parameter Drift in the Examination for the Certificate of Proficiency in English (ECPE). Li, Xin. 6, 1–28

Investigating the Invariance of the ECPE Factor Structure across Different Proficiency Levels. Römhild, Anja. 6, 29–55

Investigating Proficiency Classification for the Examination for the Certificate of Proficiency in English (ECPE). Zhang, Bo. 6, 57–75

Underlying Factors of MELAB Listening Constructs. Eom, Minhee. 6, 77–94

Examining the Construct Validity of a Web-Based Academic Listening Test: An Investigation of the Effects of Response Formats. Shin, Sunyoung. 6, 95–129

Investigating the Construct Validity of a Performance Test Designed to Measure Grammatical and Pragmatic Knowledge. Grabowski, Kirby. 6, 131–179

Ratings of L2 Oral Performance in English: Relative Impact of Rater Characteristics and Acoustic Measures of Accentedness. Kang, Okim. 6, 181–205

Conflicting Genre Expectations in a High-Stakes Writing Test for Teacher Certification in Quebec. Baker, Beverly A. 7, 1–20

Collaborating with ESP Stakeholders in Rating Scale Validation: The Case of the ICAO Rating Scale. Knoch, Ute. 7, 21–46

Investigating Source Use, Discourse Features, and Process in Integrated Writing Tests. Gebril, Atta, & Plakans, Lia. 7, 47–84

Investigating Different Item Response Models in Equating the Examination for the Certificate of Proficiency in English (ECPE). Song, Tian. 7, 85–98

Decision Making in Marking Open-Ended Listening Test Items: The Case of the OET. Harding, Luke, & Ryan, Kerry. 7, 99–113


Spaan Fellow Working Papers in Second or Foreign Language Assessment, Volume 8: 1–30
Copyright © 2010
English Language Institute, University of Michigan
www.lsa.umich.edu/eli/research/spaan

Investigating the Construct Validity of a Speaking Performance Test

Hyun Jung Kim
Teachers College, Columbia University

ABSTRACT

With the increased demand for the integration of a performance component in second language (L2) testing, speaking performance assessments have focused on eliciting examinees' underlying language ability through their actual oral performance on a given task. Given the nature of performance assessments, many factors other than examinees' speaking ability are necessarily involved in the process of evaluation. Compared to the construct definition of speaking ability, however, relatively little attention has been given to tasks, which are regarded as a vehicle for assessment, despite growing interest in authentic tasks for eliciting real-world language samples for evaluation. Thus, the present study investigates whether a speaking placement test provides empirical evidence that the effect of task, as well as examinees' attributes, should be considered in describing speaking ability in a performance assessment. An understanding of the underlying structure of the speaking placement test not only helps to identify the factors involved in the evaluation process and their relationships, but also ultimately makes it possible to appropriately infer examinees' speaking ability.

In L2 testing, the notion of performance first emerged in the 1960s in response to practical needs, and since then, the demand to integrate examinees' actual performance into L2 assessment has increased (McNamara, 1996). Early testers who advocated the integration of a performance component focused on whether examinees could successfully fulfill a task in a simulated real-life language use context (e.g., Clark, 1975; Jones, 1985; Morrow, 1979; Savignon, 1972). McNamara (1996) classified this approach as a strong sense of performance assessment, in which the definition of the L2 ability construct is limited to examinees' task completion.

In contrast, new theories of communicative competence and communicative language ability in the 1980s and 1990s (e.g., Bachman, 1990; Bachman & Palmer, 1996; Canale, 1983; Canale & Swain, 1980) changed not only the perception of L2 language ability, but also the role of performance in language testing. They supported a weak sense of performance assessment (McNamara, 1996), in which the main interest was examinees' language ability rather than task completion. That is, L2 ability was determined based on various language components derived from the theoretical models of communicative competence and communicative language ability. Examinees' actual performance was elicited for the evaluation of language ability; however, the role of performance was limited to a vehicle


to elicit examinees' underlying language ability. This approach to performance assessment, called a construct-centered approach (Bachman, 2002), has been widely accepted by L2 testers for most general purpose language performance assessments (e.g., Brindley, 1994; Fulcher, 2003; Luoma, 2004; McNamara, 1996; Messick, 1994; Skehan, 1998).

While the construct-centered approach to performance assessment gives priority to definitions of L2 ability, a different perspective has recently been proposed. A task-centered approach focuses on what examinees can do with the language; that is, whether they can fulfill a given task (Brown, Hudson, Norris, & Bonk, 2002; Norris, Brown, Hudson, & Yoshioka, 1998). Although this approach provides more systematic criteria for the evaluation of examinees' task fulfillment than the approach of the early testers who first argued for the integration of performance in language testing, it basically shares the early testers' view about what performance assessments aim to measure (i.e., the strong version of performance assessment). According to the task-centered approach, test contexts or tasks play a crucial role in measuring L2 ability because examinees' performance is evaluated based on real-world criteria.

The two approaches to performance assessment appear to be contradictory in nature. Chapelle (1998), however, argued from an interactionalist perspective that both construct definitions and tasks should be considered together in defining L2 ability because the two interact during communication. As reviewed, different perspectives on L2 performance assessment have defined language ability distinctively, each with a different focus. What is important is not which approach is superior, but whether a test is validated before examinees' language ability is inferred from the test results. In other words, before an inference regarding an examinee's language ability is made from test scores, test developers and users need to be clear about what the test aims to measure (e.g., various language components, performance on tasks) and whether the test actually measures what it intends to measure.

Although a test is designed for its intended purpose (e.g., following construct definitions, task characteristics, or both), there are still many factors that need to be considered in L2 performance assessments to understand examinees' performance and define their language ability. Examinees' performance may be affected by factors other than their language ability (McNamara, 1996, 1997). McNamara (1995) elaborated a schematic representation (Figure 1), which Kenyon (1992) first presented, to conceptualize the performance dimension of L2 speaking performance tests. As presented in the figure, examinees' performance in L2 speaking tests is affected by many factors in the testing phase (i.e., candidates, tasks, interlocutors, and their interactions) as well as in the rating phase (i.e., raters and rating scales). Empirical studies have identified the factors that affect speaking performance test scores as effects of: (1) the candidate (Lumley & O'Sullivan, 2005; O'Loughlin, 2002); (2) the task (Chalhoub-Deville, 1995; Clark, 1988; Elder, Iwashita, & McNamara, 2002; Farris, 1995; Malabonga, Kenyon, & Carpenter, 2005; Shohamy, 1994; Wigglesworth, 1997); (3) the interlocutor (Brown, 2003; O'Sullivan, 2002); (4) the rater (Barnwell, 1989; Bonk & Ockey, 2003; Brown, 1995; Eckes, 2005; Elder, 1993; Y. Kim, 2009; Lumley, 1998; Lumley & McNamara, 1995; Lynch & McNamara, 1998; Meiron & Schick, 2000; Orr, 2002; Wigglesworth, 1993); and (5) the scale/criteria (M. Kim, 2001). It might be impossible to completely eliminate the effects of these factors on examinees' speaking performance. However, it is important to understand the relative contributions of these factors to examinees' performance and test scores in order to better estimate examinees' speaking ability and more appropriately interpret and use the test results.


[Figure 1, not reproduced here, diagrams the elements of a speaking performance assessment: candidate, task, interlocutor (including another candidate), performance, rater, scale/criteria, and score.]

Figure 1. Interactions in Performance Assessment of Speaking Skills (McNamara, 1995, p. 173)

To sum up, examinees' speaking ability can be inferred only after a test is validated with respect to its constructs and the other factors involved in the process of evaluation. The focus of previous studies, however, has often been limited to the effects of individual factors on examinees' test performance. In other words, speaking performance tests have not been examined within a broader framework in which various factors (e.g., examinees' language ability, tasks, and rating criteria) interact with one another. Moreover, performance tests, especially those which do not involve high stakes, are oftentimes used without such validation. To this end, the current study seeks to explore the nature of a speaking placement test that has been used locally in a community English program. In order to determine whether the speaking test accurately measures speaking ability as intended, the underlying structure of the test is investigated in the present study. In other words, the question of whether the hypothesized components of speaking ability (reflected in the scoring rubric) actually function as the operationalized constructs of the test is examined. In addition, to better explain how the test works, the effects of other variables, such as rater perceptions and task characteristics, are also investigated. That is, factors that can have an effect on speaking performance are considered in addition to issues regarding construct definition.

Research Questions

The current study addresses the following three research questions:
(1) What is the factorial structure of the speaking test?
(2) To what extent does the speaking test measure the intended hypothesized constructs of speaking ability?
(3) In addition to the measured variables, to what extent do other factors (i.e., raters and tasks) contribute to examinees' speaking performance?


Method

Context of the Current Study

The Community English Program (CEP) is an English as a second language (ESL) program offered by the Teaching English to Speakers of Other Languages (TESOL) and applied linguistics programs at Teachers College. The program targets adult ESL learners who wish to improve their communicative language ability. Therefore, the CEP curriculum emphasizes not only the various language components (grammar, vocabulary, and pronunciation) but also the different language skills (listening, speaking, reading, and writing). To facilitate effective teaching and learning, all new students of the program are placed into one of 12 proficiency levels based on results of a placement test, which consists of five sections (i.e., listening, grammar, reading, writing, and speaking).

A majority of the CEP teachers are MA students of the TESOL and applied linguistics programs. That is, they are student teachers practicing ESL classroom teaching. Therefore, their classrooms are regularly observed by faculty and colleagues, and follow-up feedback sessions are provided throughout the semester. The teachers also serve as raters of the writing and speaking placement tests. From the rating experience, they not only become familiar with the CEP students' writing and speaking ability levels, but they also gain hands-on experience in evaluating ESL learners' writing and speaking ability. Therefore, the CEP functions as a teacher education program as well as an adult ESL program.

Participants

Participants in the current study consisted of 215 incoming CEP students who took the CEP speaking placement test. The majority of students in the program were adult immigrants from the surrounding neighborhood or were family members of international students in the Columbia University community. The number of female students (73%) far exceeded that of male students (27%). In terms of first language, three languages accounted for a large share of the participants: Japanese (36%), Korean (19%), and Spanish (15%). With regard to length of residence, the vast majority of the participants reported that they had been in English-speaking countries, including the United States, for fewer than three years: "less than 6 months" (40%), "6 months to 1 year" (19%), and "1 to 3 years" (20%). In terms of their motivation for studying English, many participants reported academic and job-related reasons, while over 50 percent gave priority to communication with friends as their reason for improving their English.

Instruments

The instruments used in the current study included the CEP placement speaking test and an analytic scoring rubric. The speaking test was designed to measure speaking ability under various real-life language use situations. The test had six tasks: complaining about a catering service (Task 1), talking about a favorite movie (Task 2), narrating a story based on a sequence of pictures (Task 3), refusing a request from a landlord (Task 4), summarizing a radio commentary (Task 5), and summarizing a lecture (Task 6). The first three tasks (i.e., Tasks 1, 2, and 3) were the independent-skills tasks, which required examinees to draw on their background knowledge to perform the tasks. On the other hand, the last three tasks (i.e., Tasks 4, 5, and 6) were the integrated-skills tasks, which required examinees to use their listening skills in the performance of the tasks. That is, examinees were asked to listen to long


or short passages, which were provided as part of the tasks, and then formulate responses based on the content of the passages. The speaking test was a semi-direct, computer-delivered test. That is, there was no interaction between an examinee and an interlocutor. Instead, the examinees listened to the pre-recorded instructions and prompts delivered by a computer and then were asked to record their responses. The six tasks and the test format for each task (e.g., preparation time, response time) are found in Appendix A.

An analytic scoring rubric consisting of five rating scales (see Appendix B) was used to score the examinees' recorded oral responses. The five scales were meaningfulness, grammatical competence, discourse competence, task completion, and intelligibility. Each of the five rating scales was rated on a six-point scale (0 for "no control" to 5 for "excellent control"). To analyze each scale in relation to the different tasks in this study, the five scales for each of the six tasks were regarded as individual items, making a total of 30 items (6 tasks x 5 rating scales) on the test. That is, each cell in Table 1 represents an individual item of the test. For instance, the item "MeanT1" represents meaningfulness for Task 1, while the item "MeanT2" refers to meaningfulness for Task 2.

Table 1. Taxonomy of Items (Task x Rating Scale) on Speaking Ability

Rating scale (6 items each)    Task 1     Task 2     Task 3     Task 4     Task 5     Task 6
Meaningfulness                 MeanT1     MeanT2     MeanT3     MeanT4     MeanT5     MeanT6
Grammatical competence         GramT1     GramT2     GramT3     GramT4     GramT5     GramT6
Discourse competence           DiscT1     DiscT2     DiscT3     DiscT4     DiscT5     DiscT6
Task completion                TaskT1     TaskT2     TaskT3     TaskT4     TaskT5     TaskT6
Intelligibility                IntelT1    IntelT2    IntelT3    IntelT4    IntelT5    IntelT6
Total number of items: 30
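As a concrete illustration of this 6 x 5 item taxonomy, the short Python sketch below builds the 30 item labels and averages two raters' analytic scores item by item, mirroring how the final item scores were derived. The data-frame layout and the helper name are assumptions for illustration, not the program's actual scoring code.

import pandas as pd

SCALES = ["Mean", "Gram", "Disc", "Task", "Intel"]    # rating-scale prefixes from Table 1
TASKS = range(1, 7)                                   # Tasks 1-6
ITEMS = [f"{scale}T{task}" for scale in SCALES for task in TASKS]   # "MeanT1" ... "IntelT6"

def average_ratings(rater1: pd.DataFrame, rater2: pd.DataFrame) -> pd.DataFrame:
    """Average two raters' 0-5 analytic scores item by item.

    Both frames are assumed (hypothetically, for illustration) to hold one row
    per examinee and one column per item label in ITEMS.
    """
    return (rater1[ITEMS] + rater2[ITEMS]) / 2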

Procedures

Test Administration

The speaking test was administered in a computer lab on the second day of a two-day placement test administration. The test was administered to groups of approximately 40 students. Each student was seated in front of a computer. They listened to the test instructions on a headset, read the instructions on the computer screen, and recorded their responses to the test items using a microphone. Since all computers were controlled from a central console, the examinees kept the same pace while taking the test. That is, the instructions and prompts were delivered at the same time, and the preparation and response times were also provided to all examinees at the same time.

Before the actual test began, the examinees were asked to fill in a background survey which asked for demographic information, prior English-learning experience, and plans for


future study. Once all examinees of a group completed the survey, they were given a practice task so that they would be familiar with the test format. After a short intermission for any questions about the test format, the six tasks were played in sequence. For each task, the examinees first listened to or looked at an instruction and a prompt. They were allowed to prepare responses during a short preparation time, and lastly they recorded their responses during the given response time.

Scoring

Each examinee's performance was scored by two independent raters. The raters were the CEP teachers, most of whom were MA or EdD students in the TESOL and applied linguistics programs at Teachers College. Prior to the actual rating, the raters attended a norming session in which the test tasks and the rubric were introduced and sample responses were provided for practice. Time was also given for discussion of analytic scores so that the raters had opportunities to monitor their decision-making processes by comparing the rationale behind their scores with other raters' opinions. Rating practice and discussion continued until the raters felt that they were well aware of the tasks and confident with assigning scores on the different rating scales. Following the norming session, each rater was assigned a certain number of examinees. Since examinees' performance on each of the six tasks was scored on the five rating scales, each examinee was given 30 analytic ratings on 30 items. The maximum score for each item was five and the minimum was zero. The scores assigned by the two independent raters were later averaged to determine a speaking score for the placement test.

Analyses

The data were analyzed using SPSS version 12.0 (SPSS Inc., 2001) and EQS version 6.1 (Bentler & Wu, 2005). Descriptive statistics (i.e., means, standard deviations, maximum/minimum raw scores, and skewness and kurtosis values) were calculated for the entire test, for the 30 individual items, and for each of the five rating scales across the six tasks separately using SPSS to verify central tendency and variability. Reliability estimates were then calculated based on Cronbach's alpha to examine the degree of relatedness among the 30 items and the six items under each of the five rating scales. Also, the degree of agreement between the two raters (i.e., inter-rater reliability) was investigated from various perspectives, such as from the examinees' total score, across the six tasks, and across the five rating scales. Since composite scores comprised interval data that were converted from the original ordinal data, inter-rater reliability was estimated based on Pearson Product-Moment correlations.

After calculating descriptive statistics and reliability estimates, exploratory factor analyses (EFA) were conducted to determine the extent to which the 30 items clustered together. In other words, factor analyses were used to examine what patterns of correlations would be observed among the 30 items. Based on the correlation matrix, initial factors were extracted by principal-axis factoring (PAF) after the appropriateness of the use of a correlation matrix for factor analysis was verified using three calculations: (1) Bartlett's test of sphericity; (2) the Kaiser-Meyer-Olkin (KMO) measure; and (3) the determinant of the correlation matrix. The initial factors were then rotated until the best solution was found to determine the number of underlying factors. Since it had been assumed that the factors were correlated with


one another, a direct oblimin rotation procedure was used after checking the factor correlation matrices each time.

Finally, confirmatory factor analyses (CFA) were performed to establish a model of the speaking test. CFA was used to determine the extent to which the 30 items were measured in relation to the six tasks and five scoring criteria. Based on a review of the literature, a second-order Multitrait-Multimethod (MTMM) model was first hypothesized. After failing to find an appropriate solution with the hypothesized model, several other CFA models were attempted to find a final model that best explained the data. To assess the adequacy of the models, including the hypothesized model, several fit indices were used, such as the Chi-square statistic, the Chi-square/df ratio, the comparative fit index (CFI), and the root mean-square error of approximation (RMSEA). In addition, the distribution of standardized residuals was checked. The results of the Lagrange Multiplier test and the Wald test were analyzed for each run in order to identify any necessary and unnecessary parameters in a model. In the end, however, a final speaking test model was chosen in accordance with substantive considerations while taking into account the issue of parsimony. In the process of model evaluation, the ML Robust method was used each time due to multivariate non-normality of the data.
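The EFA steps described above (matrix suitability checks, principal-axis extraction, and an oblique rotation) can be illustrated with the following Python sketch. It is a minimal illustration, not the study's SPSS analysis: it assumes the third-party factor_analyzer package and a hypothetical data frame `scores` holding the 30 item columns, and it uses that package's "principal" extraction as an approximation of principal-axis factoring.

import numpy as np
import pandas as pd
from factor_analyzer import (FactorAnalyzer,
                             calculate_bartlett_sphericity,
                             calculate_kmo)

def run_efa(scores: pd.DataFrame, n_factors: int = 3) -> pd.DataFrame:
    # Suitability of the correlation matrix for factoring.
    chi2, p = calculate_bartlett_sphericity(scores)     # Bartlett's test of sphericity
    _, kmo_total = calculate_kmo(scores)                # Kaiser-Meyer-Olkin measure
    det = np.linalg.det(scores.corr().to_numpy())       # determinant should be positive
    print(f"Bartlett chi2 = {chi2:.1f} (p = {p:.3f}), KMO = {kmo_total:.2f}, |R| = {det:.2e}")

    # Principal-factor extraction with a direct oblimin (oblique) rotation,
    # mirroring the PAF + oblimin procedure described in the text.
    fa = FactorAnalyzer(n_factors=n_factors, method="principal", rotation="oblimin")
    fa.fit(scores)

    # Return the rotated pattern matrix (items x factors).
    return pd.DataFrame(fa.loadings_,
                        index=scores.columns,
                        columns=[f"Factor{i + 1}" for i in range(n_factors)])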

Results

Descriptive Statistics

The descriptive statistics, which were calculated at the item level, the rating-scale level, and for the entire 30-item test, are presented in Table 2. The item-level means ranged from 2.64 to 3.41 and the standard deviations from 1.01 to 1.57. Although not very different, the means of the grammar-related items (i.e., GramT1 to GramT6) were lower than those for the other groups of items. On the other hand, the task completion-related items (i.e., TaskT1 to TaskT6) showed relatively higher means compared to the other items. Grammar-related items had the least variability (average Std. = 1.04) while task completion-related items had the largest variability (average Std. = 1.23). With regard to the task-related aspect, the Task 6 items (i.e., MeanT6, GramT6, DiscT6, TaskT6, and IntelT6) had the lowest means under each rating scale. However, their standard deviations were the greatest compared to those for the other task items under the same rating scale. The skewness and kurtosis values, within the acceptable range, indicated that all 30 items and five rating scales appeared to be normally distributed.

Reliability Analyses

The reliability estimates for internal consistency were calculated for the five rating scales and for the entire test (see Table 3). The reliability estimate for the entire test was very high (0.991), signifying a high degree of homogeneity among the 30 items. Internal consistency reliability for each rating scale also showed a high degree of consistency among the six tasks under each of the five scales. The high reliability estimates, ranging from 0.936 to 0.963, suggested that the six tasks measured the same construct with a high degree of consistency within each rating scale.


Table 2. Descriptive Statistics (N = 215, K = 30)

Variable                       Minimum   Maximum   Mean   Std.   Skewness   Kurtosis
1. Meaningfulness (Mean)       0         5.00      3.10   1.14   -.87       .30
   MeanT1                      0         5.00      3.12   1.28   -.81       .24
   MeanT2                      0         5.00      3.12   1.18   -.93       .65
   MeanT3                      0         5.00      3.20   1.11   -.82       .60
   MeanT4                      0         5.00      3.10   1.30   -.84       -.07
   MeanT5                      0         5.00      3.20   1.24   -.91       .27
   MeanT6                      0         5.00      2.85   1.38   -.66       -.45
2. Grammar (Gram)              0         4.58      2.87   1.04   -.95       .52
   GramT1                      0         5.00      2.83   1.15   -.84       .46
   GramT2                      0         4.50      2.87   1.04   -1.11      1.16
   GramT3                      0         4.50      2.92   1.01   -.92       .83
   GramT4                      0         5.00      2.93   1.20   -.97       .35
   GramT5                      0         5.00      2.93   1.12   -.92       .46
   GramT6                      0         4.50      2.73   1.27   -.79       -.29
3. Discourse Competence (Disc) 0         4.50      2.86   1.08   -.91       .33
   DiscT1                      0         5.00      2.83   1.21   -.78       .17
   DiscT2                      0         5.00      2.84   1.09   -.91       .59
   DiscT3                      0         5.00      2.95   1.05   -.83       .75
   DiscT4                      0         5.00      2.93   1.26   -.85       -.04
   DiscT5                      0         5.00      2.97   1.18   -.84       .17
   DiscT6                      0         5.00      2.64   1.32   -.63       -.42
4. Task Completion (Task)      0         5.00      3.18   1.23   -.86       .08
   TaskT1                      0         5.00      3.07   1.41   -.52       -.44
   TaskT2                      0         5.00      3.36   1.33   -1.00      .32
   TaskT3                      0         5.00      3.41   1.23   -.92       .42
   TaskT4                      0         5.00      3.04   1.57   -.56       -1.01
   TaskT5                      0         5.00      3.37   1.41   -.82       -.25
   TaskT6                      0         5.00      2.86   1.48   -.52       -.73
5. Intelligibility (Intel)     0         4.92      3.02   1.09   -.92       .53
   IntelT1                     0         5.00      2.99   1.20   -.89       .50
   IntelT2                     0         5.00      3.00   1.15   -.96       .73
   IntelT3                     0         5.00      3.09   1.05   -.93       .86
   IntelT4                     0         5.00      3.07   1.25   -.90       .20
   IntelT5                     0         5.00      3.10   1.16   -.91       .67
   IntelT6                     0         5.00      2.89   1.30   -.75       -.16
Total (30 items)               0         4.73      3.01   1.10   -.94       .43


Table 3. Reliability Estimates (N=215)

Construct                  Items Used           Nr of Items   Reliability Estimate
Meaningfulness             MeanT1 - MeanT6      6             0.960
Grammatical Competence     GramT1 - GramT6      6             0.963
Discourse Competence       DiscT1 - DiscT6      6             0.958
Task Completion            TaskT1 - TaskT6      6             0.936
Intelligibility            IntelT1 - IntelT6    6             0.963
Total                      all 30 items         30            0.991

Although the scores averaged across the two raters were used for the statistical analyses, inter-rater reliability was calculated to determine the degree of agreement between the two raters. The correlation between Rater 1 and Rater 2 was 0.837 for examinees' total scores (see Table 4), 0.71 to 0.80 across the six tasks (see Table 5), and 0.78 to 0.82 across the five rating scales (see Table 6). All correlations were significant at the alpha = 0.01 level, indicating that the first rater's score on each task, each rating scale, and the entire test correlated significantly with the second rater's score on the same task, rating scale, and entire test. As a result, it can be assumed that the two raters scored the examinees' speaking with similar criteria in mind.
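For readers who want to reproduce these two kinds of reliability estimates on comparable data, the following Python sketch shows the standard computations. It is an assumed illustration (hypothetical helper names and data layout), not the SPSS procedure used in the study.

import pandas as pd
from scipy.stats import pearsonr

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency: alpha = k/(k-1) * (1 - sum(item variances) / variance(total))."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def interrater_r(rater1_totals: pd.Series, rater2_totals: pd.Series) -> float:
    """Inter-rater agreement as a Pearson correlation between two raters' scores."""
    r, _p = pearsonr(rater1_totals, rater2_totals)
    return r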

Table 4. Inter-rater Reliability for the Entire Speaking Test (N = 215)

                     Rater 1 (TotR1)   Rater 2 (TotR2)
Rater 1 (TotR1)      1.00              0.837**
Rater 2 (TotR2)      0.837**           1.00

**p < 0.01 (2-tailed); R1 = Rater 1, R2 = Rater 2

Table 5. Inter-rater Reliability across the Six Tasks (N = 215)

Task      Rater 1 x Rater 2 correlation
Task 1    0.80**
Task 2    0.75**
Task 3    0.71**
Task 4    0.81**
Task 5    0.80**
Task 6    0.80**

**p < 0.01 (2-tailed)


Table 6. Inter-rater Reliability across the Five Constructs (N = 215)

Construct                  Rater 1 x Rater 2 correlation
Meaningfulness             0.78**
Grammatical Competence     0.82**
Discourse Competence       0.80**
Task Completion            0.80**
Intelligibility            0.81**

**p < 0.01 (2-tailed)

Results of Exploratory Factor Analysis

Once the appropriateness of the use of a correlation matrix for factor analysis was verified (e.g., a significant Bartlett's test and a positive determinant of the correlation matrix), an EFA was conducted as a preliminary step for a CFA in order to develop a factor structure for the 30 observed variables. The initial factor extraction showed a very different result from the hypothesized design of speaking ability, which assumed five underlying factors (i.e., the five rating scales). Two factors with eigenvalues greater than 1.0 were extracted, which accounted for 83.7 percent of the variance. Variable communalities were all above 0.7, indicating that the variances of the variables accounted for by the common factors were very high. The scree plot also suggested the extraction of two factors. Since the number of factors obtained from the initial extraction was quite different from the hypothesis set for the speaking test, solutions with different numbers of factors were compared. The three-factor oblique rotation was the best solution to achieve maximum parsimony (see Table 7). As observed in Table 7, the 30 items used to measure speaking ability clustered around the type of task. For instance, items for Tasks 1, 2, and 3 loaded on Factor 1, items for Task 6 loaded on Factor 2, and items for Tasks 4 and 5 loaded on Factor 3. To illustrate, all five items for Task 6 (i.e., MeanT6, GramT6, DiscT6, TaskT6, and IntelT6) showed factor loadings above 0.3 on Factor 2.

Further analysis of the six tasks revealed a possible reason why the items clustered around task-type factors rather than around the rating scales. Since Tasks 1, 2, and 3 required examinees to speak with minimal input, the factor on which the items for these three tasks loaded was interpreted as a "Speak" factor. Contrary to Tasks 1, 2, and 3, Tasks 4 and 5 first required examinees to listen to a long message and then respond to or summarize it. Thus, Factor 3, which included the items for Tasks 4 and 5, was coded as a "Listen and Speak" factor. While Task 6 was a summary task (as was Task 5), it appeared that Task 6 required examinees to have topical knowledge in the process of listening to and summarizing a message. That is, examinees' familiarity with the topic of the task could help them approach the task easily. Whereas Task 5 was about a topic (an electric car) that might be more commonly discussed in everyday life, the listening prompt provided in Task 6 was a lecture with highly specialized content (the Barbizon School). Thus, Factor 2 was coded as "Listen and


Speak with Topical Knowledge." In sum, the items did not cluster around the operationalized constructs of speaking ability (i.e., the rating scales), showing that examinees' speaking performance was better explained by task type than by the hypothesized five constructs of speaking ability. As a result, the two cross-loadings present (i.e., IntelT3 and GramT5) were not seen as problematic, since grammar and intelligibility could be involved in any task as long as factors were divided based on task type. The final three-factor solution is presented in Table 8.

Table 7. Pattern Matrix for Speaking Ability

Item       Factor 1   Factor 2   Factor 3
DiscT2     1.015      .054       .171
GramT2     .944       .142       .163
TaskT2     .890       .017       .014
GramT1     .874       .027       -.037
IntelT1    .873       -.019      -.054
DiscT1     .856       -.006      -.061
IntelT2    .852       .139       .069
MeanT2     .847       .135       .037
MeanT1     .837       -.052      -.138
TaskT3     .795       -.090      -.138
GramT3     .764       -.020      -.207
DiscT3     .720       -.030      -.236
MeanT3     .702       .020       -.212
TaskT1     .670       .028       -.190
IntelT3    .585       .015       -.337
MeanT6     .011       .962       -.004
TaskT6     -.044      .933       -.063
DiscT6     .056       .918       -.014
GramT6     .094       .847       -.053
IntelT6    .033       .804       -.143
MeanT4     .076       -.004      -.897
GramT4     .108       .058       -.809
TaskT4     -.057      .125       -.791
IntelT4    .090       .120       -.769
DiscT4     .135       .100       -.741
TaskT5     .066       .253       -.634
IntelT5    .189       .205       -.584
MeanT5     .224       .182       -.573
DiscT5     .288       .133       -.564
GramT5     .319       .179       -.484

Note. Extraction method: principal axis factoring. Rotation method: oblimin with Kaiser normalization. Rotation converged in 13 iterations.
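The grouping of items by their dominant loadings, as described above, can be made explicit with a small helper. This is an illustrative assumption (a hypothetical function operating on a pattern matrix like Table 7), not part of the original analysis.

import pandas as pd

def group_items(loadings: pd.DataFrame, threshold: float = 0.3) -> pd.DataFrame:
    """For each item, report the factor with the largest absolute loading and
    flag cross-loading items (absolute loadings above `threshold` on more than one factor)."""
    dominant = loadings.abs().idxmax(axis=1)
    salient_count = (loadings.abs() > threshold).sum(axis=1)
    return pd.DataFrame({"dominant_factor": dominant,
                         "cross_loading": salient_count > 1})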


Table 8. Revised Taxonomy of Speaking Ability (Based on Exploratory Factor Analysis)

Factor                                    Tasks       Nr of Items   Items
Speak                                     Tasks 1-3   15            MeanT1, GramT1, DiscT1, TaskT1, IntelT1;
                                                                    MeanT2, GramT2, DiscT2, TaskT2, IntelT2;
                                                                    MeanT3, GramT3, DiscT3, TaskT3, IntelT3
Listen & Speak with Topical Knowledge     Task 6      5             MeanT6, GramT6, DiscT6, TaskT6, IntelT6
Listen & Speak                            Tasks 4-5   10            MeanT4, GramT4, DiscT4, TaskT4, IntelT4;
                                                                    MeanT5, GramT5, DiscT5, TaskT5, IntelT5
Total                                                 30

Results of Confirmatory Factor Analysis

Bachman (2002) argued that a language test should be designed taking task characteristics into account as well as the construct definition of language ability in order to achieve the intended purpose of the test. In an attempt to understand the speaking test of the current study in terms of both aspects (i.e., construct definition and task characteristics), a first MTMM model was hypothesized in which the 24 items loaded on both trait factors (i.e., the four rating scales) and method factors (i.e., the six tasks), while the four trait factors loaded on a second-order factor, speaking ability (see Figure 2). The rating scale of task completion was not included as a trait factor in the model since it was considered redundant in relation to the other rating scales. As a result, the six items related to task completion (i.e., TaskT1, TaskT2, TaskT3, TaskT4, TaskT5, and TaskT6) were deleted for the analysis, making a total of 24 observed variables. Moreover, correlations among the six tasks were not established in the first model because the six different tasks were hypothesized to elicit different aspects of speaking ability.

In order to respecify the first model, several attempts were made. First, it was tested whether the four first-order factors (i.e., the four trait factors) would load on the second-order factor (i.e., speaking ability) without any method factors (see Figure 3). The data did not fit the model, which showed problems similar to those of the first model (e.g., condition codes and factor loadings above 1.0). Moreover, the model showed a very poor fit, with a CFI of 0.715 and an RMSEA of 0.162. The results confirmed the need to consider both construct (i.e., rating scales) and task in interpreting test scores, since the model without the task factors did not represent the data. In addition, based on the results of this model, it was decided that the four trait factors should be correlated instead of loading on a second-order factor. The model-fit evaluation of the hypothesized model indicated an excellent fit, with a very high CFI (0.99) and a very low RMSEA (0.032, with a confidence interval of [0.015, 0.044]). In terms of fit indices, the model was ideal, since a CFI above 0.95 and an RMSEA below 0.05 are considered indications of a well-fitting model (Byrne, 2006). However, the test results were not reliable due to a condition code for a variance of a factor error (Parameter: D2, D2), which caused an improper solution (e.g., the greater-than-1.0 factor loading for Grammatical


Competence). Such a condition code, which is a common occurrence with MTMM data, might have occurred due to the complexity of model specification (Byrne, 2006). Thus, the initially hypothesized model was rejected.

Figure 2. The Hypothesized Second-Order MTMM Model of the CEP Speaking Placement Test
(Mean = Meaningfulness, Gram = Grammatical Competence, Disc = Discourse Competence, Intel = Intelligibility; T1-T6 = Task 1-Task 6)
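For readers who want to explore a comparable correlated-traits MTMM specification outside EQS, the sketch below writes the measurement model in lavaan-style syntax and fits it with the Python semopy package. This is an assumption-laden illustration, not the study's EQS setup: the item and factor names follow Table 1, and details such as robust estimation, the zero covariances between trait and method factors, and the second-order speaking-ability factor are omitted or simplified here.

import pandas as pd
import semopy

MTMM_DESC = """
Mean  =~ MeanT1 + MeanT2 + MeanT3 + MeanT4 + MeanT5 + MeanT6
Gram  =~ GramT1 + GramT2 + GramT3 + GramT4 + GramT5 + GramT6
Disc  =~ DiscT1 + DiscT2 + DiscT3 + DiscT4 + DiscT5 + DiscT6
Intel =~ IntelT1 + IntelT2 + IntelT3 + IntelT4 + IntelT5 + IntelT6
T1 =~ MeanT1 + GramT1 + DiscT1 + IntelT1
T2 =~ MeanT2 + GramT2 + DiscT2 + IntelT2
T3 =~ MeanT3 + GramT3 + DiscT3 + IntelT3
T4 =~ MeanT4 + GramT4 + DiscT4 + IntelT4
T5 =~ MeanT5 + GramT5 + DiscT5 + IntelT5
T6 =~ MeanT6 + GramT6 + DiscT6 + IntelT6
"""

def fit_mtmm(scores: pd.DataFrame) -> pd.DataFrame:
    # Fit the measurement model above (trait factors = rating scales,
    # method factors = tasks) and return semopy's table of fit statistics.
    model = semopy.Model(MTMM_DESC)
    model.fit(scores)                 # maximum likelihood by default
    return semopy.calc_stats(model)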


Figure 3. The Second-Order Model without Method Factors
(Mean = Meaningfulness, Gram = Grammatical Competence, Disc = Discourse Competence, Intel = Intelligibility; T1-T6 = Task 1-Task 6)

Another attempt was made before deciding upon a final model. A model was tested with two additional factors: Rating 1 and Rating 2. The model was run both with and without the correlation between the two ratings. However, both models were unsuccessful, which confirmed that the data were not explained by such models. Therefore, based on an examination of several possible models, the final MTMM model was established with four trait factors, which were correlated with each other, and six method factors (see Figure 4). This final model was obtained after statistically testing two assumptions that had been made in advance. The first assumption, regarding the deletion of the task completion factor, was confirmed, since the inclusion of the task completion factor in the final model lowered the overall fit of the data. To test the other assumption, related to a possible task effect, the final MTMM model was


also tested with correlations among the six method factors. Although the overall fit increased, the improvement was very small. Thus, it was concluded that the six different tasks measured different aspects of speaking ability, so the correlations were not included in the final model. Though all estimates were statistically significant, they are not shown in Figure 4, since they would not be legible given the overabundance of arrows (refer to Table 10 for the estimates). As shown in Figure 4, there were 24 dependent variables (i.e., the 24 observed variables) and 34 independent variables (i.e., 10 factors and 24 error terms). There were also 78 free parameters (i.e., 48 factor loadings, 6 factor covariances, and 24 error variances) and 34 fixed nonzero parameters (i.e., 10 factor variances and 24 error regression paths). The structure of these factors and variables as specified in the model was tested based on the covariance matrix. Following the summary of the model, model identification was confirmed in the output.

The model was first assessed as a whole. In terms of residuals, the off-diagonal elements were examined, since they play a major role in the Chi-square statistic. The standardized residual values were evenly distributed, and the average off-diagonal absolute standardized residual was quite small, at 0.0156. In addition, the distribution of standardized residuals was symmetric and centered around zero. As a result, very little discrepancy was found between Σ(θ) (the covariance matrix implied by the specified structure of the hypothesized model) and S (the sample covariance matrix of observed variable scores). With regard to the goodness-of-fit statistics, the independence model Chi-square statistic was 5189.140 with 276 degrees of freedom. Although the Chi-square/df ratio was much greater than 2, implying a poor model-data fit, it was discounted due to the Chi-square's sensitivity to sample size. Instead, fit indices were used for further model-fit evaluation (see Table 9).

Table 9. EQS Output – Goodness of Fit Statistics

GOODNESS OF FIT SUMMARY FOR METHOD = ROBUST
ROBUST INDEPENDENCE MODEL CHI-SQUARE = 5189.140 ON 276 DEGREES OF FREEDOM
INDEPENDENCE AIC = 4637.140    INDEPENDENCE CAIC = 3430.844
MODEL AIC = -179.250           MODEL CAIC = -1149.532
SATORRA-BENTLER SCALED CHI-SQUARE = 264.7500 ON 222 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.02603

FIT INDICES
BENTLER-BONETT NORMED FIT INDEX = 0.949
BENTLER-BONETT NON-NORMED FIT INDEX = 0.989
COMPARATIVE FIT INDEX (CFI) = 0.991
BOLLEN'S (IFI) FIT INDEX = 0.991
MCDONALD'S (MFI) FIT INDEX = 0.905
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.030
90% CONFIDENCE INTERVAL OF RMSEA (0.011, 0.043)
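As a cross-check on Table 9 (the definitions below are standard ones, not taken from the paper itself), the reported fit indices follow from the usual formulas, written here in LaTeX with M denoting the hypothesized model, I the independence model, and N the sample size:

\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_{M} - df_{M},\, 0)}{df_{M}\,(N-1)}}, \qquad \mathrm{CFI} = 1 - \frac{\max(\chi^2_{M} - df_{M},\, 0)}{\max(\chi^2_{I} - df_{I},\, 0)}

Substituting the Satorra-Bentler scaled values from Table 9 (chi-square_M = 264.75 on df_M = 222; chi-square_I = 5189.14 on df_I = 276) with N = 215 gives CFI = 1 - 42.75/4913.14, which is approximately 0.991, and RMSEA = sqrt(42.75/(222 x 214)), which is approximately 0.030, matching the reported values.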


Figure 4. The Final MTMM Model
(Mean = Meaningfulness, Gram = Grammatical Competence, Disc = Discourse Competence, Intel = Intelligibility; T1-T6 = Task 1-Task 6; F1-F10 = Factor 1-Factor 10; V2-V31 = Observed Variables 2-31)


As shown in Table 9, the CFI was 0.991 and the RMSEA was 0.030 with a confidence interval of [0.011, 0.043], both of which indicated an excellent fit. The final indicator of overall model fit was the number of iterations. According to the iterative summary in the output, only five iterations were needed to reach convergence, which meant that the data fit the model relatively easily. Thus, the analyses of residuals and fit indices revealed that the current data on the 24 observed variables fit the 10-factor MTMM model well as a whole.

After confirming the good fit of the model as a whole, the fit of individual parameters was also assessed. The statistical significance of parameter estimates was first checked based on the unstandardized estimates. All parameter estimates were statistically significant. Therefore, all parameters could be considered important to the model, and none of the parameters needed to be deleted from the model. Following the unstandardized estimates, a standardized solution was considered (see Table 10).

As shown in Table 10, the trait factor loadings (i.e., F1 to F4), ranging from 0.849 to 0.926, were much higher than the method factor loadings (i.e., F5 to F10), ranging from 0.265 to 0.466. This signified that the four traits (i.e., rating scales) were much stronger indicators than the six tasks, although both needed to be considered. Since the regression coefficients of the errors were quite small, ranging from 0.207 to 0.309, it can be concluded that the contribution of errors to the variables was low and that the variables were mainly explained by the factors. The very high R-squared values, which refer to the proportion of each variable's variance accounted for by its related factors, confirmed that all 24 items were explained fairly well by the model. Moreover, as assumed above, correlations between the trait factors were quite high, at around 0.98. The four factors were all operationalizations of a single construct of speaking ability. However, such extremely high correlations are not ideal for analytic scoring, since they indicate that the four rating scales were almost indistinguishable.

Discussion and Conclusion

The present study examined the underlying structure of the CEP speaking placement test based on a confirmatory factor analysis. The analysis was conducted with four trait factors (i.e., meaningfulness, grammatical competence, discourse competence, and intelligibility) and six method factors (i.e., Tasks 1 to 6). Also, the four trait factors were correlated with one another. Although these four traits were assumed to be related by virtue of being aspects of the same ability, correlations over 0.90 were unexpected. These high correlations may indicate that speaking ability cannot be separated into several analytic aspects, or that the raters failed to understand and differentiate among the analytic scoring criteria. For example, raters may have given similar scores on the four rating scales for each task based on their overall impression rather than going over the different criteria carefully, or they may not have become accustomed to the different criteria because of the short norming period. Further research on raters' rating processes may be required to explain the relationship among these components of speaking ability.


Table 10. EQS Output – Standardized Solution


The final MTMM model explained the current test data very well, as evidenced by the high fit indices. In particular, the four operationalized constructs (i.e., four rating scales) primarily explained the data with higher factor loadings than the six tasks. In other words, examinees’ performance on the test was mainly explained by the four constructs of speaking ability; however, the characteristics of the six tasks had a non-negligible effect on the examinees’ performance. Therefore, the results of the current study empirically supported the interactionalist perspective in which examinees’ speaking ability is determined in terms of both constructs (traits) and task characteristics of the test.

Although the current study contributes to the recent discussion concerning the importance of both construct definitions and test task characteristics in L2 performance assessments, it has a number of limitations. First, due to the limited sample size, it was not possible to include a rating factor as part of the underlying structure of the speaking test, although multiple ratings were available for all examinees' responses. It has been argued that raters are one of the factors that affect examinees' performance (Kenyon, 1992; Linacre, 1989; McNamara, 1995, 1996, 1997). Indeed, previous studies on raters, which analyzed raters' rating behaviors both quantitatively and qualitatively, showed rater effects on performance assessments (e.g., Bonk & Ockey, 2003; Brown, 2005; Chalhoub-Deville, 1995; Eckes, 2005; Meiron & Schick, 2000; Orr, 2002). Therefore, inclusion of a rater/rating factor might change the underlying structure of the speaking test.

The other limitation is that structural equation modeling is a data-specific statistical tool. In other words, the results of the current analyses cannot be generalized to other CEP speaking data that include different participants. Likewise, other data sets might be explained by different factors or different factorial structures. Therefore, in order to generalize the structure of the CEP speaking placement test, repeated analyses of test data with larger sample sizes are required across different test administrations. Only then can the nature of the CEP speaking placement test be understood and, ultimately, can inferences made about examinees' speaking ability be considered reliable.

Acknowledgements

I would like to express my appreciation to the English Language Institute at the University of Michigan for giving me an opportunity to perform this research. I am also very grateful to Professor James Purpura and my colleagues at Teachers College, for their insightful comments and suggestions throughout this study.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Barnwell, D. (1989). Naive native speakers and judgments of oral proficiency in Spanish. Language Testing, 6(2), 152–163.
Bentler, P. M., & Wu, E. (2005). EQS 6.1 for Windows user's guide. Encino, CA: Multivariate Software, Inc.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110.
Brindley, G. (1994). Task-centred assessment in language learning: The promise and the challenge. In N. Bird, P. Falvey, A. Tsui, D. Allison, & A. McNeill (Eds.), Language and learning: Papers presented at the Annual International Language in Education Conference (Hong Kong, 1993) (pp. 73–94). Hong Kong: Hong Kong Education Department.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–15.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.
Brown, A. (2005). Interviewer variability in oral proficiency interviews. Frankfurt, Germany: Peter Lang.
Brown, J. D., Hudson, T., Norris, J. M., & Bonk, W. (2002). An investigation of second language task-based performance assessments. Honolulu: University of Hawaii Press.
Byrne, B. M. (2006). Structural equation modeling with EQS. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12(1), 16–33.
Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70). Cambridge: Cambridge University Press.
Clark, J. L. D. (1975). Theoretical and technical considerations in oral proficiency testing. In R. L. Jones & B. Spolsky (Eds.), Testing language proficiency (pp. 10–28). Arlington, VA: Center for Applied Linguistics.
Clark, J. L. D. (1988). Validation of a tape-mediated ACTFL/ILR-scale based test of Chinese speaking proficiency. Language Testing, 5(2), 187–205.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221.
Elder, C. (1993). How do subject specialists construe classroom language proficiency? Language Testing, 10(3), 235–254.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347–368.
Farris, C. S. (1995). A semiotic analysis of sajiao as a gender marked communication style in Chinese. In M. Johnson & F. Y. L. Chiu (Eds.), Unbound Taiwan: Close-ups from a distance. Selected Papers Vol. 8 (pp. 1–29). Chicago: Center for East Asian Studies, University of Chicago.
Fulcher, G. (2003). Testing second language speaking. London: Longman.
Jones, R. L. (1985). Second language performance testing: An overview. In P. C. Hauptman, R. LeBlanc, & M. B. Wesche (Eds.), Second language performance testing (pp. 15–24). Ottawa: University of Ottawa Press.
Kenyon, D. M. (1992). Introductory remarks at symposium on Development and use of rating scales in language testing, 14th Language Testing Research Colloquium, Vancouver, February 27–March 1.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18(1), 89–114.
Kim, Y. (2009). An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach. Language Testing, 26(2), 187–217.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Lumley, T. (1998). Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency. English for Specific Purposes, 17, 347–367.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54–71.
Lumley, T., & O'Sullivan, B. (2005). The effect of test-taker gender, audience and topic on task performance in tape-mediated assessment of speaking. Language Testing, 22(4), 415–437.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158–180.
Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22(1), 59–92.
McNamara, T. F. (1995). Modelling performance: Opening Pandora's box. Applied Linguistics, 16(2), 159–179.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
McNamara, T. F. (1997). 'Interaction' in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446–466.
Meiron, B., & Schick, L. (2000). Ratings, raters and test performance: An exploratory study. In A. J. Kunnan (Ed.), Fairness and validation in language assessment. Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 60–81). Cambridge: Cambridge University Press.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Morrow, K. (1979). Communicative language testing: Revolution or evolution? In C. J. Brumfit & K. Johnson (Eds.), The communicative approach to language teaching (pp. 143–157). Oxford: Oxford University Press.
Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing second language performance assessments (Technical Report No. 18). Honolulu: University of Hawaii, Second Language Teaching & Curriculum Center.
O'Loughlin, K. K. (2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169–192.
Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30, 143–154.
Savignon, S. J. (1972). Communicative competence: An experiment in foreign language teaching. Philadelphia: The Center for Curriculum Development.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99–123.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
SPSS Inc. (2001). SPSS Base 12.0 for Windows [Computer software]. Chicago, IL: SPSS Inc.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–335.
Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral test discourse. Language Testing, 14(1), 85–106.


Appendix A. Speaking Test Tasks

Task 1. Catering service
In this task, you need to complain about something. Imagine you have ordered food from Party Planner's Inc. for your boss's birthday party. But there was not enough food and it was delivered late. You spent a week planning the party, but it was ruined because of the food. You were extremely upset that it happened. Call the caterer to complain about it. You have 20 seconds to plan.
Prompt (Audio): [phone ringing] (Answering Machine) Hi! You've reached Party Planner's Inc. We're sorry, but we're not available to take your call right now. Please leave a detailed message after the beep, and we'll get back to you as soon as possible. [Beep]
Test-taker: (45 sec response time)

Task 2. Favorite movie
In this task, you will be asked to talk about a movie. Think about a movie that you liked and tell your friend about it. You have 20 seconds to plan.
Prompt (Video): Your friend: So, what was that movie you liked? What is it about?
Test-taker: (60 sec response time)

Task 3. Fly in soup
In this task, you need to tell the story in the pictures. Look at the pictures (pictures are shown on the screen). Imagine this happened yesterday while you were having dinner at the next table. Tell your friend what you saw. You have 60 seconds to plan your response.
Prompt (Video): Your friend: So, what happened last night at the restaurant?
Test-taker: (60 sec response time)


Task 4. Moving out
In this task, you need to refuse a request. Imagine you are renting an apartment from a nice old couple in New York City. You have been living there for over a year. Now, listen to a telephone message from the couple.
Hi, this is Mary, your landlady. Tom and I have been trying to contact you, but you never seem to be home. I guess you're really busy these days. Anyway…well, I don't know how to say this, but…our granddaughter is moving to the City next month. She's gonna study at Columbia…and, as you know, living in the city is expensive, and the rents are really high. So, she asked us if she could live in the apartment you have now. I know we just renewed your lease, and we have no right to ask you to move out, and, we really like you, too. But, do you think you can possibly look for a different apartment? We're really sorry about this, but we have to do this for our granddaughter. Since there's not much time, we'd like to hear from you as soon as possible, so we can let our granddaughter know too. Again, we're sorry…Call and let us know, ok? Thanks. (162 words)
(Q) Politely tell your landlady that you can't move out and explain why. You have 30 seconds to plan.
Prompt (Audio): Landlady: Hi. Come on in. Did you get our message? Have you thought about moving out?
Test-taker: (45 sec response time)

Task 5. Electric cars
In this task, you will be asked to summarize a radio commentary for a friend. Imagine your friend, Jim, is thinking about buying an electric car. Now, listen to the radio commentary.
(Host of the radio commentary) Today, we're talking about electric cars. As you're well aware, the conventional cars we drive everyday…use a lot of gasoline. You know, how the price of gasoline is going up…and more importantly, there's the issue of global warming—these cars release harmful pollutants, like carbon monoxide. So, in reaction to this, engineers have been working on cars that run on electric batteries, so…let's hear about the current state of the technology. We have a pre-recorded commentary by Ben Smith from General Autos.
Well, despite high expectations, the first generation of electric cars turned out to be a complete failure. Why? The first problem is the battery—I mean, current battery technology is still very limited. So electric cars can only travel a short distance before its battery needs recharging. What this means is you can't make long trips without worrying about the battery running out. They're only good for short trips like going to the supermarket or picking up the kids from school. And when you turn the air conditioner or the radio on, the battery is used up even quicker. Then, you might say, we can just recharge the battery when it's used up. Well…there's a serious problem with recharging, too. To recharge a battery, we need an electric outlet, right? But there aren't many charging stations—which means, the driver might get stuck…without being able to find a charging station nearby. Well, it gets even more frustrating. Even if you can find a station, it takes up to 3 hours to fully recharge a battery. It's way too long. Well, with these many limitations, does it make sense that anyone would want to buy an electric car, even if it is environmentally friendly?


(Q) Summarize what you heard on the radio for Jim. Be sure to include two main problems with electric cars. You have 30 seconds to plan.
Prompt (Video): Jim: Did I tell you I'm thinking about buying an electric car?…
Test-taker: (60 sec response time)

Task 6. Barbizon school
In this task, you will be asked to summarize a lecture for a classmate. Imagine your classmate, Jennifer, missed today's lecture about the Barbizon school. Now, listen to the lecture.
Today, we'll talk about a group of artists, called the Barbizon School. The Barbizon School is a group of French artists, who lived in the French town, Barbizon, and who developed the genre of landscape painting. So, what are their characteristics?
The Barbizon painters tried to find comfort in nature. I mean, they moved away from all the commotion and disruption happening in, then, revolutionary Paris, and sought solace in nature. And nature was the main theme of their paintings—they painted landscapes and scenes of rural life as true to life as possible. And they rejected the idea of manipulating or beautifying nature. Instead, they tried to achieve a true representation of the countryside. OK? Second, in addition to the efforts to paint nature as realistically as possible, they also tried to establish landscape as an independent, legitimate genre in France. Traditionally, landscape painting wasn't appreciated as a separate genre, but only considered as a background. But Barbizon artists reacted against this convention of classical landscape, and painted landscape for its own sake. With their huge success and recognition, the painters of the Barbizon school established landscape and themes of country life as vital subjects for French artists. Now, let's look at an example—a painting by Rousseau. This one is called "The Forest in Winter at Sunset". [Show the painting on screen]. It shows the ancient forest near the village of Barbizon. Rousseau is the best known member of the group. Each Barbizon painter had his own style and specific interests, and Rousseau's vision was melancholic and sad. Can you feel the depressing mood of the painting? At the top, a tangle of tree limbs, and birds flying into the cloudy, dark, sunset sky. After the sun sets, the forest will be freezing cold. Rousseau worked on this painting off-and-on for twenty years. He considered this his most important painting and refused to sell it during his lifetime.
(Q) Summarize the lecture for Jennifer. Be sure to include two main characteristics of the school and the example shown. You have 30 seconds to plan.
Prompt (Video): Jennifer: So, what was the lecture about? What did I miss?
Test-taker: (60 sec response time)


Appendix B. Analytic Scoring Rubric

Meaningfulness (Communication Effectiveness)
Is the response meaningful and effectively communicated?

5 Excellent. The response: is completely meaningful—what the speaker wants to convey is completely clear and easy to understand; is fully elaborated; delivers sophisticated ideas.
4 Good. The response: is generally meaningful—in general, what the speaker wants to convey is clear and easy to understand; is well elaborated; delivers generally sophisticated ideas.
3 Adequate. The response: occasionally displays obscure points; however, main points are still conveyed; includes some elaboration; delivers somewhat simple ideas.
2 Fair. The response: often displays obscure points, leaving the listener confused; includes little elaboration; delivers simple ideas.
1 Limited. The response: is generally unclear and extremely hard to understand; is not well elaborated; delivers extremely simple, limited ideas.
0 No. The response: is incomprehensible, or contains not enough evidence to evaluate.

Grammatical Competence: Accuracy, Complexity and Range

5 Excellent. The response: is grammatically accurate; displays a wide range of syntactic structures and lexical form; displays complex syntactic structures (relative clause, embedded clause, passive voice, etc.) and lexical form.
4 Good. The response: is generally grammatically accurate without any major errors (e.g., article usage, subject/verb agreement, etc.) that obscure meaning; displays a relatively wide range of syntactic structures and lexical form; displays relatively complex syntactic structures and lexical form.
3 Adequate. The response: rarely displays major errors that obscure meaning and a few minor errors (but what the speaker wants to say can be understood); displays a somewhat narrow range of syntactic structures, with too many simple sentences; displays somewhat simple syntactic structures; displays use of somewhat simple or inaccurate lexical form.
2 Fair. The response: displays several major errors as well as frequent minor errors, causing confusion sometimes; displays a narrow range of syntactic structures, limited to simple sentences; displays use of simple and inaccurate lexical form.
1 Limited. The response: is almost always grammatically inaccurate, which causes difficulty in understanding what the speaker wants to say; displays lack of basic sentence structure knowledge; displays generally basic lexical form.
0 No. The response: displays no grammatical control; displays severely limited or no range and sophistication of grammatical structure and lexical form; or contains not enough evidence to evaluate.

Discourse Competence: Organization and Cohesion

5 Excellent. The response: is completely coherent; is logically structured—logical openings and closures, logical development of ideas; displays smooth connection and transition of ideas by means of various cohesive devices (logical connectors, a controlling theme, repetition of key words, etc.).
4 Good. The response: is generally coherent; displays generally logical structure; displays good use of cohesive devices that generally connect ideas smoothly.
3 Adequate. The response: is occasionally incoherent; contains parts that display somewhat illogical or unclear organization, although as a whole it is in general logically structured; at times displays somewhat loose connection of ideas; displays use of simple cohesive devices.
2 Fair. The response: is loosely organized, resulting in generally disjointed discourse; often displays illogical or unclear organization, causing some confusion; displays repetitive use of simple cohesive devices, and use of cohesive devices is not always effective.
1 Limited. The response: is generally incoherent; displays illogical or unclear organization, causing great confusion; displays attempts to use cohesive devices, but they are either quite mechanical or inaccurate, leaving the listener confused.
0 No. The response: is incoherent; displays virtually non-existent organization; or contains not enough evidence to evaluate.

Task Completion
To what extent does the speaker complete the task?

5 Excellent. The response: fully addresses the task; displays completely accurate understanding of the prompt without any misunderstood points; completely covers all main points with complete details discussed in the prompt.
4 Good. The response: addresses the task well; includes no noticeably misunderstood points; completely covers all main points with a good amount of details discussed in the prompt (e.g., Electric Cars: two problems with the current technology, namely the battery running out quickly and the inconvenience of recharging; Barbizon School: two characteristics of the school and one example, namely painted nature, established landscaping as an independent genre, and the forest in the sunset as the example).
3 Adequate. The response: adequately addresses the task; includes minor misunderstanding(s) that does not interfere with task fulfillment; OR touches upon all main points but leaves out details; OR completely covers one (or two) main points with details but leaves the rest out.
2 Fair. The response: insufficiently addresses the task; displays some major incomprehension/misunderstanding(s) that interferes with successful task completion; OR touches upon bits and pieces of the prompts.
1 Limited. The response: barely addresses the task; displays major incomprehension/misunderstanding(s) that interferes with addressing the task.
0 No. The response: shows no understanding of the prompt, or contains not enough evidence to evaluate.

Intelligibility
Pronunciation and prosodic features (intonation, rhythm, and pacing)

5 Excellent. The response: is completely intelligible although accent may be there; is almost always clear, fluid and sustained; does not require listener effort.
4 Good. The response: may include minor difficulties with pronunciation or intonation, but is generally intelligible; is generally clear, fluid and sustained, although pace may vary at times; does not require much listener effort.
3 Adequate. The response: may lack intelligibility in places, impeding communication; exhibits some difficulties with pronunciation, intonation or pacing; exhibits some fluidity but may not be sustained at a consistent level throughout; may require some listener effort at times.
2 Fair. The response: often lacks intelligibility, impeding communication; frequently exhibits problems with pronunciation, intonation or pacing; contains frequent pauses and hesitations; may require significant listener effort at times.
1 Limited. The response: generally lacks intelligibility; is generally unclear, choppy, fragmented or telegraphic; contains consistent pronunciation and intonation problems; requires considerable listener effort.
0 No. The response: completely lacks intelligibility, or contains not enough evidence to evaluate.


Investigating the Construct Validity of the MELAB Listening Test through the Rasch Analysis and Correlated Uniqueness Modeling

Christine Goh and S. Vahid Aryadoust
National Institute of Education, Nanyang Technological University, Singapore

ABSTRACT
This article evaluates the construct validity of the Michigan English Language Assessment Battery (MELAB) listening test by investigating the underpinning structure of the test (or construct map) and possible construct-underrepresentation and construct-irrelevant threats. Data for the study, from the administration of a form of the MELAB listening test to 916 international test takers, were provided by the English Language Institute of the University of Michigan. The researchers sought evidence of construct validity primarily through correlated uniqueness models (CUM) and the Rasch model. A five-factor CUM was fitted to the data but did not display acceptable measurement properties. The researchers then evaluated a three-trait1 confirmatory factor analysis (CFA) model that fitted the data sufficiently. This fitting model was further evaluated with parcel items, which supported the proposed CFA model. Accordingly, the underlying structure of the test was mapped out as three factors: the ability to understand minimal context stimuli, short interactions, and long-stretch discourse. The researchers propose this model as the tentative construct map of this form of the test. To investigate construct-underrepresentation and construct-irrelevant threats, the Rasch model was used. This analysis showed that the test was relatively easy for the sample and that the listening ability of several higher-ability test takers was not sufficiently tested by the items. This is interpreted to be a sign of test ceiling effects and minor construct-underrepresentation, although the researchers argue that the test is intended to distinguish students who have the minimum listening ability to enter a program from those who do not. The Rasch model provided support for the absence of construct-irrelevant threats by showing the adherence of the data to unidimensionality and local independence, and the good measurement properties of the items. The final assessment of the observed results showed that the generated evidence supported the construct validity of the test.

1 In this article, the terms (latent) trait, factor, and construct have been used interchangeably.


Introduction

The Michigan English Language Assessment Battery (MELAB) was founded in 1985 to measure the English language proficiency of nonnative-English-speaking applicants to American and Canadian universities, professional workers who need to produce a certificate of English proficiency, and anyone else interested in testing his or her English language proficiency (MELAB Technical Manual, 2003). The predecessor of the test, the Lado Test of Aural Comprehension (a multiple-choice listening test), evolved gradually into the current MELAB, which has been informed by ongoing research efforts supported by the University of Michigan. At present, the test is administered in 29 states in North America and encompasses three compulsory sections: (a) composition: 200–300 words (30 minutes); (b) listening comprehension: 50 questions (25 minutes); and (c) grammar, comprehension, vocabulary, reading: 100 questions (75 minutes). An optional speaking test that lasts about 15 minutes can also be taken by the test taker.

The listening component of the MELAB has been researched in the past few years. The underlying factors of the test have been investigated by Eom (2008); Eom tested a model comprising language knowledge and comprehension and provided support for the hypothesized underpinning structure of the test. However, a methodological problem in that study was that error terms were allowed to covary heavily without theoretical support. Wagner (2004) also studied the factor structure of the listening subtests of the MELAB and the Examination for the Certificate of Proficiency in English (ECPE); the MELAB study did not statistically separate the hypothesized underlying factors—the ability to understand explicitly and implicitly stated information—in the listening section successfully, indicating that this dichotomy may be an artifact.

The present study seeks to continue validation efforts for the MELAB listening test and to address some of the limitations of the earlier studies. The main aims of the study include (a) postulating and evaluating the construct map of the test, and (b) investigating construct representation and irrelevant factors or contaminants. According to Wilson (2005), a construct map is a modeled graphical representation of the underlying continuum of the construct that entails "a coherent and substantive definition for the content and the construct" (p. 26). The construct map, which "precipitates an idea or a concept," is the representation of a unidimensional latent trait that we seek to measure. Every test measuring a construct has a construct map representing the components and structure of the construct. Towards this end, a latent variable model was used in the present study to help develop the construct map. For investigating construct representation and the presence of construct contaminants, we used the Rasch model (Bond, 2003; Haladyna & Downing, 2004). Needless to say, the latent variable model analysis used for the first question also informed us about the presence of construct-irrelevant factors (Haladyna & Downing, 2004).

The present study begins with a review of selected listening comprehension literature and the technical manual of the MELAB test. Next, a listening model and a validation framework are proposed, and a content analysis of the test is conducted. Subsequently, models are generated to replicate the traits detected in the content analysis. This is followed by the Rasch investigation of item properties and the test. Results and findings from the analyses are then grounded in an evidence-based construct validity framework proposed by Chapelle (1994), because that framework concerns construct validity, the "cornerstone of the definition" of validity; other validation frameworks, such as Kane's (1992, 2001), are useful but more general and require an extensive set of data and support for validity in a general sense, a concern that is beyond the scope of this study.


The Listening Comprehension Construct in Assessment

Scholarly literature on second language (L2) listening comprehension includes several conventional models and frameworks. Richards' (1983) taxonomy, Tatsuoka and Buck's (1998) cognitive assessment through the rule-space technique, and Buck's (2001) model are attempts to explore the constituent structure of the skill and students' cognitive processes. Several researchers have also explored the divisibility of listening comprehension from other language skills and reported conflicting findings (Buck, 1992; Farhady, 1983; Oller & Hinofotis, 1980; Oller, 1978, 1979, 1983; Scholz, Hendricks, Spurling, Johnson, & Vandenburg, 1980). While these hypothetical models and taxonomies have deepened our understanding of the listening skill, there is a need to provide a clear and unifying definition of the skill.

Whereas listening comprehension was once assumed to be entirely a bottom-up process, later models posited that top-down processing takes place to understand implied messages. These perspectives on listening process have guided test developers and analysts in contemporary tests of L2 listening (Brindley, 1998; Buck, 1990; Rost, 1990; Tsui & Fullilove, 1998; Wagner, 2002, 2004). However, there is neither consensus over methods of testing listening skills nor an absolutely unified listening construct in terms of its definition (Dunkel, Henning, & Chaudron, 1993). For example, Glenn (1989) conducted a content analysis of 50 definitions of the listening construct and concluded that there was no universal agreement on the nature of this skill. Glenn further noted that this lack of agreement impeded research into listening assessment and even other areas where listening is involved, such as communication studies.

L2 researchers have used the two-level strategic comprehension model for discourse comprehension, which was originally proposed by Kintsch and van Dijk (1978) (also van Dijk & Kintsch, 1983) to define the listening construct (for example, Buck, 2002; Wagner, 2002; 2004). Kintsch and van Dijk’s (1978) theory is a mix of Kintsch’s research on psychology, which developed the concept of propositions, and van Dijk’s studies on functional linguistics, which introduced macro-operators. According to this model, comprehenders have two types of strategy to comprehend discourse: local and global coherence strategies. Local strategies connect components within sentences or clauses throughout the text to make sense of the text at a sentential level. Global strategies generate the “macrostructure”; it helps comprehenders explore the theme, main ideas and their interrelations, and the entire discourse structure. These two strategies do not operate independently: when comprehenders process consecutive clauses, they use local strategies to process meaning of individual utterances; simultaneously, they also use global strategies to ensure a comprehensive interpretation of the textbase which is being generated.

Using global strategies in L2 listening is sometimes taken as synonymous with top-down information processing (Nation & Newton, 2009; Shohamy & Inbar, 1991). Top-down processes help listeners make inferences and form expectations about the text structure. They are different from bottom-up processes, which depend on local strategies. Bottom-up processing helps decipher the phonological stimulus and involves rebuilding individual sounds into words and constructing clauses. Kintsch and van Dijk's (1978) model has been used in some L2 listening studies (Buck, 1993; Hansen & Jenson, 1994; Shohamy & Inbar, 1991). Shohamy and Inbar in particular emphasized that a competency-based approach to testing L2 listening should focus on top-down and bottom-up processing skills.

Similar to Kintsch and van Dijk's (1983) model, Buck (1991, 1992, 1994, 2001) offered a listening construct encompassing the ability to understand explicitly articulated information and the ability to understand implicitly stated information; understanding explicit information is the ability to comprehend the verbal presentation of the message, and understanding implicit information is the ability to make inferences based on world knowledge and schema. Buck (2001) refers to this model as a "default listening model," stating that the model is general and flexible and can be expanded in various contexts. The validity of this model has been investigated via multivariate data analysis methods such as exploratory factor analysis (EFA) (Wagner, 2002, 2004) and confirmatory factor analysis (CFA) (Liao, 2007). Some researchers have argued that this model "delimits focus to the cognitive operation" of comprehension (Dunkel et al., 1993) and disassociates listening processes from higher, complex processes that concern cognition, such as synthesis and evaluation. However, Wagner's (2004) factorial study, which was intended to show the discriminability (divisibility) of the ability to understand explicitly and implicitly articulated information, provided only limited evidence supportive of the two-factor model. Conversely, Liao (2007) reported that the variation in items of a listening test was accounted for by the two hypothesized latent traits. Liao also reported significantly high correlations between the two latent traits.

In another listening model proposed for the Test of English as a Foreign Language (TOEFL), Bejar, Douglas, Jamieson, Nissan, and Turner (2000) regarded L2 listening as a two-stage process: listening and response. In the listening stage, concurrent with hearing the aural message, pertinent situational knowledge (context role), linguistic knowledge (phonology, lexicon, morphology, and syntax), and background knowledge are activated to construct a set of simple statements or propositions; response takes two major forms: aural and written. According to Bejar et al., test developers should not base the test too heavily on either of these response types, because they can introduce construct irrelevant factors to the assessment of listening: if this stage overloads the mental processes of listeners, the measurement error will be overwhelming.

Some researchers tried to separate the listening construct from other language skills in an effort to demonstrate that listening is a separate trait (construct). Oller and Hinofotis (1980), Oller (1978, 1979, 1983), and Scholz, Hendricks, Spurling, Johnson, and Vandenburg (1980) used exploratory factor analysis (EFA) to isolate listening as a trait among other traits. However, EFA did not separate this trait. The researchers proposed that language proficiency is a unique and monolithic trait that cannot be partitioned. Interestingly, other researchers offered counterevidence and argued for the separability of language traits and listening (Buck, 1992; Farhady, 1983; Sawaki, Sticker, & Andreas, 2009).

This brief review shows that listening comprehension has different underlying processes. Wagner (2002) summarizes these processes as a general listening comprehension model comprising multiple major components: ability to understand details—indicative of bottom-up comprehension process—and large stretches of discourse (Buck, 2001; Richards, 1983), ability to comprehend major points or gist—recognized as top-down comprehension process—as stated by Richards (1983), ability to make inferences (Hildyard & Olson, 1978), and the ability to guess the meaning of unknown words from the context. We seek to investigate the operationalized MELAB listening construct in this study. We anticipate that we will identify some of these skills in the test.

Michigan English Language Assessment Battery Listening Test

The listening section of the Michigan English Language Assessment Battery (MELAB) has three parts, consisting of a total of 50 multiple-choice items in the entire test.


Test instructions are delivered to the test takers, who are asked to answer questions that are read to them after they hear the stimuli. Following this, candidates choose the most appropriate response from among three printed alternatives in the test booklet.

According to the Michigan English Language Assessment Battery technical manual, referred to hereafter as the MELAB Technical Manual (English Language Institute of the University of Michigan, 2003), there are four test forms (DD, EE, FF, and GG); the DD and EE forms are somewhat older than the other two and are now retired. While the DD and EE forms comprised emphasis items, conversations, and extended talks, the FF and GG forms include minimal context questions, short conversations, and long radio interviews. Emphasis items have been retired and are not used in the new test forms. In minimal context items, the listener assumes the role of an interlocutor to provide an answer to a question, invitation, etc., or to select the best paraphrase of a short utterance they have heard. Conversations, long talks, and radio interviews have a more extended context compared with minimal context items.

The principal aim of the test is summarized as follows (English Language Institute of the University of Michigan, 2003):

The listening test of the MELAB is intended to assess the ability to comprehend spoken English. It attempts to determine the examinee's ability to understand the meaning of short utterances and of more extended discourse as spoken by university-educated, native speakers of standard American English. It requires that examinees activate their schemata to interpret the meaning of what they hear and use various components of their linguistic system to achieve meaning from the spoken discourse, and presumes the activation of various comprehension abilities such as prediction, exploitation of redundancy in the material, and the capacity to make inferences and draw conclusions while listening. The test does not attempt to specifically incorporate a variety of English dialects or registers but focuses on general spoken American English—conversational as well as semi-planned and planned speech, e.g., lectures based on written notes and radio interviews with topic experts. (p. 34)

This paragraph is the principal resource identifying the types of listening comprehension abilities that the MELAB listening test is intended to measure. Based on the description of the listening test above, the competencies examined are summarized as follows:

1. Ability to use the individual's schemata to interpret meaning
2. Ability to use components of the individual's linguistic system (e.g., grammar, vocabulary, etc.) to construct their understanding
3. Ability to use a range of comprehension skills and strategies
4. Ability to make inferences and draw conclusions

Some of these competencies have been studied previously; as noted earlier, Wagner (2004) investigated the factor structure of long talks through exploratory factor analysis. In that study, he did not find strong evidence that this section of the test targets the ability to make inferences and understand explicitly articulated information. Following this study and using a CFA model, Eom (2008) reported that language knowledge and comprehension are two underlying factors measured by the MELAB listening test. While the baseline latent trait model in the study did not fit the data well, Eom allowed the error terms to covary heavily. This measure can improve the fit of the model, but it will also yield a less parsimonious model (lower degrees of freedom). A parsimonious model is less complex in terms of the relationships between indicators, error terms, and latent variables and is able to explain the underlying cognitive processes of the test efficiently. That is, the more paths we add to the model, the better the fit, but the lower the parsimony; good fit in an unparsimonious model does not always translate into a sound model (Raykov & Marcoulides, 1999; Schumacker & Lomax, 2004). Because the modification does not appear to be directly informed by theory in Eom's study, the implications of the study are limited. In the present study, we seek to provide a less complex (more parsimonious) model which captures the complexity of the MELAB listening construct and approximates the cognitive processes of the test takers.
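The parsimony argument can be made concrete with a simple count. For p observed variables, the sample covariance matrix supplies p(p + 1)/2 unique pieces of information, and every error covariance freed post hoc is one more estimated parameter, so the model's degrees of freedom drop one for one. The short Python sketch below uses the 50 MELAB listening items as p; the number of freed covariances is a hypothetical figure, not one taken from Eom's (2008) model.

p = 50                                # observed variables (the 50 listening items)
unique_moments = p * (p + 1) // 2     # variances and covariances supplied by the data
print(unique_moments)                 # 1275

freed_error_covariances = 30          # hypothetical count of post-hoc modifications
print(freed_error_covariances)        # each one costs a degree of freedom and reduces parsimony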

The Study

Objectives of the Study

The major objectives of the study are:

1. To determine the underpinning factor structure or the construct map of the MELAB listening test.

2. To determine construct-underrepresentation and construct-irrelevant threats to the construct validity of the listening test, if any.

Methodology

Participants

A data set of the performance of 916 candidates who took the MELAB test was provided by the English Language Institute (ELI) of the University of Michigan. Although the participants in the test were from 78 countries, the ELI noted that the data are not from all countries where the test is administered. All test takers whose data were used in the current study took the same test. Of these, 425 were female and 427 were male; gender information for 64 test takers was missing.

Materials

The ELI provided the test materials, including the scripts of the audio stimuli and 50 items which were of three types: (a) 15 minimal context items, (b) 20 short conversation items, and (c) 15 long radio interview items (three interviews). In minimal context items, test takers should choose the correct response to an invitation, offer, etc. The following example is from the MELAB information and registration bulletin (2009):

You hear: When's she going on vacation?
You read: a. last week
          b. to England
          c. tomorrow
The correct answer is c, tomorrow. (p. 8)

The MELAB Technical Manual refers to this item type as "minimal context items" (English Language Institute of the University of Michigan, 2003). The manual states that these questions measure different aspects of comprehension at the item and test levels; on the one hand, they assess the ability to comprehend the "conversational patterns" in daily spoken English, and on the other, they test the ability to understand new information. An element of predicting what the other interlocutors would say is fundamental to answering minimal context items.

The MELAB Information and Registration Bulletin (English Language Institute of the University of Michigan, 2009) explains that short conversation items evaluate test takers' understanding of short conversations or talks. An example of a short conversation item follows:

You hear: A: Let's go to the football game.
          B: Yeah, that's a good idea. I don't want to (wanna) stay home.
You read: a. They'll stay home.
          b. They don't like football.
          c. They'll go to a game. (p. 8)

This item assesses the comprehension of a longer stretch of discourse. Understanding the illocutionary forces of the items (e.g., requests, invitations, apologies, etc.) alongside the literal meaning in conversations is necessary to successfully answer these items (MELAB Technical Manual, 2003). In order to interpret the illocutionary meaning of these exchanges, the candidate will have to make inferences and draw conclusions where needed.

In the final part of the test are longer audio inputs. In this section, simulated radio talks and conversations are delivered to test takers. They are allowed to take notes. The presence of "graphic materials" serves to further contextualize the aural input. As a general observation, the printed options throughout the test are short, ranging from two to seven words each. This helps minimize the use of graphological knowledge and the effects of reading skills on listening. The MELAB Technical Manual states that grammatical, textual, lexical, functional, and sociolinguistic knowledge are the principal components of the comprehension items.

Analysis

According to Kirsch and Guthrie (1980), the notion of validity is dependent upon "the congruence between the stated purpose of the test and what is being measured by the test" (p. 90). To investigate this congruence, Messick (1989) and others suggest that researchers use such statistical methods as factor analysis; Borsboom et al. (2004) proposed latent trait models; and Wright and Stone (1999) recommended the Rasch model (also, see Bachman [2004] for a review of quantitative methods of validation). The present study uses the following methods:

1. A particular confirmatory factor analysis modeling approach known as correlated uniqueness model for building a construct map

2. Rasch measurement for investigating construct underrepresentation and construct irrelevant threats
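For reference, the second method listed above rests on the dichotomous Rasch model, in which the probability of a correct answer depends only on the difference between person ability and item difficulty (both expressed in logits). The short Python function below is a generic illustration of that formula rather than output from the study's own analysis; the example values are hypothetical.

import math

def rasch_probability(theta: float, delta: float) -> float:
    """Probability of a correct response for person ability theta and item difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# A relatively able test taker (theta = 2 logits) facing an easy item (delta = -1 logit)
# succeeds almost always; when most items behave this way, a ceiling effect appears.
print(round(rasch_probability(2.0, -1.0), 2))  # 0.95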

Before the results of the analysis are reported, we present the proposed listening models for MELAB, as well as explain the rationale for the use of the two methods of statistical analyses. In the results section, we will also describe the compensatory strategies employed and the results for each type of analysis before arriving at a conclusion for each research question.


Proposed Listening Models for MELAB

Figure 1 illustrates a conceptualized listening construct in the MELAB listening test based on the Test Aims paragraph (see next paragraph). As stated previously, the abilities to use local and global coherence strategies, which help comprehend explicitly and implicitly stated information, are two components of a listening trait. We use the explicit/implicit terminology because it is used in the MELAB literature. The ability to make propositional and enabling inferences, alongside understanding context-reduced stimuli, understanding explicit aural input, and making close paraphrases, is present in the three test models in Figure 1 (see the Materials section for a review of the test). A major objective of the MELAB listening test is to assess the ability to "make inferences and draw conclusions," which we have divided into propositional and enabling inferences based on the L2 literature (Hildyard & Olson, 1979). Likewise, we divided the ability to understand explicit information into close paraphrase and detailed information. Figure 1 presents three listening models for the MELAB listening test. In each model, big circles represent latent traits; boxes represent observed variables or test items; error terms are displayed as small ellipses with arrows pointing to boxes; regression paths are indicated as one-headed or unidirectional arrows; and correlations are indicated as two-headed or bidirectional arrows. In each model, only ten items are displayed for reasons of space.

Model 1 will be modified if it does not fit the data satisfactorily. It is hypothesized in Model 1 that there are five separable traits underpinning the test and that five types of items measure test takers' trait levels: minimal context questions (MCQ), detailed or explicit information questions (DIQ), close paraphrase questions (CPQ), propositional inference questions (PIQ), and enabling inference questions (EIQ); all of these traits are correlated. This model is a correlated uniqueness model (CUM) with uncorrelated error terms because the error terms are not free, meaning that they are not covarying. If we free the error terms (covary them using double-headed arrows), Model 2 is generated.

Model 2 is proposed in the event that the first CFA model does not fit the data, which would imply a method factor effect in the data. Model 2 is similar to Model 1 except that it allows error terms to correlate; correlation is indicated by double-headed arrows connecting (covarying) the error terms. Not all error terms are allowed to correlate: the methods in Model 2 comprise the two types implied in the MELAB Technical Manual, that is, a short stimuli method and a long stimuli method. Based on this definition, we treat the items in sections 1 and 2 as the short stimuli method and the items in section 3 as the long stimuli method, and we covary the error terms within each method. The justification for Model 2, with correlated traits and error terms (uniquenesses), is that the observed variance in the data is assumed to be a joint function of traits and methods. Accordingly, if there is a method effect, we expect an increment in model fit when we free (covary) the error terms. The error correlations, in turn, help us model the shared method variance that is unique to the measuring tool (see Bachman, 2004, pp. 283–287).

While some studies show that listening comprehension is a general, nondivisible latent factor (Wagner, 2004), others have separated theory-informed factors (Buck, 1992; Hansen & Jensen, 1994). It is likely that there is a two-level latent trait, with one level being a general listening trait and the other being listening components. This hypothesis is evaluated as a model competing with Models 1 and 2. Model 3, which is a second-order CFA model, has a major latent variable whose indicators are themselves latent; that is, there are “two layers of latent constructs” (Hair, Black, Babin, & Anderson, 2010, p. 815), in which the higher-order latent variable causes the lower-order latent variables.


In our model, the higher-order variable is the listening construct, and the lower-order variables are minimal context questions (MCQ), detailed or explicit information questions (DIQ), close paraphrase questions (CPQ), propositional inference questions (PIQ), and enabling inference questions (EIQ). The distinctive feature of Model 3 is therefore the presence of a higher-order factor (a general listening ability) which is hypothesized to cause the proposed separate traits. If a method effect is present, then Model 3 (on the next page) will not display good fit, indicating that a CUM with covarying error terms is more suitable for explaining the structure underpinning the MELAB listening construct. Competing models may fit the data equally well, but the best one is the most parsimonious and theory-informed model.

Statistical Analysis

Confirmatory Factor Analysis

The present study employs a confirmatory factor analysis (CFA) approach to build a construct map representing the constituent structure of the test. CFA provides evidence for the power and specification of an a priori factor model (Schmitt & Stults, 1986; Brown, 2006). Conway, Lievens, Scullen, and Lance (2004) classified CFA into four variants: (a) linear-additive CFA (Widaman, 1985, 1992), (b) hierarchical second-order factor (SOF) CFA (Marsh & Hocevar, 1988), (c) correlated uniqueness models (CUM) (Kenny, 1976; Marsh, 1989; Brown, 2006), and (d) direct product (DP) models of multiplicative trait-method effects (Campbell & O’Connell, 1967; Cudeck, 1988). CUM is used in the present study to solve the multitrait-multimethod (MTMM) matrix problem. Proposed by Campbell and Fiske (1959) to examine construct validity, the MTMM matrix assumes that each factor or trait is measured by several methods and is arranged so that traits are nested within methods; that is, the matrix is laid out by method blocks, each comprising at least three trait cells. The method leads the researcher to multiple sources of evidence of construct validity, most notably the correlation between tests assessing the same trait (convergent validity), which should be high, and the correlation between tests assessing different traits (divergent validity), which should be low (Bachman, 2004). There are some problems in the analysis of MTMM matrices, such as negative degrees of freedom, non-positive definite matrices (see Schumacker & Lomax, 2004), and the requirement that each trait in the matrix be assessed by at least three methods. Correlated uniqueness modeling has been proposed as an alternative, less demanding approach to solving the MTMM matrix. To produce a CUM, researchers need to define at least two factors (F) measured by three methods (M), but a 2F × 2M model may fit when the factor loadings of indicators (items) loading on the same latent variable are constrained to be equal (Brown, 2006, p. 220).

Like other latent trait models, CUM has faced criticism in the literature. Whereas some researchers have recommended CUM for solving MTMM matrices (Brown, 2006; Lievens & Conway, 2001; Marsh, 1989), others argue that CUM estimates are biased (Lance, Noble, & Scullen, 2002). However, this bias has been shown to be trivial: Marsh and Bailey (1991) used simulated data to explore the bias and concluded that it is not significant, and the study by Tomás, Hontangas, and Oliver (2000) provided further evidence for this finding. In summary, in contexts where there are only two latent traits and two methods, CUM is a good way to solve the matrix (Brown, 2006).


Figure 1. Three Proposed Models for the MELAB Listening Test. Only ten items are displayed in this figure for reasons of space. Model 1 with uncorrelated error terms (uniquenesses) is included mainly as a baseline model. Model 2 with correlated error terms (uniquenesses) is the modified model. Model 3 is a second-order model with listening as the higher order trait. (Legend: MCQ = minimal context questions. DIQ = detailed information questions (explicit). CPQ = close paraphrase questions. PIQ = propositional inference questions. EIQ = enabling inference questions.)


Rasch Measurement

We use the Rasch model to investigate construct underrepresentation and construct threats in the present study. The model has two central features: (a) expected probabilities for persons and items and (b) fit indices used to argue how well persons and items fit the model. These features are valid when the local independence and unidimensionality criteria are established (McNamara, 1996). The Rasch model provides item and person measures based on the mathematical modeling of the data. The basic model from which other models are derived is:

$$\phi_{ni1} = \frac{e^{(B_n - D_i)}}{1 + e^{(B_n - D_i)}}$$

where $\phi_{ni1}$ is person n’s probability of scoring 1 (i.e., answering item i correctly), $B_n$ is the ability of person n on the entire test, and $D_i$ is the difficulty level of item i. According to the model, the probability of success in answering an item is governed by person ability and item difficulty. The Rasch model can help investigate the construct validity of measuring tools by providing the opportunity to examine construct-irrelevant variance (CIV) and construct underrepresentation (CUR), both discussed in Messick (1989). The Rasch model is also suitable for assessing item bias, which is a major source of CIV across subsamples. This analysis is a test of invariance, also known as Differential Item Functioning (DIF). At the item level, DIF analysis detects items that function significantly differently in different groups and flags them for further analysis and deletion (Bond & Fox, 2007; Wright & Stone, 1999). At the test level, it identifies the covariates that have contaminated the measurement, thus introducing construct-irrelevant variance.
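To make the formula concrete, the following sketch (ours, not part of the original analysis, which was run in WINSTEPS) computes the modeled probability of success for a given person ability and item difficulty, both in logits:

```python
import math

def rasch_probability(ability, difficulty):
    """Probability that a person of the given ability answers an item of the
    given difficulty correctly; both arguments are in logits."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

# A person exactly at an item's difficulty succeeds half the time;
# a person one logit above it succeeds about 73% of the time.
print(round(rasch_probability(0.0, 0.0), 2))  # 0.5
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73
```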

Results

Descriptive Statistics and Reliability Analysis

We examined the quantitative features of the data. Descriptive statistics summarize the features of the data in a concise and accessible way. We calculated the mean, standard deviation (SD), skewness, and kurtosis using the Statistical Package for the Social Sciences (SPSS), Version 16 (see Table 1). The table presents items in the three test sections: section one contains minimal context questions (items 1–15), section two contains short conversations (items 16–35), and section three contains simulated long radio interviews (items 36–50).

Normality of the observed data should hold in factor analysis studies. Univariate normality was investigated through the skewness and kurtosis of the data. Normal distributions have a skewness index of zero, although a range of -2 to +2 is considered acceptable (Bachman, 2004); random errors can sometimes shift the observed value. Kurtosis is the degree of flatness or peakedness: negative values indicate a fairly flat distribution and positive values indicate a distribution with a high peak. Item 8 had skewness and kurtosis indices greater than |2|, indicating that it deviated slightly from the properties of a normal distribution (Bachman, 2004). This item also had the highest mean and the lowest SD. Other items did not display unusual skewness or kurtosis values, indicating approximately normal distributions.
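These descriptive checks are straightforward to reproduce; the sketch below uses invented scores for a single hypothetical item rather than the study data, and scipy’s kurtosis function returns excess kurtosis (zero for a normal distribution), matching the convention in Table 1.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Invented dichotomous responses (1 = correct, 0 = incorrect) to one item.
item = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1])

print(item.mean())      # item facility (proportion correct)
print(skew(item))       # negative here, because the item is easy
print(kurtosis(item))   # excess kurtosis
```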

Next, using the KR-21 formula, we investigated the reliability, or internal consistency, of the observed scores.


Internal consistency indicates how much of the variation in observed scores is attributable to error and how much to the true score. The respective KR-21 indices for sections 1, 2, and 3 are 0.65, 0.71, and 0.65. We also computed KR-21 indices for the item groups identified in the content analysis, as displayed in Table 2: detailed (explicit) information: 45.21 (6 items); close paraphrase: 0.60 (13 items); propositional: 0.50 (9 items); enabling: 0.42 (7 items); minimal context: 0.65 (15 items). Additionally, KR-21 for all items assessing explicitly stated information (19 items) was 0.68 and for items assessing implicitly stated information (16 items) was 0.64. According to Pallant (2007), low reliability indices often reflect a small number of items, which results in higher measurement error; accordingly, as the number of well-designed items in an analysis increases and measurement error drops, the reliability index tends to increase. The reliability index for the entire test was 0.85. This value is very close to the average KR-21 index of 0.81 and closer to the average reliability index of 0.87 for candidates intending to further their education, as stated in the MELAB Technical Manual (2003).
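As a reference for the reliability figures above, KR-21 can be computed directly from test takers’ total raw scores; the sketch below uses invented totals, not the MELAB sample.

```python
import numpy as np

def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21: (k/(k-1)) * (1 - M(k-M)/(k*s^2)),
    where M and s^2 are the mean and variance of the total scores and
    k is the number of items (items assumed to be of similar difficulty)."""
    m = np.mean(total_scores)
    s2 = np.var(total_scores, ddof=1)
    k = n_items
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * s2))

totals = np.array([31, 42, 28, 36, 45, 33, 39, 25, 41, 37, 30, 44, 35, 29, 38])
print(round(kr21(totals, 50), 2))
```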

Table 1. Descriptive Statistics for the MELAB Listening Test

Items    Mean    SD    Skewness    Kurtosis

V1 .48 .50 .08 -1.99 V2 .53 .50 -.10 -1.99 V3 .64 .48 -.60 -1.64 V4 .74 .44 -1.09 -.79 V5 .70 .46 -.86 -1.25 V6 .45 .50 .193 -1.96 V7 .59 .49 -.35 -1.88 V8 .87 .33 -2.22 2.93 V9 .80 .40 -1.50 .263 V10 .80 .40 -1.48 .212 V11 .73 .44 -1.06 -.86 V12 .49 .50 .048 -2.00 V13 .55 .50 -.22 -1.95 V14 .53 .50 -.12 -1.98 V15 .75 .43 -1.14 -.69 V16 .54 .499 -.158 -1.979 V17 .50 .500 -.017 -2.004 V18 .59 .491 -.383 -1.857 V19 .60 .489 -.429 -1.820 V20 .61 .487 -.462 -1.790 V21 .55 .498 -.207 -1.962 V22 .62 .486 -.481 -1.772 V23 .67 .469 -.742 -1.453 V24 .79 .406 -1.446 .090 V25 .68 .466 -.779 -1.396 V26 .40 .490 .406 -1.839 V27 .52 .500 -.066 -2.000


V28 .67 .470 -.736 -1.461 V29 .60 .491 -.401 -1.843 V30 .69 .463 -.823 -1.326 V31 .62 .485 -.515 -1.739 V32 .69 .464 -.806 -1.353 V33 .76 .429 -1.204 -.550 V34 .76 .430 -1.191 -.584 V35 .63 .483 -.544 -1.708 V36 .53 .500 -.105 -1.993 V37 .67 .469 -.747 -1.445 V38 .54 .499 -.149 -1.982 V39 .61 .487 -.472 -1.781 V40 .58 .494 -.314 -1.905 V41 .62 .485 -.510 -1.744 V42 .69 .461 -.839 -1.298 V43 .78 .412 -1.381 -.092 V44 .50 .500 .017 -2.004 V45 .84 .370 -1.820 1.316 V46 .77 .423 -1.268 -.392 V47 .65 .478 -.618 -1.622 V48 .67 .472 -.705 -1.507 V49 .47 .499 .123 -1.989 V50 .66 .474 -.674 -1.549

Note. n = 916 in the sample. The first section contains minimal context questions (items 1–15). The second section contains short conversations (items 16–35). The third section contains simulated long radio interviews (items 36–50).

Content Analysis

From a competency-based viewpoint, the Test Aims paragraph in the MELAB Technical Manual assumes that the test measures different listening skills. We first conducted a content analysis and categorized the items into five types. This analysis is informed by previous research, as noted earlier (Buck, 2002; Hansen & Jensen, 1994; Shohamy & Inbar, 1991; Wagner, 2002), as well as by the MELAB Technical Manual. The five categories are:

1. minimal context items 2. explicit items (close paraphrase) 3. explicit items (detailed information) 4. implicit items (propositional inferences) 5. implicit items (enabling inferences)


As demonstrated in Table 2, these five item types are classified into three major categories. According to the MELAB Technical Manual, minimal context items assess the ability to understand the unexpected; according to Hildyard and Olson (1979), Hansen and Jensen (1994), and Wagner (2002), the ability to understand detailed information and to make close paraphrases is subsumed under the comprehension of explicitly stated information; and the ability to make propositional and enabling inferences is subsumed under the ability to comprehend implicitly stated information (Hildyard & Olson, 1979). Therefore, three major skills (understanding the unexpected and assessing explicitly and implicitly said information), subsuming five item types, are presented in Table 2.

A content analysis of items and texts was performed to map the items on the posited factor structure of minimal context, explicit information, close paraphrase, propositional inferencing and enabling inferencing. This stage in validation provides content-referenced evidence for the validity of the test. We performed three rounds of content analysis to increase the internal reliability of findings. Each researcher conducted a round of analysis separately. In the final phase, we discussed the item characteristics and the skills they assessed, based on the Test Aims paragraph. Any doubtful classification of items was further reviewed by both authors for a final decision on the classification of the items. Table 2 provides a summary of the findings.

Table 2. Results of the Content Analysis of the Items and Texts

Understanding the unexpected
    Minimal context: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

Assessing explicitly said information
    Detailed information: 40, 46, 47, 48, 49, 50
    Close paraphrase: 18, 21, 30, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45

Assessing implicitly said information
    Propositional inferencing: 16, 22, 24, 25, 26, 27, 29, 31, 33
    Enabling inferencing: 17, 19, 20, 23, 28, 32, 34

Note. This table presents three major skills—understanding the unexpected and assessing explicitly/implicitly said information—subsuming five item types—minimal context, explicit information, close paraphrase, propositional inferencing, and enabling inferencing.

The minimal context items in Table 2 (items 1–15) assess the ability to comprehend minimally contextualized colloquial language, especially components of the linguistic system such as grammar, vocabulary, and idioms, and the ability to respond to the stimulus. Similarly, items assessing the understanding of detailed information (40, 46, 47, 48, 49, 50) and the making of close paraphrases (18, 21, 30, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45) require using the linguistic system alongside schema. The capacity to make inferences and draw conclusions is required by the remaining items: propositional inferencing (16, 22, 24, 25, 26, 27, 29, 31, 33) and enabling inferencing (17, 19, 20, 23, 28, 32, 34) (MELAB Technical Manual).


The majority of items in section 2 belong to the detailed and close paraphrase categories, whereas the long interviews in section 3 (in particular the last interview) assess understanding of detailed information.

Confirmatory Factor Analysis

To examine the fit of the postulated models to the data, we carried out a series of CFAs to test the proposed models (as displayed in Figure 1), their fit indices, and their parsimony. We used several fit indices. The root mean square error of approximation (RMSEA) indicates how well a model fits a population; it should be smaller than 0.08 and ideally smaller than 0.05 (Hair et al., 2010). To explore the precision of the RMSEA, its 90% confidence interval is reported; the interval between the lower and upper bounds should be as narrow as possible (Byrne, 2001). The chi-square (χ2) index compares the implied correlation matrix with the produced correlation matrix; if they are significantly different, the χ2 value is significant. The normed χ2, the ratio of the sample discrepancy (χ2) to the degrees of freedom, was also used; better-fitting models generally have a ratio below 3. It should be noted that the χ2 index is sensitive to sample size: large or small samples may produce significant values. Therefore, other indices have been developed to further examine the fit of the model to the data (Miles & Shevlin, 2007; Steiger, 2007; Hair et al., 2010). We further used (a) the CFI (Comparative Fit Index, an incremental index evaluating the fit of a model relative to a baseline model), (b) the GFI (Goodness of Fit Index, an absolute fit index developed to address the sensitivity of the chi-square index to sample size), and (c) the NNFI (Non-Normed Fit Index, also known as the Tucker-Lewis Index, which is very similar to the CFI and is used to compare the proposed model with the baseline model).
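To make the screening logic explicit, the sketch below checks a set of reported indices against these cutoffs; the values are taken from Table 3 purely for illustration, and the cutoffs are the ones cited above rather than a definitive standard.

```python
def screen_fit(chi2, df, nnfi, cfi, rmsea):
    """Check common cutoffs: normed chi-square < 3, NNFI/CFI >= .95, RMSEA < .08."""
    return {
        "normed chi-square < 3": chi2 / df < 3,
        "NNFI >= .95": nnfi >= 0.95,
        "CFI >= .95": cfi >= 0.95,
        "RMSEA < .08": rmsea < 0.08,
    }

# Five-factor CUM (Table 3): acceptable RMSEA and normed chi-square, weak NNFI/CFI.
print(screen_fit(1548.48, 1165, nnfi=0.43, cfi=0.46, rmsea=0.021))
# Second-order model (Table 3): all four checks pass despite the significant chi-square.
print(screen_fit(1588.36, 1122, nnfi=0.96, cfi=0.96, rmsea=0.021))
```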

In the first attempt, we postulated a five-factor CUM with no correlations among its error terms, as illustrated in the first model in Figure 1. We used the PRELIS application to produce an asymptotic covariance matrix and a matrix of polychoric correlations for the ordinal data (Du Toit & Du Toit, 2001), because the Pearson correlation matrix is not suitable for the factor analysis of dichotomous data (Uebersax, 2006). The assumption behind the matrix of polychoric correlations is that the underlying variable is continuous although the observed data are dichotomous and/or ordinal (Uebersax, 2006). The LISREL software, version 8.8 (Jöreskog & Sörbom, 2006), was then used to construct and test the model (simplified as Model 1 in Figure 1). The five-factor model did not fit the data satisfactorily.

In a post hoc modification stage, we tried to isolate the test methods measuring separate factors in the test. As implied in the MELAB Technical Manual and noted above, the five identified factors are measured by two major test methods: items in sections 1 and 2 of the test represent the short stimuli method and items in section 3 represent the long stimuli method. Therefore, to generate Model 2 in Figure 1, short stimuli items should measure the same factors and long stimuli items should measure different factors. Yet, as Table 2 shows, there is no clear pattern of measurement across long and short items, so Model 2 cannot be constructed; a CUM solution with correlated error terms and factors is therefore not possible.

Next, a second order CFA was performed to investigate whether a major listening factor can cause lower order factors. This model is simplified and displayed as Model 3 in Figure 1. Table 3 summarizes the properties of the CFA models.


Table 3. CFA Models to Confirm the Factor Structure of the MELAB Listening Test

Model                χ2         df     χ2/df   NNFI   CFI    GFI    RMSEA    RMSEA 90% confidence interval
CUM                  1548.48**  1165   1.33    0.43   0.46   0.93   0.021    0.029 to 0.033
2nd order model      1588.36*   1122   1.41    0.96   0.96   0.93   0.021    0.018 to 0.023
Three-factor CFA     1638.88*   1172   1.39    0.96   0.96   0.93   0.021    0.018 to 0.023
Parcel Items CFA     197.52     149    1.32    0.99   0.98   0.99   0.019    0.011 to 0.026
Constraint tenable   Non-sign.  ___    < 3     .95    .95    .95    < 0.08   Narrow interval

Note. n = 585 in the sample. **p < 0.001. *p < 0.01. NNFI: Non-Normed Fit Index. CFI: Comparative Fit Index. GFI: Goodness of Fit Index. RMSEA: Root Mean Square Error of Approximation. df: degrees of freedom. In the CUM model, traits are correlated and error terms are uncorrelated.

According to Table 3, the first five-factor model (CUM) does not fit the data well (NNFI = 0.43; CFI = 0.43; GFI = 0.93; RMSEA = 0.021). This model has a significant χ2 (1548.48) and an acceptable normed χ2 (χ2/df) of 1.33; its CFI and NNFI are very low, although the root mean square error of approximation (RMSEA) is acceptable. A summary of the item loading statistics is available in Appendix 1.

Table 4. Bivariate Correlations of Traits in the CUM

                Min context   Explicit   Close_Para   Enabling   Proposition
Min_context     1.00
Explicit        1.04          1.00
Close_Para      0.94          1.07       1.00
Enabling        0.78          0.97       0.84         1.00
Proposition     0.98          1.07       0.96         0.86       1.00

Note. The CUM model in this table has correlated error terms and correlated traits. Min = minimal. Para = paraphrase. Proposition = propositional inference.

As Table 4 shows, another problem with the CUM model (Model 1) is the emergence of unreasonably high correlations among traits, some greater than 1.00; the correlation matrix in this case is “non-positive definite,” indicating that “the determinant of matrix is zero or the inverse of the matrix [which is used to estimate the parameters] is not possible” (Schumacker & Lomax, 2004, p. 47). Therefore, the solution is not admissible and the parameter estimates are not correct.

In Figure 1, Model 3 presents a simplified higher-order model with fewer items, and Table 3 displays its fit indices (χ2 = 1588.36; NNFI = 0.96; CFI = 0.96; GFI = 0.93; RMSEA = 0.021). This model has good fit indices but a significant χ2 value. The problem observed in this model is the presence of extremely high loadings of the lower-order factors on the higher-order factor (minimal context: 0.96; close paraphrase: 1.01; explicit: 0.85; propositional inference: 1.11; enabling inference: 0.99). This indicates that, like Model 1, the correlation matrix in this case is non-positive definite.



Accordingly, we adopted a compensatory strategy to fit a better CFA model to the data: because the models tested above did not fit the data, we opted for a model based on the structure of the MELAB listening test, which is further explained below.

First Compensatory Strategy

We took another approach to redefine the basis of the CFA model. The models in Figure 1 depend on a more competency-based definition of listening, drawn from the Test Aims paragraph. A task-based theoretical framework highlights the tasks that candidates will encounter in real-life situations while also considering the theory of the construct (Bachman, 2002). According to the MELAB Technical Manual, the MELAB listening test targets three major tasks in its three sections: understanding and responding to (a) unexpected requests, invitations, offers, etc., (b) short conversations, and (c) longer talks or radio interviews; this classification resembles the factor analysis reported in the MELAB Technical Manual (English Language Institute of the University of Michigan, 2003, p. 46).

Based on this classification, we performed a CFA to investigate the fit of the three-factor model to the data. Results are presented in Table 3; the three-factor model has properties similar to the second-order model (χ2 = 1638.88; NNFI = 0.96; CFI = 0.96; GFI = 0.93; RMSEA = 0.021) but does not display the problem of correlation coefficients greater than 1. As shown by the two-headed arrows connecting the latent traits in Figure 2 (MinimCon, minimal context items representing unexpected requests; ShortCon, short conversation items; and LongTalk, longer talks or radio interview items), the three-factor model has acceptable correlations among traits: the correlation indices do not exceed 1.00 and are greater than 0.70 (Hair et al., 2010). The model consisting of three factors (minimal context, short conversations, and long talks) fits the data satisfactorily.


Figure 2. Three-factor Model of the MELAB Listening Test. Oval shapes represent latent traits and rectangles represent the measured variables or items.


Second Compensatory Strategy

The second strategy is based on computing parcel scores, or testlets; the model is presented in Figure 3. Following the MELAB Technical Manual, we constructed “short question odd-numbered items, short question even-numbered items, short conversational exchange odd-numbered items, short conversational exchange even-numbered items, three testlets for the three radio interview sets of items” (MELAB Technical Manual, 2003, p. 47). We summed odd-numbered items, then even-numbered items, and then the five items testing the comprehension of each long radio interview; in this way we built seven aggregate (parcel) items for section 1, nine for section 2, and three for section 3.

Figure 3. Three-factor Model Based on Parcel Scores. Oval shapes are latent traits and rectangles are the measured parcel variables.

Of all the proposed models, the section-based model with testlets (parcel scores, Figure 3) fit the data best (NNFI = 0.99; CFI = 0.99; GFI = 0.98; RMSEA = 0.019). This model shows that the correlation coefficients of the proposed factors are sufficiently high.


Resonant with the MELAB Technical Manual, this observation testifies to the presence of a firm three-factor construct underpinning this version of the MELAB listening test.

Rasch Analysis of the Test

We performed two Rasch-based analyses to investigate the item and person measurement properties. First, we calibrated the 50 items simultaneously (concurrent calibration) and checked the item fit statistics to see whether these items could construct a scale; person and item measures were generated in this analysis. In the second round, we conducted a differential item functioning analysis for gender.

First Rasch analysis. We used the WINSTEPS package, version 3.57 (Linacre, 2005), to fit the Rasch model to the data. Person and item reliability indices were 0.84 and 0.98, respectively, and separation indices were 2.30 and 7.30 for persons and items. The reliability index is evidence for the internal consistency of the person ability and item difficulty measures. Separation values are “the ratio of ‘true’ variance to error variance” (Linacre, 2009, p. 462); separation is another expression of reliability, ranges from 0 to infinity, and indicates the number of performance levels in the test, that is, the heterogeneity of the sample. The item reliability and separation indices point to the ability of the measuring device to establish a similar item hierarchy along the variable in a similar sample from the same population; the item reliability of 0.98 indicates that the item estimates would be reproducible in a similar sample.
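The reported reliability and separation values are consistent with the standard relation between the two statistics, reliability = G²/(1 + G²), where G is the separation index; the check below is our own illustration, not output from WINSTEPS.

```python
import math

def separation_from_reliability(r):
    """Separation index G implied by a reliability estimate R,
    from the relation R = G**2 / (1 + G**2)."""
    return math.sqrt(r / (1 - r))

print(round(separation_from_reliability(0.84), 2))  # about 2.29 (reported person separation: 2.30)
print(round(separation_from_reliability(0.98), 2))  # 7.0 (reported item separation: 7.30)
```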

Next, Rasch item difficulty and person ability measures were computed. Figure 4 is an item-person map (or Wright map), which plots person ability against item difficulty: items are laid out on the right side according to their difficulty measures and test takers on the left according to their ability. The distribution of persons is fairly regular, forming a curve that peaks around the mean. The person ability and item difficulty mean estimates were 0.68 and 0.00, respectively (the item mean was anchored at 0.00 in this analysis, so the person mean is 0.68 logits higher than the anchored item mean). This indicates that the items were relatively easy for this sample of test takers. The SD indices for persons and items were 0.87 and 0.57, respectively. Figure 4 also demonstrates that some of the candidates with greater demonstrated ability did not encounter enough questions that could further distinguish their ability levels (this observation is examined further in Figure 5). As will be discussed below, this inflates the standard error of measurement of the estimated ability measures.

To assess the fit of the Rasch model to the data, we examined infit mean-square statistics (information-weighted mean-squares, which are more sensitive to unexpected responses on items close to persons’ measures) and outfit mean-square statistics (unweighted mean-squares, which are sensitive to outliers). A mean-square (MNSQ) is computed as a chi-square value divided by its degrees of freedom. MNSQ fit indices show useful, as opposed to perfect, fit of the data to the model: an infit MNSQ of, say, 1.2 means that one unit of modeled information is observed along with 0.2 units of unmodeled noise. The t-test significance (ZSTD) is used to investigate the perfect fit of the data to the model (acceptable range: within |2|). In samples larger than about 250, the infit ZSTD tends to exceed |2|; therefore, Linacre (2003) recommended that researchers consult MNSQ indices in large samples to show that the Rasch model fits the data usefully. Another advantage of MNSQ over ZSTD is that as the sample size increases, the power of MNSQ to find discrepancies in the data increases (Linacre, 2003). Bond and Fox (2007) considered 0.6–1.4 an acceptable infit MNSQ range (similar to Linacre’s [2003, 2009] recommendation of 0.5–1.5 for productive measurement).
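The sketch below shows how infit and outfit MNSQ are computed for a single dichotomous item from person abilities, an item difficulty, and the observed responses; the values are invented, and WINSTEPS computes the same statistics from its own estimated measures.

```python
import numpy as np

def item_mnsq(responses, abilities, difficulty):
    """Infit and outfit mean-squares for one dichotomous item."""
    p = np.exp(abilities - difficulty) / (1 + np.exp(abilities - difficulty))
    w = p * (1 - p)                      # model variance of each response
    z2 = (responses - p) ** 2 / w        # squared standardized residuals
    outfit = z2.mean()                   # unweighted: sensitive to outliers
    infit = np.sum(w * z2) / np.sum(w)   # information-weighted
    return infit, outfit

abilities = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
responses = np.array([0, 0, 1, 1, 1, 1])   # an orderly response pattern
print(item_mnsq(responses, abilities, 0.0))
```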


[Figure 4 appears here: the WINSTEPS item-person (Wright) map, with persons plotted on the left by ability and items on the right by difficulty, on a scale from about -2 to +5 logits.]

Figure 4. Rasch Analysis Performed on 50 Items. Each “#” sign represents seven persons in the sample.

Item fit statistics and difficulty measures are summarized in Table 5. The score column expresses the raw score assigned to each item according to the test takers’ performance, and the measure column is the raw score converted into logits (log-odds units). Standard errors (SE) indicate the imprecision of the item locations: the lower the SE, the higher the confidence in the location of the item difficulty measures. Inflated SE indices are observed when there are not enough items to measure people’s ability or when a test is administered to a small sample. According to Table 5, the infit and outfit MNSQ indices fall within an acceptable range (0.6–1.4). This is an important indicator of the absence of erratic responses and of the validity of the scores; that is, as the MNSQ indices show, there may be only a few outliers (low-ability people who unexpectedly answered a difficult item correctly and high-ability people who missed an easy item) affecting the Rasch model. Further, the mean infit and outfit MNSQ statistics of 1.00 for items and .98 for people mirror the average fit of the items to the Rasch model’s expectations.


Table 5. Item Measures and Fit Indices of all Items

Item    Score    Measure    SE    Infit MNSQ    Outfit MNSQ

1 432 .76 .07 1.05 1.04 2 474 .55 .07 1.03 1.04 3 582 -.01 .07 1.02 1.03 4 670 -.52 .08 0.93 .85 5 632 -.29 .08 0.90 .84 6 406 .90 .07 1.13 1.13 7 529 .27 .07 0.96 .93 8 790 -1.46 .10 0.96 .85 9 725 -.90 .09 0.91 .82

10 723 -.88 .09 0.96 .89 11 665 -.49 .08 1.01 1.02 12 439 .73 .07 1.02 1.02 13 500 .42 .07 1.05 1.04 14 479 .52 .07 1.02 1.01 15 677 -.57 .08 0.95 .91 16 486 .49 .07 1.02 1.00 17 454 .65 .07 1.06 1.05 18 536 .23 .07 0.98 .96 19 546 .18 .07 0.96 .91 20 553 .14 .07 1.03 1.05 21 497 .43 .07 1.00 .98 22 557 .12 .07 0.97 .91 23 609 -.16 .07 1.02 .97 24 718 -.85 .09 0.90 .76 25 616 -.20 .08 1.02 1.02 26 359 1.14 .07 1.16 1.22 27 465 .59 .07 1.03 1.06 28 608 -.16 .07 1.02 1.03 29 540 .21 .07 0.95 .95 30 624 -.25 .08 0.89 .79 31 564 .08 .07 0.96 .95 32 621 -.23 .08 0.96 .96 33 686 -.63 .08 0.96 1.08 34 684 -.61 .08 1.01 1.02 35 570 .05 .07 0.99 .96 36 474 .55 .07 1.19 1.26 37 610 -.17 .07 0.93 .85 38 484 .50 .07 1.02 1.01 39 555 .13 .07 1.01 1.00 40 521 .31 .07 1.03 1.00 41 563 .09 .07 1.04 1.09 42 627 -.26 .08 0.97 .98 43 710 -.79 .08 0.92 .78


44 446 .69 .07 1.05 1.11 45 758 -1.16 .09 1.01 1.08 46 695 -.69 .08 0.92 .84 47 585 -.03 .07 1.02 1.04 48 602 -.12 .07 1.07 1.15 49 422 .81 .07 1.07 1.08 50 596 -.09 .07 0.98 .94

Mean    573.3    .00    .08    1.00    .98

Note. n = 916. MNSQ = Mean Square. SE = standard error of measurement.

To examine fit and person ability/item difficulty measures concurrently, we used the bubble chart described by Bond and Fox (2007), which plots measures against fit statistics and displays visually the relationship between ability/difficulty measures and the magnitude of measurement error. Figure 5 displays bubble charts of infit MNSQ statistics plotted against item difficulty measures (upper part) and person ability measures (lower part); all item infit MNSQ statistics are closely distributed around the item fit mean (1.00), indicating good measurement properties. Figure 5 further shows that, as we expected, the standard error (SE) of measurement is especially high for persons located at the top of the hierarchy. The magnitude of the SE is displayed as the size of the circles: the bigger the circle, the higher the SE. The reason for the high SE indices of high-ability persons is that they did not receive enough items corresponding to their ability level. Located at the top of the chart, these high-ability people answered all or most items correctly, so there is not enough information about their ability: even if an individual with a high measured ability can most probably answer all items correctly, no item in the test informs us about the upper boundary of that ability. If these individuals had answered a sufficient number of suitably difficult items, more information about their ability could have been collected. As we move down the person bubble chart, the SE decreases, because lower-ability people received enough items corresponding to their ability, and their ability was therefore estimated with less error.
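The pattern of larger standard errors for extreme measures follows directly from how a Rasch measure’s SE is computed: it is the inverse square root of the test information, the sum of p(1 − p) over the items a person took. The sketch below illustrates this with an item set whose difficulty range is only loosely modeled on Table 5; it is not a reanalysis of the study data.

```python
import numpy as np

def person_se(ability, item_difficulties):
    """Model standard error of a Rasch person measure:
    1 / sqrt(sum of p*(1-p)) over the administered items."""
    p = np.exp(ability - item_difficulties) / (1 + np.exp(ability - item_difficulties))
    return 1 / np.sqrt(np.sum(p * (1 - p)))

items = np.linspace(-1.5, 1.1, 50)        # 50 items spread across a moderate range
print(round(person_se(0.7, items), 2))    # near the centre of the item distribution
print(round(person_se(3.5, items), 2))    # far above the hardest item: larger SE
```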

There was no misfitting person in the sample. According to Pollitt and Hutchinson (1987), if person misfit does not exceed 2% of the data, there is no significant erratic response pattern; we can therefore regard person performance as acceptable, indicating that it accords with the expectations of the model.

To analyze possible patterns or structures in the residuals, we performed a principal component analysis of residuals (PCAR). This analysis demonstrates “contrasts between opposing factors, not loading on one factor,” that is, the contrast between positive and negative loading values (Linacre, 2009, p. 216). PCAR is a test of the unidimensionality of the data set, a prerequisite of Rasch model analysis; unidimensionality holds when test scores are not contaminated by any irrelevant factor and no datum affects any other in the data set (Linacre, 2009). If no structure or pattern is observed in the residuals, the variation in the data that is not explained by the Rasch model is “random noise” (Linacre, 2009). Ideally, the correlation between the random noise of any two items should be zero or very weak.


Figure 5. Bubble Chart Plotting Item and Person Infit MNSQ Statistics against Item and Person Measures. The bubbles for items (upper chart) are consistently small, implying that the SE was small for items. Bubbles representing persons (lower chart) range from large to small, indicating that the ability of more proficient people has been estimated with greater SE.


According to our PCAR, the total variance explained by the measures was 27.7%. This Rasch dimension is similar to the first factor in a principal component analysis of raw data, but it is based on linearized data (Wright, 1996). If the difference between the variance explained by the Rasch dimension and the noise is considerably high, the unidimensionality of the test is supported.

Three weak factors were identified in the residuals. The first extracted 1.7 of 50 eigenvalue units, which is less than 5%; the strength of the Rasch dimension is almost 25 times that of this factor. Linacre (2009) argued that the smallest eigenvalue that can be regarded as a structure or pattern in the residuals is three; the observed value does not reach this benchmark (1.7 < 3). This factor comprises two items, 17 and 25, but, as Table 5 shows, their infit and outfit MNSQ statistics deviate only slightly from 1, indicating that they are as predictable as the model expects. Factors two and three did not extract considerable eigenvalue units. The observed factors are therefore not “contradictory dimensions” (Linacre, 2009), which provides evidence for the unidimensionality of the test (Wright & Stone, 1999). In addition, the analysis of correlations between item residuals showed no significant correlations, which supports the local independence of the items.
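The logic of PCAR can be sketched outside WINSTEPS by standardizing the Rasch residuals and inspecting the eigenvalues of their correlation matrix; the simulation below uses generated data that fit the model (and, for simplicity, the generating abilities and difficulties rather than estimates), so its first residual eigenvalue should stay below Linacre’s benchmark of 3. This is an approximation of the idea, not the exact WINSTEPS procedure.

```python
import numpy as np

def first_residual_eigenvalue(X, abilities, difficulties):
    """Largest eigenvalue of the correlation matrix of standardized
    Rasch residuals for a persons-by-items 0/1 response matrix X."""
    P = np.exp(abilities[:, None] - difficulties[None, :])
    P = P / (1 + P)
    Z = (X - P) / np.sqrt(P * (1 - P))     # standardized residuals
    R = np.corrcoef(Z, rowvar=False)       # item-by-item residual correlations
    return np.linalg.eigvalsh(R).max()

rng = np.random.default_rng(0)
abilities = rng.normal(0.7, 0.9, 500)
difficulties = np.linspace(-1.5, 1.1, 20)
P = np.exp(abilities[:, None] - difficulties[None, :])
P = P / (1 + P)
X = (rng.random(P.shape) < P).astype(float)
print(first_residual_eigenvalue(X, abilities, difficulties))
```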

Second analysis: Testing for invariance. As an additional step in understanding construct threats, we performed a uniform differential item functioning (DIF) analysis to examine possible gender bias. According to Linacre (2009), for DIF to be significant, two criteria should hold: “1. probability so small that it is unlikely that the DIF effect is merely a random accident 2. size so large that the DIF effect has a substantive impact on scores/measures on the test” (p. 148). The minimum noticeable DIF difference for items is 0.5 logits, and the probability of observing DIF in an item should be less than 0.05. Thus, a considerable DIF is not merely a function of significance; the difference should also have statistical substance.

The DIF measures in Table 6 display the difficulty of each item for each gender class; class 1 is male and class 2 is female. For example, the local difficulty of item 1 is 0.88 logits for the male class and 0.65 logits for the female class. A positive DIF contrast indicates that the item is more difficult for the first group, and a negative contrast shows that the item is more difficult for the second group. As Table 6 shows, item 1 is 0.23 logits more difficult for male candidates, whereas item 2 is 0.18 logits more difficult for female candidates (a contrast of -0.18). The Welch t test expresses DIF significance as a two-sided t-statistic; the null hypothesis is that the two DIF estimates are equal, allowing for measurement error. The p value column shows the probability of the t value at the given degrees of freedom (Linacre, 2009). Eight items have significant DIF t-tests (p < 0.05). Items 6, 7, 21, 35, and 44 are more difficult for the female class (male candidates are more able on these items), and items 39 and 43 are more difficult for the male class (female candidates are more able on these items).
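The DIF contrast and its test statistic can be reproduced from the two groups’ local difficulty estimates and their standard errors; the sketch below uses item 44’s difficulties from Table 6 but invented standard errors, since the group-level SEs are not reported here, and the simple ratio shown only approximates the Welch t that WINSTEPS reports.

```python
import math

def dif_contrast(d_group1, se_group1, d_group2, se_group2):
    """DIF contrast (difference in local difficulties) and an approximate
    t statistic: the contrast divided by its joint standard error."""
    contrast = d_group1 - d_group2
    t = contrast / math.sqrt(se_group1 ** 2 + se_group2 ** 2)
    return contrast, t

# Item 44: local difficulty 0.35 logits for males, 1.01 for females (Table 6).
# The standard errors are illustrative, not values from the study.
print(dif_contrast(0.35, 0.10, 1.01, 0.11))   # negative contrast: harder for the female group
```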

The observed DIF needs to be investigated further to ascertain whether it is a construct issue. If there is strong evidence that the observed DIF concerns known construct issues, the item would most probably be retained in future administrations of the test. In the present analysis, because the DIF is not balanced out across three items, the observed DIF, as we view it, attenuates the construct validity argument of the test. However, the effect of four DIF items is balanced out, which supports the construct validity of the test.


Table 6. Gender Differential Item Functioning in the MELAB Listening Test

DIF measure 1    DIF measure 2    DIF contrast    Welch t    p    Item

0.88 0.65 0.23 1.52 .129 1 0.48 0.65 -0.18 -1.18 .237 2 0.16 -0.11 0.27 1.77 .076 3

-0.56 -0.51 -0.05 -0.30 .764 4 -0.43 -0.16 -0.27 -1.72 .086 5 0.72 1.04 -0.31 -2.11 .035 6* 0.01 0.50 -0.49 -3.27 .001 7*

-1.38 -1.50 0.12 0.58 .560 8 -0.86 -1.00 0.13 0.73 .468 9 -0.83 -0.93 0.10 0.55 .580 10 -0.39 -0.66 0.28 1.68 .093 11 0.73 0.74 -0.01 -0.04 .966 12 0.35 0.48 -0.13 -0.85 .398 13 0.42 0.58 -0.16 -1.05 .294 14

-0.71 -0.49 -0.22 -1.32 .187 15 0.41 0.57 -0.16 -1.05 .293 16 0.74 0.55 0.19 1.27 .203 17 0.31 0.11 0.20 1.34 .179 18 0.07 0.31 -0.24 -1.58 .114 19 0.15 0.14 0.01 0.06 .955 20 0.28 0.64 -0.37 -2.46 .014 21* 0.01 0.20 -0.19 -1.24 .217 22

-0.10 -0.17 0.07 0.45 .654 23 -0.88 -0.81 -0.07 -0.41 .681 24 -0.18 -0.24 0.06 0.35 .724 25 1.21 1.08 0.13 0.85 .398 26 0.62 0.62 0.00 0.01 .991 27

-0.25 -0.08 -0.16 -1.03 .302 28 0.15 0.30 -0.15 -0.97 .331 29

-0.18 -0.34 0.15 0.97 .331 30 -0.03 0.16 -0.19 -1.25 .212 31 -0.20 -0.31 0.12 0.74 .461 32 -0.55 -0.68 0.13 0.76 .445 33 -0.58 -0.66 0.09 0.51 .609 34 -0.13 0.22 -0.35 -2.32 .020 35* 0.64 0.52 0.12 0.82 .414 36

-0.25 -0.10 -0.15 -0.96 .338 37 0.58 0.35 0.23 1.53 .125 38 0.40 -0.13 0.53 3.51 .001 39* 0.47 0.19 0.28 1.88 .060 40 0.22 0.02 0.20 1.33 .184 41

-0.23 -0.30 0.07 0.42 .675 42 -0.64 -0.98 0.34 1.96 .050 43*


0.35 1.01 -0.66 -4.40 .000 44* -1.05 -1.30 0.25 1.28 .200 45 -0.71 -0.68 -0.03 -0.18 .854 46 -0.04 0.04 -0.08 -0.51 .608 47 -0.18 -0.16 -0.03 -0.18 .856 48 1.04 0.55 0.48 3.23 .001 49* 0.05 -0.16 0.20 1.31 .191 50

Note. DIF measure 1 is the local difficulty of each item for male participants and DIF measure 2 is this index for female participants. The “*” sign means that the item has significant DIF.

Weighing the Evidence for the Validity of MELAB

To sum up the findings of the current study, we use Chapelle’s (1994) table format to display the evidence supporting or attenuating the validity of test score interpretations. Table 7 presents two groups of evidence. Evidence supporting construct validity consists of the results of the reliability analysis (indices above .70); the content analysis, which identified the factors and skills stated in the MELAB Technical Manual; the CFA supporting the factor structure of the test; the Rasch measures, fit statistics, reliability indices, and PCAR, which supported the absence of construct-irrelevant factors; and the invariance analysis, which showed that the majority of items functioned similarly across gender subgroups. On the other hand, the reliability indices smaller than .70, the failure of the CUM and higher-order models, and DIF in three items attenuate the construct validity of the test.

Table 7. Evidence Supporting and Attenuating the Construct Validity Argument of the MELAB Listening Test

Evidence supporting construct validity:
1) KR-21 analysis (above .70)
2) Content analysis
3) CFA
4) Rasch measures in 50-item analysis
5) Infit and outfit in 50-item analysis
6) PCAR with 50 items
7) Rasch reliability indices
8) Invariance analysis: four DIF items are balanced out

Evidence against construct validity:
1) KR-21 analysis (below .70)
2) DIF not balanced out in three items


According to Table 7, the argument for construct validity of the MELAB test is supported by more evidence than it is attenuated by counterevidence.

Discussion

First Objective: Factor Structure of the Test

To examine the features of the hypothesized three- and five-factor models and the cause of variation in the items, we performed a CUM analysis and a second-order CFA to build the construct map, or factor structure, of the test. The five-factor model took a competency-based approach to the test and was not supported. Because the proposed five-factor model has theoretical support in the literature and in the MELAB Technical Manual, we argue that future research should address this area with other test forms.

If item correlations are erratic, factors may not be successfully separated, and the expected patterns will not emerge in factor analysis. In the five-factor CFA model, we assumed that in answering some items test takers rely on their ability to comprehend explicitly stated information and in others on their ability to comprehend implicitly stated information. However, as Wagner (2004) argued and as this study showed, separating these two skills may not produce optimal results or models in measurement. Even in Kintsch and van Dijk’s (1983) model of comprehension, these two processes take place simultaneously. Therefore, we suggest that this dichotomy may be only an artefact.

The main hurdle to performing the CUM analysis was that the traits were not measured by two or more common methods. As a compensatory strategy, we posited a three-factor, task-based model according to the test sections. We thus “moved from a strictly confirmatory mode to an exploratory mode…to arrive at a model that would provide a reasonable explanation for the correlations among their variables” (Bachman, 2004, p. 285); we revised the CUM model and hypothesized that the underlying factors (the abilities to understand minimal context stimuli, short conversations, and longer radio interviews) are separable and cause the observed variation in the data. This model had acceptably good fit and provided good support for the causation of test behavior by the three hypothesized latent traits: that all items loaded significantly onto the posited latent traits, as the significant path coefficients showed, indicates that the variance in the items is significantly accounted for by the latent traits and that the latent traits are the cause of the indicators (Hair et al., 2010). The analysis further showed that the correlations among these traits were significantly high.

The question of the separability of listening traits has been addressed in previous research. For example, Liao (2007) reported a correlation coefficient of 0.97 between explicit and implicit listening factors in a CFA study, stating that “these two factors are closely interrelated, but still not identical” (p. 60). Such a conclusion is at variance with the common school of thought: correlation coefficients above .80 are considered indicative of substantial similarity among the hypothesized factors and of their inseparability (Hair et al., 2010). We argue that the significantly high correlations of the three traits can, in turn, be evidence of the concurrent occurrence of local and general comprehension strategies when test takers answer these items (van Dijk & Kintsch, 1983); the model fitting the data illustrates this relationship. The results show that comprehension is a complex and intertwined process and that attempts to separate comprehension stages and skills may not be completely successful (see Bae and Bachman’s [1998] study, in which the separability of listening and reading traits, as two major and distinct skills, is not clearly established).


More recently, Borsboom, Mellenbergh, and Van Heerden (2004) redefined validity and argued that high correlations between different traits do not necessarily point to invalidity. According to Borsboom et al., the argument that significantly high correlations in CFA imply the presence of the same trait leads the researcher into murky waters. They point out that:

For instance, suppose one is measuring the presence of thunder. The readings will probably show a perfect correlation with the presence of lightning. The reason is that both are the results of an electrical discharge in the clouds. However, the presence of thunder and the presence of lightning are not the same thing under a different label. They are strongly related—one can be used to find out about the other—and there is a good basis for prediction, but they are not the same. (p. 1066; emphasis added)

By the same token, while the three hypothesized traits in the present study (the abilities to understand minimal context stimuli, short conversations, and longer radio interviews) caused the variation in the scores, as shown by significant regression weights, the significantly high correlation coefficients among the traits (or factors) do not testify to the presence of identical traits. They are different, as lightning and thunder are, but also highly correlated, as lightning and thunder readings are. A more important observation is that the hypothesized traits were found to cause a great amount of the variation in scores.

That the arrows in Figures 2 and 3 move from the latent variables to the items indicates that the variance in the items is mainly caused by the traits, which is significant evidence of the validity of the hypothesized traits (Borsboom et al., 2004). This observation carries an important implication: hypothetically, neither textual competence nor functional knowledge introduces measurable construct-irrelevant variance to measurement in the minimal context items, because both are principal components of the postulated trait; a reasonable assumption would then be that minimal context items measure textual competence and functional knowledge. Yet, while they tap the intended construct, minimal context items belong to an older generation of listening items known as discrete-point items (Buck, 2001). The discrepancy between our findings, which are generally in favor of these items, and the mainstream literature, which highlights the reduction of context and communication as a shortcoming of these items, is worth further investigation: it seems that to answer the minimal context items, candidates use their prior knowledge and, more importantly, activate their textual competence, including vocabulary, syntax, and phonology (Bachman, 1990), and their functional knowledge (MELAB Technical Manual, 2003). It is important to determine the extent to which a candidate’s inability to comprehend and respond correctly is caused by an inability to understand the meaning of phrases or certain lexical items and their lexico-grammatical relationships, as opposed to the lack of a context. For example, Goh (2000) showed that some EFL learners can hear the words exactly and match them to sounds and words in their mental lexicon (recognition) but may still not be able to understand the prompt or stimulus. This question about minimal context items, we believe, should be researched further.

The second item type entails short conversations. From a competency-based viewpoint, these items are intended to measure the ability to comprehend explicitly and implicitly stated information. From a task-based viewpoint, the items measure the ability to comprehend messages in short daily conversations when the listener is involved in such transactions. The former delineation assumes two dimensions of listening comprehension, whereas the latter hypothesizes a broader and less clearly partitioned construct.


When completing a task that is mainly based on listening comprehension skills, an interlocutor may use the ability to understand both explicitly and implicitly stated information, but the two happen at the same time (Kintsch & van Dijk, 1978), and separating them is not a sound practice, although some researchers, such as Shohamy and Inbar (1991), have asserted that items measuring these two skills must always be present in any test of listening comprehension. We should not always expect two clearly separate dimensions that function on their own. Because the content analysis supported Shohamy and Inbar’s (1991) hypothesis, which resonates with a competency-based approach, but the trait was not divisible in the present study (see also Hansen & Jensen, 1994; Buck, 2001; Wagner, 2004), we tentatively propose that the variation in the conversation items in the present study is attributable to a more general trait, the ability to understand short context conversations, with the subskills functioning together to answer the items. Further research will be needed to elaborate on the connection between this general ability and its subskills.

The third section of the test targets the ability to comprehend longer interviews. The content analysis in our study showed that these items tap two skills: understanding explicitly articulated information and making close paraphrases. We presented evidence that variation in the items is attributable to a general trait, which we refer to as the ability to comprehend lengthy pieces of discourse, e.g., longer interviews and/or talks. This representation of the trait includes both task and competence features, and would be more properly construed as a task-based definition (Bachman, 2002).

In this light, we propose a tentative model with three correlated factors to explain the structure of the form of the MELAB listening test analyzed in this study (see Figure 6).

Figure 6 is a construct map, or factor structure, which displays the traits and the manifest variables (items). Now that such a map has been proposed, future research can further evaluate its validity and reliability. In this model, single-headed arrows link each of the three latent traits to the manifest (measured) variables that operationalize it, and double-headed arrows represent the correlations among the traits. Three constructs therefore presumably cause the responses and the variation in them. For example, the ability to understand minimal context stimuli, as a latent trait, is measured by the items that operationalize this trait; these items target textual competence and functional knowledge, as proposed in the construct map of the trait. On the whole, the results point to three clear factors in the test, as defined by the sections of the test structure itself.
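To make the structure in Figure 6 concrete, the construct map can be written as a standard three-factor confirmatory measurement model. The sketch below uses generic LISREL-style notation (x for the observed item scores, ξ for the three latent listening traits, Λ for the loadings, δ for the item uniquenesses, Φ for the trait correlations); the symbols are illustrative conventions rather than quantities reported in this study.

```latex
% x: vector of observed item scores; xi: the three latent listening traits
% Lambda: loading matrix (single-headed arrows from traits to items)
% delta: item uniquenesses; Phi: trait correlation matrix (double-headed arrows)
% Theta_delta: covariance matrix of the uniquenesses
\begin{aligned}
  \mathbf{x} &= \boldsymbol{\Lambda}\,\boldsymbol{\xi} + \boldsymbol{\delta},\\
  \operatorname{Cov}(\boldsymbol{\xi}) &= \boldsymbol{\Phi}, \qquad
  \operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^{\top}
    + \boldsymbol{\Theta}_{\delta}.
\end{aligned}
```

Under this representation, each item loads on exactly one of the three traits, and the double-headed arrows in the figure correspond to the off-diagonal elements of Φ.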

Second Objective: Construct Threats and Underrepresentation

We performed two Rasch-based analyses to fulfill the second objective. The first analysis showed no misfit according to the fit MNSQ indices. When unidimensionality and local independence hold, the fact that all items fit the Rasch model supports "item function validity" (Wright & Stone, 1999). Item function validity (IFV) concerns the integrity of items and their functions: whether, and how much, their functioning agrees with or deviates from the expectations of the Rasch model. IFV thus assures the good measurement properties of the items in terms of their consistency with the model. In our study, items functioned according to the expectations of the model; in other words, high ability test takers answered easy items correctly, while low ability test takers failed difficult items. This provides evidence for the absence of construct-irrelevant factors in the items, because erratic response patterns would be a function of a trait other than the hypothesized one.
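As an illustration of what the fit MNSQ indices quantify, the sketch below computes infit and outfit mean squares for a dichotomous Rasch model from a person-by-item matrix of scored responses. The function and variable names are ours, and the snippet stands in for, rather than reproduces, the WINSTEPS computation used in this study.

```python
# Illustrative sketch (not the authors' WINSTEPS analysis): infit and outfit
# mean-square statistics for a dichotomous Rasch model. `responses` is a
# persons x items matrix of 0/1 scores; `theta` (person abilities) and
# `delta` (item difficulties) are in logits. All names are hypothetical.
import numpy as np

def rasch_fit_mnsq(responses, theta, delta):
    # Model-expected probability of success for every person-item pair
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
    w = p * (1.0 - p)                      # binomial variance of each response
    z2 = (responses - p) ** 2 / w          # squared standardized residuals
    outfit = z2.mean(axis=0)               # unweighted mean square, per item
    infit = ((responses - p) ** 2).sum(axis=0) / w.sum(axis=0)  # information-weighted
    return infit, outfit

# Values near 1.0 are consistent with Rasch expectations; values well above 1.0
# flag erratic response patterns of the kind discussed above.
```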


Figure 6. The Operationalized Listening Comprehension Model in the MELAB Listening Test. This model is a tentative construct map of the test form analyzed in this study.

The principal component analysis of residuals (PCAR) showed that there was no substantive factor in the residuals, although two items loaded weakly onto the first factor identified in the residuals. This observation lends support to the response validity (RV) of the test, which is determined from the observed differences between a response set and our expectations, that is, the residuals or random noise (Wright & Stone, 1999). Large residuals are observed when lower ability persons answer a difficult item unexpectedly or when higher ability persons fail to answer easy items. That we found support for RV means that the Rasch dimension is dominant in the data and there is no conspicuous dimension beside it (Linacre, 2009). Therefore, both high and low ability students' performance resonates with the Rasch model expectations. One implication is that cheating, miskeyed data, fatigue, and environmental facets such as temperature and familiarity with personnel (Bachman, 1990) did not contaminate the measurement. Taken as a whole, RV and IFV support the validity of the test scores' meaning by providing evidence against construct-irrelevant variance.
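A minimal sketch of the logic behind PCAR, again with hypothetical inputs: standardized Rasch residuals are computed, their inter-item correlation matrix is decomposed, and the size of the first residual eigenvalue indicates whether a secondary dimension is present. This mirrors the logic of, but is not, the WINSTEPS routine.

```python
# Illustrative sketch of a principal component analysis of Rasch residuals
# (PCAR); `responses`, `theta`, and `delta` are hypothetical arrays as in
# the previous sketch (persons x items scores, abilities, difficulties).
import numpy as np

def pcar_eigenvalues(responses, theta, delta):
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
    std_resid = (responses - p) / np.sqrt(p * (1.0 - p))   # standardized residuals
    corr = np.corrcoef(std_resid, rowvar=False)            # inter-item residual correlations
    return np.sort(np.linalg.eigvalsh(corr))[::-1]         # eigenvalues, largest first

# A small first eigenvalue (commonly below about 2 in Rasch practice) suggests
# no substantive dimension in the residuals beyond the Rasch dimension.
```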


Another observation from the Rasch measurement is that the item mean was slightly below the person mean, and items were skewed downward on the item-person map. This indicates that the test was relatively easy for this sample. According to the item-person map, there are few items suitable for measuring the many persons with ability measures greater than 1 logit; the test therefore displays a ceiling effect. The consequence of the ceiling effect is that higher ability individuals are underestimated, whereas lower or intermediate ability candidates tend to be overestimated. This may introduce some degree of construct underrepresentation into the MELAB listening test.

As a rule of thumb, adding more difficult items to the test may resolve the ceiling effect by revealing the true abilities of candidates in relation to the targeted listening skills. This recommendation would be useful if the intent of the testing centre is to obtain detailed trait levels for candidates, particularly those with better overall listening abilities. However, if the aim (and we believe this may be the case) is to distinguish candidates who are able to perform satisfactorily against a set of minimal requirements for entry to institutions of higher learning, then the MELAB listening test has sufficient validity to make this distinction.

We performed the second Rasch analysis to explore the invariance of the scores. Invariance, or the absence of DIF, supports generalizing from observed test results to expected scores (Aryadoust, 2009). However, observed DIF in gender and other person factors is not always explicable, either in terms of the item structure or in terms of what is known about the population. DIF may be observed in one item while other, similar items display none; Geranpayeh and Kunnan (2007) reported this "mysterious" DIF in a study of a Cambridge English exam. If we considered the content of such items to be the major cause, we would expect to observe the same phenomenon across all similar items targeting the same trait. Analyzing the items that displayed gender DIF in the present study, we found that they neither lacked any feature that would have affected students' performance nor possessed an extra feature that would have done so. Moreover, DIF does not cause measurement problems if items biasing one group are balanced out by items biasing another group. In our study, four items balance each other out, whereas three items are more difficult for females. The MELAB listening test has also managed to keep some construct-irrelevant factors, such as the skill needed to read items and response options, to a minimum. This makes the task of interpreting the DIF more complex, because it is likely to be caused by a confounding variable.
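For readers who wish to probe such items further, the sketch below illustrates one common DIF check, the Mantel-Haenszel procedure, which differs from the Rasch-based DIF contrast reported in this study; the inputs and flagging threshold in the comments are illustrative assumptions, not our data.

```python
# Illustrative sketch of a gender-DIF check for one dichotomous item using the
# Mantel-Haenszel common odds ratio; all names and inputs are hypothetical.
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """item: 0/1 responses to the studied item; total: matching total scores;
    group: 'F' (focal) or 'R' (reference) labels; all 1-D numpy arrays."""
    num, den = 0.0, 0.0
    for k in np.unique(total):                          # stratify by total score
        s = total == k
        a = np.sum((group[s] == 'R') & (item[s] == 1))  # reference, correct
        b = np.sum((group[s] == 'R') & (item[s] == 0))  # reference, incorrect
        c = np.sum((group[s] == 'F') & (item[s] == 1))  # focal, correct
        d = np.sum((group[s] == 'F') & (item[s] == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    alpha = num / den                                   # common odds ratio
    return -2.35 * np.log(alpha)                        # ETS delta-scale DIF index

# Absolute delta values of roughly 1.5 or more are conventionally flagged as sizable DIF.
```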

Conclusion

Validation does not always provide a definitive "yes" or "no" answer to validity inquiries (Chapelle, 1994). It is a dynamic process that never ends but develops as the science of measurement improves (Kane, 2004). The present study set out to determine the construct map of one form of the MELAB listening test and to examine construct underrepresentation and construct-irrelevant threats. The validity of the MELAB listening test is supported by a considerable amount of evidence: multiple strands of evidence from the reliability and content analyses, CFA, and the Rasch model clearly support the construct validity argument, although part of the reliability analysis and the DIF analysis do not.

The study also showed the efficacy of CFA as a latent trait model for investigating the causes of test behavior and proposing the construct maps underpinning a test. It likewise showed the efficiency of the Rasch model in investigating construct underrepresentation and construct-irrelevant factors.


The main limitation of this study is that it examined only one form of the listening test. Although forms of standardized tests are designed to be parallel, the claims in this study pertain to the test form that we examined. A replication of this study using other samples of participants and other forms of the listening test will help deepen our understanding of some of the issues identified here.

As noted earlier, a future step in this area of research could be the examination of the influence of candidates' functional knowledge and textual competence on their test performance. Such an investigation would help us understand the validity of minimal context items as a way of measuring listening comprehension. Further, although it was confirmed that the dichotomy between the comprehension of explicit and implicit information is an artefact, it is important to study the effect of these test objectives on item difficulty in future research. We therefore propose two further validation inquiries:

(a) What is the status of construct representation and construct irrelevant variance in other MELAB listening test versions?
(b) How does the objective of the item (testing the comprehension of explicit or implicit information) affect item difficulty?

It is hoped that this study has provided some useful insights into the issues surrounding the examination of the construct validity of the MELAB listening test.

References

Aryadoust, S. V. (2009). Mapping the Rasch-based measurement onto the argument-based validity framework. Rasch Measurement Transactions, 23(1), 1192–1193.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Bae, J., & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testing factorial invariance across two groups of children in the Korean/English two-way immersion program. Language Testing, 15(3), 380–414.
Bejar, I., Douglas, D., Jamieson, J., Nissan, S., & Turner, J. (2000). TOEFL 2000 listening framework: A working paper (TOEFL Research Report No. RM-00-07, TOEFL-MS-19). Princeton, NJ: Educational Testing Service.
Bond, T. G. (2003). Validity and assessment: A Rasch measurement perspective. Methodologia de las Ciencias del Comportamiento, 5(2), 179–194.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. London: Lawrence Erlbaum Associates.
Bonk, W. J. (2000). Second language lexical knowledge and listening comprehension. International Journal of Listening, 14, 14–31.
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071.
Brindley, G. (1998). Assessing listening abilities. Annual Review of Applied Linguistics, 18(1), 171–191.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: The Guilford Press.
Buck, G. (1990). The testing of second language listening comprehension. Unpublished PhD dissertation, University of Lancaster.
Buck, G. (1991). The testing of second language listening comprehension: An introspective study. Language Testing, 8(1), 67–91.
Buck, G. (1992). Listening comprehension: Construct validity and trait characteristics. Language Learning, 42(3), 313–357.
Buck, G. (1994). The appropriacy of psychometric measurement models for testing second language listening comprehension. Language Testing, 11(2), 145–170.
Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press.
Byrne, B. M. (2001). Structural equation modeling with AMOS. Mahwah, NJ: Lawrence Erlbaum Associates.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(1), 81–105.
Campbell, D. T., & O'Connell, E. J. (1967). Methods factors in multitrait-multimethod matrices: Multiplicative rather than additive? Multivariate Behavioral Research, 2(3), 409–426.
Chapelle, C. (1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research, 10(2), 157–187.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Test score interpretation and use. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 1–25). New York: Routledge.
Conway, J. M., Lievens, F., Scullen, S. E., & Lance, C. E. (2004). Bias in the correlated uniqueness model for MTMM data. Structural Equation Modeling, 11(4), 535–559.
Cudeck, R. (1988). Multiplicative models and MTMM matrices. Journal of Educational Statistics, 13(2), 131–147.
Dijk, T. A. van, & Kintsch, W. (1983). Strategies of discourse comprehension. New York: Academic Press.
Du Toit, M., & Du Toit, S. (2001). Interactive LISREL: User's guide. Lincolnwood, IL: Scientific Software International.
Dunkel, P., Henning, G., & Chaudron, C. (1993). The assessment of a listening comprehension construct: A tentative model for test specification and development. Modern Language Journal, 77(2), 180–191.
English Language Institute, University of Michigan. (2003). Michigan English Language Assessment Battery technical manual 2003. Ann Arbor: English Language Institute, University of Michigan.
English Language Institute, University of Michigan. (2009). MELAB information and registration bulletin. Ann Arbor: English Language Institute, University of Michigan.
Eom, M. (2008). Underlying factors of MELAB listening construct. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6(1), 77–94.
Farhady, H. (1983). On the plausibility of the unitary language proficiency factor. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 11–28). Rowley, MA: Newbury House.
Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4(2), 190–222.
Glenn, E. (1989). A content analysis of fifty definitions of listening. Journal of the International Listening Association, 3(1), 21–31.
Goh, C. (2000). A cognitive perspective on language learners' listening comprehension problems. System, 28(1), 55–75.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (8th ed.). New Jersey: Pearson Education.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17–27.
Hansen, C., & Jensen, C. (1994). Evaluating lecture comprehension. In J. Flowerdew (Ed.), Academic listening: Research perspectives (pp. 241–268). Cambridge: Cambridge University Press.
Hildyard, A., & Olson, D. (1978). Memory and inference in the comprehension of oral and written discourse. Discourse Processes, 1(1), 91–107.
Jöreskog, K. G., & Sörbom, D. (2006). LISREL 8.8 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.
Kane, M. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21(1), 31–41.
Kane, M. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2(3), 135–170.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kenny, D. A. (1976). An empirical application of confirmatory factor analysis to the multitrait-multimethod matrix. Journal of Experimental Social Psychology, 65(4), 507–516.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 363–394.
Kirsch, I. S., & Guthrie, J. T. (1980). Construct validity of functional reading tests. Journal of Educational Measurement, 17(2), 81–93.
Lance, C. E., Noble, C. L., & Scullen, S. E. (2002). A critique of the correlated trait-correlated method and correlated uniqueness models for multitrait-multimethod data. Psychological Methods, 7(2), 228–244.
Liao. (2007). Investigating the construct validity of the grammar and vocabulary section and the listening section of the ECCE: Lexico-grammatical ability as a predictor of L2 listening ability. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 5, 37–78.
Lievens, F., & Conway, J. M. (2001). Dimension and exercise variance in assessment center scores: A large-scale evaluation of multitrait-multimethod studies. Journal of Applied Psychology, 86, 1202–1222.
Linacre, J. M. (2003). Size vs. significance: Standardized chi-square fit statistic. Rasch Measurement Transactions, 17(1), 918. Retrieved May 4, 2009, from http://www.rasch.org/rmt/rmt171n.htm
Linacre, J. M. (2005). WINSTEPS Rasch measurement [Computer program]. Chicago: Winsteps.com.
Linacre, J. M. (2009). A user's guide to WINSTEPS. Chicago: Winsteps.com.
Marsh, H. W. (1989). Confirmatory factor analysis of multitrait-multimethod data: Many problems and a few solutions. Applied Psychological Measurement, 13, 335–361.
Marsh, H. W., & Hocevar, D. (1988). A new, more powerful approach to multitrait-multimethod analyses: Application of second-order confirmatory factor analysis. Journal of Applied Psychology, 73, 107–117.
McNamara, T. (1996). Measuring second language performance. New York: Longman.
Meccarty, F. (2000). Lexical and grammatical knowledge in reading and listening comprehension by foreign language learners of Spanish. Applied Language Learning, 11, 323–348.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). New York: American Council on Education and Macmillan.
Miles, J., & Shevlin, M. (2007). A time and a place for incremental fit indices. Personality and Individual Differences, 42(5), 869–874.
Nation, I. S. P., & Newton, J. (2009). Teaching ESL/EFL listening and speaking. New York: Routledge.
Oller, J. W., Jr. (1978). How important is language proficiency to IQ and other educational tests? In J. W. Oller, Jr., & K. Perkins (Eds.), Language in education: Testing the test (pp. 1–16). Rowley, MA: Newbury House.
Oller, J. W., Jr. (1979). Language tests at school. London: Longman.
Oller, J. W., Jr. (1983). Evidence for a general proficiency factor: An expectancy grammar. In J. W. Oller, Jr. (Ed.), Issues in language testing research (pp. 3–10). Rowley, MA: Newbury House.
Oller, J. W., Jr., & Hinofotis, F. B. (1980). Two mutually exclusive hypotheses about second language ability: Indivisible or partly divisible competence. In J. W. Oller, Jr., & K. Perkins (Eds.), Research in language testing (pp. 13–23). Rowley, MA: Newbury House.
Pallant, J. (2007). SPSS survival manual: A step by step guide to data analysis using SPSS for Windows. New York: McGraw-Hill/Open University Press.
Pollitt, A., & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial credit analysis of performance in writing. Language Testing, 4(1), 72–92.
Raykov, T., & Marcoulides, G. A. (1999). On desirability of parsimony in structural equation model selection. Structural Equation Modeling: A Multidisciplinary Journal, 6(3), 292–300.
Richards, J. C. (1983). Listening comprehension: Approach, design, procedure. TESOL Quarterly, 17(2), 219–239.
Rost, M. (1990). Listening in language learning. New York: Longman.
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5–30.
Schmitt, N., & Stults, D. M. (1986). Methodology review: Analysis of multitrait-multimethod matrices. Applied Psychological Measurement, 10(1), 1–22.
Scholz, G., Hendricks, D., Spurling, R., Johnson, M., & Vandenburg, L. (1980). Is language ability divisible or unitary? A factor analysis of 22 English language proficiency tests. In J. W. Oller, Jr., & K. Perkins (Eds.), Research in language testing (pp. 24–33). Rowley, MA: Newbury House.
Schumacker, R. E., & Lomax, R. G. (2004). A beginner's guide to structural equation modeling. Mahwah, NJ: Lawrence Erlbaum.
Shin, S. (2008). Examining the construct validity of a web-based academic listening test: An investigation of the effects of response formats. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6(1), 95–129.
Shohamy, E., & Inbar, O. (1991). Construct validation of listening comprehension tests: The effect of text and question type. Language Testing, 8(1), 23–40.
Steiger, J. H. (2007). Understanding the limitations of global fit assessment in structural equation modeling. Personality and Individual Differences, 42(5), 893–898.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association.
Tomás, J. M., Hontangas, P. M., & Oliver, A. (2000). Linear confirmatory factor models to evaluate multitrait-multimethod matrices: The effects of number of indicators and correlation among methods. Multivariate Behavioral Research, 35(4), 469–499.
Tsui, A. B. M., & Fullilove, J. (1998). Bottom-up or top-down processing as a discriminator of L2 listening performance. Applied Linguistics, 19(4), 432–451.
Uebersax, J. S. (2006). The tetrachoric and polychoric correlation coefficients. Retrieved April 15, 2010, from the Statistical Methods for Rater Agreement website: http://john-uebersax.com/stat/tetra.htm
Wagner, E. (2002). Video listening tests: A pilot study. Working Papers in TESOL and Applied Linguistics, Teachers College, Columbia University, 2(1).
Wagner, E. (2004). A construct validation study of the extended listening sections of the ECPE and MELAB. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 2, 1–23.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9(1), 1–26.
Widaman, K. F. (1992). Multitrait-multimethod models in aging research. Experimental Aging Research, 18(2), 185–201.
Widaman, K. F. (2002). To parcel or not to parcel: Exploring the question, weighing the merits. Structural Equation Modeling, 9(2), 151–173.
Wilson, M. (2005). Constructing measures: An item response modeling approach. New York: Psychology Press.
Wright, B. (1996). Local dependency, correlations, and principal components. Rasch Measurement Transactions, 10(3), 509–511.
Wright, B., & Stone, M. (1979). Best test design. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1999). Measurement essentials (2nd ed.). Wilmington, DE: Wide Range.

Acknowledgements

We gratefully thank the Spaan Fellowship program for supporting this research.


Appendix 1

Items   Minimal context   Explicit information   Close paraphrase   Propositional   Enabling   Error R

V1 0.15 0.20 0.095 V2 0.15 0.20 0.099 V3 0.14 0.19 0.10 V4 0.17 0.14 0.18 V5 0.21 0.14 0.23 V6 0.097 0.21 0.042 V7 0.21 0.17 0.20 V8 0.11 0.078 0.13 V9 0.16 0.12 0.18

V10 0.14 0.11 0.14 V11 0.12 0.16 0.086 V12 0.17 0.19 0.13 V13 0.13 0.20 0.081 V14 0.16 0.20 0.12 V15 0.17 0.13 0.17 V16 0.15 0.20 0.098 V17 0.12 0.20 0.072 V18 0.17 0.19 0.14 V19 0.19 0.18 0.16 V20 0.14 0.19 0.097

V21 0.17 0.19 0.13

V22 0.17 0.19 0.13 V23 0.14 0.16 0.11 V24 0.17 0.11 0.20 V25 0.11 0.17 0.072 V26 0.066 0.20 0.021

V27 0.15 0.19 0.10

V28 0.14 0.18 0.099 V29 0.19 0.19 0.16 V30 0.23 0.14 0.26 V31 0.19 0.16 0.19 V32 0.18 0.16 0.17 V33 0.13 0.14 0.11 V34 0.12 0.16 0.089 V35 0.17 0.19 0.13 V36 0.045 0.22 0.0093 V37 0.19 0.16 0.19 V38 0.15 0.19 0.11 V39 0.14 0.19 0.098 V40 0.16 0.19 0.12 V41 0.13 0.19 0.079 V42 0.16 0.16 0.15 V43 0.17 0.12 0.18 V44 0.16 0.19 0.11 V45 0.075 0.11 0.047 V46 0.20 0.12 0.25 V47 0.16 0.18 0.12 V48 0.14 0.17 0.10 V49 0.15 0.20 0.11 V50 0.18 0.17 0.17


Spaan Fellow Working Papers in Second or Foreign Language Assessment, Volume 8: 69–94. Copyright © 2010.
English Language Institute, University of Michigan. www.lsa.umich.edu/eli/research/spaan

Expanding a Second Language Speaking Rating Scale for Instructional and Assessment Purposes

Kornwipa Poonpon
Northern Arizona University

ABSTRACT  Recently mandated oral language assessments in English language classrooms have prompted teachers to use published rating scales to assess their students' speaking ability. The application of these published rating scales can cause problems because they are often so broad that they cannot capture students' improvement during a course. To bridge the gap between teachers' testing world and scholars' testing theories and practices, this study aimed to expand the TOEFL iBT integrated speaking rating scale in order to include more fine-grained distinctions that capture the progress being made by students. A sample of 119 spoken responses from two integrated tasks was analyzed to identify salient response features in delivery, language use, and topic development from (a) acoustic-based and corpus-based perspectives and (b) raters' perspectives. The acoustic-based results revealed that speech rate and content-planning pauses were significant predictors of delivery performance. The corpus-based results revealed type/token ratio, the proportion of low- and high-frequency words, error free c-units, stance adverbs, prepositional phrases, relative clauses, and passives as significant predictors of language use performance. Linking adverbials were found to be a predictor of topic development performance. These salient features, in combination with the features verbally reported by the raters, were used to inform the expansion of the speaking rating scale. Guided by empirically derived, binary-choice, boundary-definition (EBB) scales (Upshur & Turner, 1995), score levels 2 and 3 of the speaking scale were then expanded to describe features of responses slightly below and above each original level. These detailed descriptors were presented as a rating guide made up of a series of choices, resulting in an expansion of scores to 1, 1.5, 2, 2.5, 3, 3.5, and 4 within delivery, language use, and topic development.

Communicative approaches to language testing have led to the widespread use of oral language assessments in English as a second/foreign language (ESL/EFL) classroom contexts. The inclusion of oral language assessments in ESL/EFL classrooms has led teachers to use published rating scales to assess their students' speaking ability. Applying published rating scales can cause problems because these scales are often too broad to capture students' actual ability or their language progress over a course (Upshur & Turner, 1995).


As one of the most popular language tests, the Test of English as a Foreign Language Internet-based test (TOEFL iBT) has generated challenges in ESL/EFL teaching and testing because it includes the assessment of integrated speaking skills (i.e., a test taker's ability to use spoken and/or written input to produce a spoken response). With the new task types and new rating scales, it is challenging for scale developers as well as score users to ensure that scores derived from the test are appropriately interpreted. In particular, TOEFL iBT scores should inform score users about (a) the extent to which test takers are likely to perform well on speaking tasks across university settings and (b) test takers' weaknesses or strengths in communication for academic purposes. These score interpretations address two components of language assessment: score generalization and score utilization. The extent to which scores can be generalized is associated with the consistency of ratings. Research on the scoring of the TOEFL iBT speaking section has found that some raters have difficulty using the rating scales (Brown, Iwashita, & McNamara, 2005; Xi & Mollaun, 2006). In particular, raters cannot clearly distinguish among the three analytic dimensions of the scale (i.e., delivery, language use, and topic development). Experienced raters may have a good feel for a general description category on the rating scale and can simply assign a single score. However, novice raters who are trying to attend to the three dimensions before giving a holistic score may encounter problems that result in inconsistent ratings (Xi & Mollaun, 2006) and thus low inter-rater reliability. Inconsistent ratings affect the generalization of the test scores. Score utilization is related to the expectation that the TOEFL iBT speaking test will help guide curricula worldwide (e.g., Butler, Eignor, Jones, McNamara, & Suomi, 2000; Jamieson, Jones, Kirsch, Mosenthal, & Taylor, 2000; Wall & Horák, 2006, 2008). Language teachers are being encouraged to use the TOEFL iBT speaking rating scales in their classes to raise students' awareness of their abilities in relation to the TOEFL scales and to help them improve. However, the existing scales may not provide adequate guidance for teachers in many pedagogical contexts. While holistic scores have the advantage of being both practical and effective for admissions decisions, they are at a disadvantage in classroom situations because they do not provide detailed feedback. Also, the TOEFL rating scales may simply be too broad to capture a student's actual ability (Clark & Clifford, 1988; Fulcher, 1996b), and if used in classroom settings they may not sufficiently allow the teacher to indicate a student's language progress.

Purpose of the Study

The TOEFL iBT speaking rating scales may need some modification for instructional and rater training purposes. In an attempt to make the scales and their use more practical, this study aimed to expand the existing TOEFL iBT integrated speaking scale from four score levels (i.e., 1, 2, 3, and 4) to seven levels (i.e., 1, 1.5, 2, 2.5, 3, 3.5, and 4), in order to include more fine-grained distinctions that make rating easier and more consistent and that capture the progress being made by the speaker. Drawing on the literature on approaches to scale development (see also Bachman & Palmer, 1996; Bachman & Savignon, 1986; Dandonoli & Henning, 1990; Fulcher, 1987, 1997; Lowe, 1986; North, 2000; Stansfield & Kenyon, 1992), this study used an empirically based approach to expand the scale because it allowed both quantitative and qualitative data to inform the scale expansion. Inspired by the work of Upshur and Turner (1999) and their empirically derived, binary-choice, boundary-definition (EBB) scales, detailed descriptors were proposed as a rating guide with a series of binary choices within the categories of delivery, language use, and topic development.


Research Questions

The study was guided by three research questions.
1. What spoken features can distinguish examinees' performances at the score levels 1, 1.5, 2, 2.5, 3, 3.5, and 4 for delivery, language use, and topic development?
2. What are the relationships between dimension scoring profiles and the ETS holistic scores?
3. Are the expanded score levels functioning appropriately for delivery, language use, and topic development?

Methodology

Participants

The study included ten English language teachers. All were considered novice speaking raters because they either had no formal training in how to use a speaking scale or had received no such training for at least three years prior to the study. Five were male and five were female, with ages between 23 and 44. Seven were native speakers of English and three were nonnative English speakers whose first languages were Italian, Chinese, and Arabic. Eight were PhD students in applied linguistics and two were master's degree students in TESL at an American university. All had at least three years of experience teaching English, including ESL/EFL, English for academic purposes, and English composition.

A Test-Based Spoken Corpus

The test-based spoken corpus was taken from a public use data set provided by Educational Testing Service. The corpus consisted of 119 reading/listening/speaking (RLS) spoken responses to two tasks (Task 3 and Task 4). The responses represented two first-language groups: Arabic and Korean. They included 20 responses from each of score levels 1 and 4 (10 from each language group, 40 in total), 40 from score level 2 (20 from each language group), and 39 from score level 3 (only 19 available responses from Arabic and 20 from Korean). Because the focus of the study was on score levels 2 and 3, only a smaller number of responses from levels 1 and 4 was needed, to serve as the lower and upper boundaries of the scale. The corpus consisted of 13,570 words in total.

Data Collection

Data collection was conducted during the summer of 2008. First, the raters were trained to use the TOEFL iBT integrated speaking rating scale by taking an ETS online scoring tutorial from the ETS Online Scoring Network (OSN) tutorial website (permission granted by ETS), focusing on the scoring training for RLS speaking tasks. The raters were then trained to use the expanded scores and to produce verbal protocols. Each of the 10 raters was to evaluate about 96 spoken responses (from a total of 952 scorings: 119 responses x 2 scorings x 4 dimensions). The responses were systematically arranged for double scoring across level and first-language group, and the raters scored the responses independently at their convenience. The raters completed two parts while scoring. In the first part, they both scored and completed think-aloud reports: they listened to the first set of responses, scored each holistically using the expanded scores (i.e., 1, 1.5, 2, 2.5, 3, 3.5, or 4), and then recorded their verbal reports. The next set of responses was scored on delivery.


Each delivery score was followed by a verbal report. Then, a third set of responses was scored for language use, and each score was followed by a verbal report. The final set of responses was scored for topic development, again followed by verbal reports. In the second part, the raters only scored, following the same steps as in the first part but without giving verbal reports.

Data Analysis

The study employed a mixed-method approach of data analysis. Both quantitative data and qualitative data were analyzed to triangulate each other in order to answer the research questions. The quantitative analysis was conducted for each of the two RLS tasks (Tasks 3 and 4), as required by the assumption of independent observations in multiple regression.

Quantitative Data

Quantitative data analyses involved two sets of data: occurrences of linguistic features in the spoken responses and the raters' scores. The linguistic analyses were conducted to search for features, predicted in the literature, that distinguish performance across score levels for delivery, language use, and topic development (Table 1). For delivery, speech rate, filled pauses, and pause/hesitation phenomena were measured; values for these variables served as independent variables, which were regressed on the delivery scores to establish their relative importance in accounting for variance. For language use, transcripts of the spoken responses were used to analyze vocabulary range and richness as well as grammatical accuracy. They were also automatically tagged for lexico-grammatical features, using Biber's (2001) tagging program, to examine specific features (i.e., complement clauses, adverbial clauses, relative clauses, and prepositional phrases) and data-driven features reflecting grammatical complexity. Means and standard deviations were calculated for the occurring features, and features without counts were deleted from the list. For text comparison purposes, the counts of the tagged features were normalized to 100 words. Stepwise multiple regression was then computed to see which features influenced language use scores. The analysis for topic development was conducted through an investigation of cohesive devices as indicators of text coherence, using the concordancing software MonoConc. The framework of the analysis was based on Halliday and Hasan's (1976) lexico-grammatical model of cohesion, with some adjustments for clearer operational definitions of the conjunction and collocation relations. Stepwise multiple regression was computed to see which features influenced topic development scores.
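The two computational steps described above can be sketched as follows; the variable names and the forward-selection entry criterion (p < .05) are our own illustrative assumptions, not the settings of the statistical package actually used in the study.

```python
# Illustrative sketch: normalizing tagged feature counts to a rate per 100
# words, and a simple forward "stepwise" OLS selection with statsmodels.
import statsmodels.api as sm

def per_100_words(raw_count, n_words):
    """Convert a raw feature count for one response to a rate per 100 words."""
    return raw_count * 100.0 / n_words

def forward_stepwise(X, y, alpha=0.05):
    """X: pandas DataFrame of normalized features; y: scores for one dimension."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate feature when added to the current model
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return sm.OLS(y, sm.add_constant(X[selected])).fit(), selected

# e.g., model, kept = forward_stepwise(features_per_100w, language_use_scores)
# print(model.rsquared, kept)   # R-squared and the retained predictors
```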

The raters’ scores were analyzed using both traditional and FACETS analyses. First, estimates of inter-rater reliability between the two trained raters were computed. In each dimension, discrepant responses were listened to by a third rater (the researcher) and the most frequent score was used. Descriptive statistics were computed for each dimension score and the holistic score to determine the degree to which the raters actually used the expanded scale. The FACETS analysis allowed the researcher to look at the meaningfulness of the expanded levels of the scale.


Table 1. Measures for Delivery, Language Use, and Topic Development

Delivery
1. Speech rate: number of syllables (excluding fillers) per real expressed time, normalized to 60 seconds
2. Filled pauses: number of filled pauses per real expressed time, normalized to 60 seconds
3. Content planning pauses: mean number of content planning pauses produced in a given speech per 60 seconds
4. Grammatical/lexical planning pauses: mean number of grammatical planning pauses produced in a given speech per 60 seconds
5. Adding examples/extra information pauses: mean number of addition pauses in a given speech per 60 seconds
6. Grammatical/lexical repair pauses: mean number of grammatical/lexical repair pauses in a given speech per 60 seconds
7. Propositional uncertainty pauses: mean number of propositional uncertainty pauses in a given speech per 60 seconds

Language Use
8. Vocabulary richness: proportion of high- and low-frequency words used in a spoken text
9. Vocabulary range: type/token ratio in a spoken text
10. Grammatical accuracy: mean proportion of error free c-units in a spoken text
Specific grammatical complexity
11. Complement clauses: normalized counts of complement clauses in a spoken text*
12. Adverbial clauses: normalized counts of adverbial clauses in a spoken text*
13. Relative clauses: normalized counts of relative clauses in a spoken text*
14. Prepositional phrases: normalized counts of prepositional phrases in a spoken text*
15. Data-driven (grammatical complexity) features: normalized counts of lexico-grammatical features in a spoken text*

Topic Development
16. Reference devices for cohesion: normalized counts of reference devices (e.g., these, here, there) in a spoken text*
17. Conjunction devices for cohesion: normalized counts of linking adverbials (e.g., first, then, so, however) in a spoken text*

Note. * Counts were normalized to 100 words.
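As a worked example of the per-60-second operationalizations in measures 1 through 7 of Table 1, with made-up numbers:

```python
# Hypothetical example of the time-based normalizations in Table 1;
# `speaking_time` is the real expressed time in seconds.
def per_60_seconds(count, speaking_time):
    return count * 60.0 / speaking_time

speech_rate = per_60_seconds(95, 45)     # 95 syllables in 45 s -> about 126.7 per minute
filled_pauses = per_60_seconds(6, 45)    # 6 filled pauses in 45 s -> 8.0 per minute
print(round(speech_rate, 1), filled_pauses)
```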


Qualitative Data

Data derived from the think-aloud protocols were coded and analyzed by task, dimension, and score level, using Miles and Huberman's (1994) framework. The researcher and her second coder underlined pieces of information that fell into any of the three dimensions of the scale, labeled these pieces of information according to dimension, characteristic, and score, and wrote down ideas about the labels and their relationships. These ideas were then systematized into a coherent set of explanations. The researcher and the second coder discussed themes, grouped related themes, and renamed the combined categories.

Results

RQ1: What are the features that distinguish examinees' performances for delivery, language use, and topic development?

Spoken Features Distinguishing Examinees' Delivery Performances

The quantitative data analysis shows that speech rate and content planning pauses significantly predicted the examinees' delivery performance for Task 3, accounting for 76% of the total variance in delivery scores [R2 = .76, F(2, 55) = 87.45]. The positive relationships between these features (speech rate, b = .70; content planning pauses, b = .30) and the delivery scores indicate that the faster the speech rate and the more content planning pauses used in spoken responses, the higher the examinees' delivery performance levels. For Task 4, speech rate alone accounted for approximately 72% of the total variance in delivery scores [R2 = .72, F(1, 59) = 152.89], with a very strong influence (b = .85), indicating that the more syllables produced by the examinees, the higher their delivery performance levels. In combination with the qualitative analyses, Table 2 summarizes the features that were likely to distinguish examinees between adjacent score levels. For example, between score levels 1 and 1.5, slow speech rate and a small number of content planning pauses distinguished among the speakers. The delivery features that were also considered included pronunciation at the word and phrasal levels and short responses (as predictors of level 1) as well as fluidity, hesitations, clarity of speech, intonation, the speaker's attempt to respond, and listener effort (as predictors of level 1.5). Between scores 3 and 3.5, the speaker's confidence in delivering speech, pauses for word search or content planning, and native-like pace and pauses may distinguish spoken responses. Native-like or natural delivery in general could potentially distinguish spoken responses at a score of 3.5 from those at a score of 4.

Spoken Features Distinguishing Examinees' Language Use Performances

For language use, the quantitative analyses revealed that type/token ratio, low/high frequency words, relative clauses, and stance adverbs were the best predictors for Task 3, accounting for 58% of the total variance in examinees' language use scores [R2 = .58, F(1, 53) = 17.97]. The positive relationships between these features (type/token ratio, b = .40; low/high frequency, b = .33; relative clauses, b = .30; stance adverbs, b = .21) and the examinees' language use performance signify that the more these language features appeared in the speech, the higher the examinees' scores in language use.


The analysis for Task 4 shows that the significant predictors included prepositional phrases, error free c-units, relative clauses, and passives, accounting for 54% of the total variance in examinees' language use scores [R2 = .54, F(1, 56) = 16.59]. The positive relationships between the language use scores and these features (prepositional phrases, b = .29; error free c-units, b = .37; relative clauses, b = .37; passives, b = .24) indicate that the more of these features the speakers produced, the higher the language use scores they received.

Together with the quantitative data, features identified in the raters' verbal reports were likely to distinguish between adjacent score levels (Table 3). For example, the features separating examinees at score levels 1.5 and 2 included the speakers' attempts to use complex structures, though without success, and the speakers' lack of confidence in producing language. An automatic use of advanced vocabulary and a wide range of complex structures were the characteristics that distinguished speech samples receiving scores of 3.5 and 4.

Table 2. Salient Features for Delivery at Adjacent Score Levels

Score 1. Quantitative: slower speech rate; a smaller number of content planning pauses. Qualitative: pronunciation at word and phrasal levels; short length of speech; unintelligible speech.
Score 1.5. Qualitative: not fluent; lots of hesitations; unclear speech; problematic intonation; speaker's attempt to respond to the task; a great deal of listener effort.
Score 2. Qualitative: choppy pace; L1 influence; listener effort.
Score 2.5. Qualitative: repetition of words/phrases and use of false starts; unclear pronunciation; monotone speech.
Score 3. Qualitative: effects of use of fillers (distracting and challenging the listener).
Score 3.5. Qualitative: speaker's confidence in delivering speech; pauses for word search; "nativeness" or "naturalness" of intonation, pace, and pauses.
Score 4. Quantitative: faster speech rate; a larger number of content planning pauses. Qualitative: "nativeness" or "naturalness" of overall delivery; pauses for information recall.


Table 3. Salient Features for Language Use at Adjacent Score Levels

Score 1. Quantitative: a smaller type/token ratio; a smaller proportion of low/high frequency words; a smaller number of error free c-units; less use of stance adverbs, prepositional phrases, relative clauses, and passives. Qualitative: length of speech.
Score 1.5. Qualitative: repetitions of words/phrases (including key words from the prompt).
Score 2. Qualitative: speakers' attempt to use complex structures, but not successful; speakers' lack of confidence in producing language.
Score 2.5. Qualitative: use of repairs or self-corrections; nominal substitution.
Score 3. Qualitative: ability to use developed complex structures.
Score 3.5. Qualitative: repetitions of sophisticated words; minor errors of vocabulary and grammar use.
Score 4. Quantitative: a larger type/token ratio; a larger proportion of low/high frequency words; a larger number of error free c-units; more use of stance adverbs, prepositional phrases, relative clauses, and passives. Qualitative: an automatic use of advanced vocabulary; a wide range of complex structures.

Spoken Features Distinguishing Examinees' Topic Development Performances

The quantitative analysis shows that for Task 3 linking adverbials were a significant predictor, accounting for 7% of the total variance in topic development scores [R2 = .07, F(1, 56) = 4.06]. The positive relationship between this feature and the topic development scores (b = .26) indicates that the more linking adverbials were produced, the higher the scores the examinees received for topic development.


For Task 4, a multiple regression could not be computed because neither linking adverbials nor reference devices were significantly correlated with the topic development scores.

The qualitative analysis exhibits various features that may have influenced the raters when scoring topic development. In combination with the quantitative results, Table 4 summarizes the features that were likely to distinguish examinees' performance on topic development. The features distinguishing between scores 1 and 1.5, for example, were a serious lack of relevant information required by the task (score 1) and poor connection of ideas and text coherence (score 1.5). Inclusion of all key points required by the task and a good synthesis of the prompt were likely to separate speakers at score level 3.5 from those at score level 4.

Table 4. Salient Features for Topic Development at Adjacent Score Levels

Score 1. Quantitative: (a smaller number of linking adverbials). Qualitative: serious lack of relevant information required by the task.
Score 1.5. Qualitative: poor connection of ideas and text coherence.
Score 2. Qualitative: inaccuracy and vagueness of information.
Score 2.5. Qualitative: difficulty in developing speech and connecting ideas at times.
Score 3. Qualitative: speakers' comprehension of the stimulus or task; some accuracy of ideas.
Score 3.5. Qualitative: inclusion of introduction and conclusion.
Score 4. Quantitative: (a larger number of linking adverbials). Qualitative: all key points required by the task; good synthesis of the prompt.

RQ2: What are the relationships between dimension scoring profiles and the ETS holistic scores?

Relationships between the dimension scoring profiles (using the expanded scores) and the ETS holistic scores (i.e., 1, 2, 3, and 4) were examined to (1) identify the different scoring profiles used by the novice raters in scoring the 119 spoken responses that received particular holistic scores and (2) examine which of these profiles contributed most to each holistic score level. By listing the dimension consensus scores of the responses classified into each ETS holistic score level, different profiles were drawn to illustrate how the consensus scores for individual dimensions of particular responses were identical to, greater than, or less than the holistic score. To report the profiles, the frequency and percentage of the scoring patterns were used to reveal which of the emerging profiles most accounted for each ETS holistic score.
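A minimal sketch of how such profiles can be tallied, assuming a hypothetical table of consensus dimension scores and ETS holistic scores; the column names, function names, and scores below are made up for illustration.

```python
# Illustrative profile tally over a small hypothetical data set.
import pandas as pd

def profile(delivery, language_use, topic_development, holistic):
    """Encode each dimension as '=', '>', or '<' relative to the ETS holistic score."""
    sym = lambda s: '=' if s == holistic else ('>' if s > holistic else '<')
    return sym(delivery) + sym(language_use) + sym(topic_development)

df = pd.DataFrame({
    'delivery':          [4.0, 4.0, 3.5, 3.0],
    'language_use':      [4.0, 4.0, 3.0, 2.0],
    'topic_development': [4.0, 3.5, 3.0, 2.0],
    'ets_holistic':      [4,   4,   3,   2],
})
df['profile'] = [profile(d, l, t, h) for d, l, t, h in
                 zip(df.delivery, df.language_use, df.topic_development, df.ets_holistic)]
# Frequency of each profile per holistic level (normalize=True would give percentages)
print(df.groupby('ets_holistic')['profile'].value_counts())
```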


Before looking at the results, the scoring guidelines for the TOEFL iBT integrated spoken responses are restated here to help the reader understand the results. The scoring guidelines (Educational Testing Service, 2008) explain that a rater must consider all three performance features. To score at level 4 (the highest level), a response must meet all three performance dimensions at that level. For example, a response that receives a 3 for delivery, a 4 for language use, and a 4 for topic development would receive a score of 3, not 4. To score at level 3, 2, or 1, a response must meet a minimum of two of the performance criteria at that level. For example, a response receiving a 3 for delivery, a 2 for language use, and a 2 for topic development would receive a score of 2.
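The guideline can be read as a simple decision rule. The sketch below encodes one reasonable interpretation, treating "meeting the criteria at a level" as scoring at or above that level on a dimension; the function name and this reading are our assumptions, not ETS's published algorithm.

```python
# Sketch of the holistic scoring rule paraphrased above, for integer
# dimension scores; names and the >= reading are our own assumptions.
def holistic_score(delivery, language_use, topic_development):
    dims = (delivery, language_use, topic_development)
    if all(d == 4 for d in dims):            # a 4 requires all three dimensions at 4
        return 4
    for level in (3, 2, 1):                  # otherwise, highest level met by >= 2 dimensions
        if sum(d >= level for d in dims) >= 2:
            return level
    return 1

# The two examples from the guidelines:
assert holistic_score(3, 4, 4) == 3          # not a 4, because not all dimensions are 4
assert holistic_score(3, 2, 2) == 2
```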

In the results for this research question, not only were the different scoring profiles identified, but the profiles the raters produced were also compared with the expected scoring patterns described in the ETS scoring guidelines. The results report both the expected scoring profiles (i.e., those corresponding to the ETS scoring guidelines) and the unexpected profiles (i.e., those not corresponding to the guidelines).

Dimension Scoring Profiles for ETS Holistic Score 4

For the 20 spoken responses that received an ETS holistic score of 4, the raters' scoring with the expanded scale can be categorized into five scoring profiles (Table 5). Among these, profile A was the expected profile: the raters assigned a score of 4 on all three performance dimensions for six responses (30%). (The symbol "= 4" indicates that the expanded-scale score for a dimension equals 4, in line with the ETS guideline that a response receiving a holistic 4 must meet all three performance features at score level 4.)

The other four profiles were unexpected, as they did not correspond to the ETS guidelines. For example, profile B showed that the raters judged two of the performance features (delivery and language use) to be equal to 4 (= 4) while topic development was judged to be less than 4 (< 4); this pattern accounted for 20% of the responses.

Table 5. Dimension Scoring Profiles for Holistic Score 4

Profile   Delivery   Language Use   Topic Development   Frequency (N = 20)   %
A         = 4        = 4            = 4                 6                    30
B         = 4        = 4            < 4                 4                    20
C         = 4        < 4            < 4                 2                    10
D         < 4        < 4            = 4                 2                    10
E         < 4        < 4            < 4                 6                    30

Dimension Scoring Profiles for ETS Holistic Score 3

The scores assigned by the novice raters to the 39 spoken responses that received an ETS holistic score of 3 patterned into fifteen scoring profiles (Table 6). Among these, profiles A to F were the expected scoring profiles, because the ETS guidelines state that a response scored 3 must meet a minimum of two of the performance criteria at that level (i.e., at least two "= 3" symbols in each profile). These six profiles, in which the raters gave a 3 on at least two performance dimensions, accounted for nearly 26% of the responses at holistic score level 3 (10 responses out of 39).

Profiles G to I were also considered expected profiles: the raters did not assign a score of 4 on all three dimensions, so under the ETS guidelines the responses could not receive a 4.


Instead, the raters scored these responses a 4 on two dimensions and a 3 on one dimension; thus, the responses received a score of 3, not 4. About 28% of the responses at holistic score level 3 (11 responses out of 39) can be explained by these three scoring profiles combined. The remaining profiles, profiles J to O (46%), were considered unexpected, as they did not follow the ETS guidelines.

Table 6. Dimension Scoring Profiles for Holistic Score 3

Profile   Delivery   Language Use   Topic Development   Frequency (N = 39)   %
A         = 3        = 3            = 3                 2                    5.13
B         = 3        = 3            < 3                 2                    5.13
C         = 3        = 3            > 3                 2                    5.13
D         < 3        = 3            = 3                 2                    5.13
E         = 3        > 3            = 3                 1                    2.56
F         > 3        = 3            = 3                 1                    2.56
G         > 3        = 3            > 3                 5                    12.82
H         > 3        > 3            = 3                 4                    10.26
I         = 3        > 3            > 3                 2                    5.13
J         = 3        > 3            < 3                 2                    5.13
K         = 3        < 3            < 3                 2                    5.13
L         < 3        = 3            < 3                 2                    5.13
M         > 3        = 3            < 3                 2                    5.13
N         > 3        > 3            > 3                 9                    23.07
O         > 3        < 3            > 3                 1                    2.56

Dimension Scoring Profiles for ETS Holistic Score 2

There were thirteen dimension scoring profiles for the 40 spoken responses that received the ETS holistic score of 2 (Table 7).

Table 7. Dimension Scoring Profiles for Holistic Score 2

Profile   Delivery   Language Use   Topic Development   Frequency (N = 40)   % (100)
A         = 2        = 2            = 2                  2                   5
B         = 2        > 2            = 2                  3                   7.5
C         > 2        = 2            = 2                  2                   5
D         = 2        = 2            > 2                  1                   2.5
E         = 2        = 2            < 2                  4                   10
F         = 2        < 2            = 2                  1                   2.5
G         > 2        > 2            = 2                  6                   15
H         > 2        = 2            > 2                  2                   5
I         < 2        = 2            < 2                  2                   5
J         = 2        < 2            < 2                  1                   2.5
K         > 2        > 2            > 2                  12                  30
L         > 2        > 2            < 2                  3                   7.5
M         < 2        < 2            < 2                  1                   2.5


Six of these profiles, A to F, were expected scoring profiles because they corresponded to the ETS scoring guidelines for score level 2: in each of them, the raters assigned a score of 2 on at least two dimensions. Together they accounted for 32.5% of the responses at this score level. Profiles G to M were unexpected at this score level because none of them showed the raters giving a score of 2 on at least two dimensions. The unexpected profiles accounted for 67.5% of the responses, a higher proportion than the expected profiles.

Dimension Scoring Profiles for ETS Holistic Score 1

Five scoring profiles were found at the lowest score level (Table 8). Profiles A, B, and C were congruent with the ETS guidelines for score level 1, accounting for 25% of the responses at this holistic score level; in each, the raters assigned a score of 1 on at least two performance dimensions. The other two profiles, D and E, did not correspond to the ETS guidelines and were therefore considered unexpected. In both, the raters judged performance on at least two dimensions to be better than a score of 1. Profile D accounted for 30% and profile E for 45% of the responses at holistic score 1.

Table 8. Dimension Scoring Profiles for Holistic Score 1

Profile   Delivery   Language Use   Topic Development   Frequency (N = 20)   % (100)
A         = 1        = 1            = 1                  3                   15
B         = 1        = 1            < 1                  1                   5
C         = 1        > 1            = 1                  1                   5
D         > 1        > 1            = 1                  6                   30
E         > 1        > 1            > 1                  9                   45

RQ3: Are the expanded score levels functioning appropriately for delivery, language use, and topic development?

This section reports results from the inter-rater reliability analysis of the double ratings for holistic and dimension scores, followed by results from the FACETS analysis. As shown in Table 9, the inter-rater reliability analysis showed acceptable criterion reliability values, ranging from r = 0.69 to r = 0.88 (Brown & Hudson, 2002). For both tasks combined, holistic scoring showed the highest inter-rater reliability (r = 0.81), followed by delivery and language use (r = 0.78) and topic development (r = 0.75).

Table 9. Inter-rater Reliability by Task Types and Score Dimensions

Score Dimension      Task 3 (N = 58)   Task 4 (N = 61)   Both Tasks (N = 119)
Overall/Holistic     0.88*             0.74*             0.81*
Delivery             0.81*             0.74*             0.78*
Language Use         0.80*             0.75*             0.78*
Topic Development    0.81*             0.69*             0.75*
* p < 0.05
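If the coefficients in Table 9 are Pearson correlations between the two raters' scores (an assumption; the exact coefficient is not specified here), they could be computed along the following lines. This is a sketch only; the rater arrays are invented placeholders, and the study's handling of double ratings may differ.

```python
from scipy.stats import pearsonr

def interrater_reliability(rater1, rater2):
    """Pearson correlation between two raters' scores for one dimension."""
    r, p = pearsonr(rater1, rater2)
    return r, p

# Hypothetical double ratings on the expanded 0-4 scale (one value per response).
rater1_delivery = [4, 3.5, 3, 2.5, 2, 1.5, 3, 2, 4, 1]
rater2_delivery = [4, 3, 3, 2.5, 2, 2, 3.5, 2, 3.5, 1]

r, p = interrater_reliability(rater1_delivery, rater2_delivery)
print(f"Delivery inter-rater reliability: r = {r:.2f} (p = {p:.3f})")
```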


The results from the FACETS analysis on the functioning of the expanded score levels for the individual dimensions were based on (1) average measures, (2) "Most Probable from" thresholds, and (3) mean-square outfit statistics.

Functioning of the Expanded Score Levels for Delivery

The FACETS results show that the average measure values were ordered from the lowest (-4.04) to the highest (5.25) as the score levels increased (Table 10). All of the "Most Probable from" thresholds for the rating scale also increased as the score levels increased. These patterns indicate that the score levels for delivery were ordered appropriately and were distinguishable. Most of the mean-square outfit statistics were within the acceptable range of 0.5 to 1.5; the exception was score level 4, with an outfit mean-square value of 1.6, suggesting that there could be an unexpected component in the ratings at that level. Otherwise, each score level contributed to meaningful measurement, distinguishing degrees of performance as intended.

Table 10. Score Level Statistics for Delivery

Score Level   Average Measure   Most Probable From   Outfit MnSq
0             -4.04             Low                  0.7
1             -2.67             -5.60                1.2
1.5           -2.18             -2.73                0.7
2             -0.72             -1.81                0.5
2.5            0.93             -0.28                0.9
3              2.79              1.65                0.8
3.5            4.38              3.65                0.8
4              5.25              5.12                1.6

Note. "Low" shows the most probable score level at the low end of the scale. Score level 0 was present in the data but not included in the report due to a very small number (i.e., only one zero found in delivery).
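The three category-functioning criteria used here (monotonically increasing average measures, monotonically increasing "Most Probable from" thresholds, and outfit mean-squares within 0.5 to 1.5) can be checked mechanically once the FACETS category statistics are exported. The sketch below is illustrative only: the numbers are those reported in Table 10 (score level 0 omitted because it has no lower threshold), and the 0.5-1.5 range simply follows the acceptable range cited in the text.

```python
# Category statistics for delivery, transcribed from Table 10.
levels      = [1, 1.5, 2, 2.5, 3, 3.5, 4]
avg_measure = [-2.67, -2.18, -0.72, 0.93, 2.79, 4.38, 5.25]
thresholds  = [-5.60, -2.73, -1.81, -0.28, 1.65, 3.65, 5.12]
outfit      = [1.2, 0.7, 0.5, 0.9, 0.8, 0.8, 1.6]

def strictly_increasing(xs):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(xs, xs[1:]))

print("Average measures ordered:", strictly_increasing(avg_measure))
print("Thresholds ordered:      ", strictly_increasing(thresholds))

misfitting = [lvl for lvl, mnsq in zip(levels, outfit) if not 0.5 <= mnsq <= 1.5]
print("Levels with outfit outside 0.5-1.5:", misfitting)   # expect [4]
```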

Functioning of the Expanded Score Levels for Language Use

Table 11 displays the higher average measures that corresponded to the higher score levels for the language use dimension.

Table 11. Score Level Statistics for Language Use

Score Level   Average Measure   Most Probable From   Outfit MnSq
0             -4.32             Low                  0.80
1             -3.44             -5.16                0.80
1.5           -2.51             -3.63                1.20
2             -0.60             -2.14                0.80
2.5            0.80              0.21                0.80
3              2.82              1.39                1.10
3.5            4.22              3.92                0.90
4              5.45              5.41                0.90

Note. “Low” shows the most probable score level at the low end of the scale. Score level 0 was present in the data but not included in the report due to a very small number (i.e., only three zeros found in language use).


All of the "Most Probable from" threshold values likewise increased with the score levels. These patterns suggest that the score levels were functioning appropriately to differentiate degrees of performance. All of the mean-square outfit statistics fell within the acceptable range of 0.50 to 1.50, with the lowest value at 0.80 and the highest at 1.20, indicating that the score levels were used to measure speaking performance meaningfully, as intended.

Functioning of the Expanded Score Levels for Topic Development

Similar to the evidence found in the delivery and language use dimensions, the score levels tended to be functioning properly for topic development (Table 12). This can be illustrated by the fact that the Average Measures and the “Most Probable from” Thresholds were ordered from the lowest to the highest when the score levels increased. In addition, the mean-square outfit statistics show that all of the values (ranging from 0.80 to 1.30) were in the acceptable range of 0.50 to 1.50. This suggested that the score levels were meaningfully contributing to measurement in topic development.

Table 12. Score Level Statistics for Topic Development

Score Level   Average Measure   Most Probable From   Outfit MnSq
0             -4.79             Low                  0.80
1             -2.76             -5.34                1.30
1.5           -1.42             -2.12                1.00
2             -0.08             -0.90                0.80
2.5            0.64              0.66                0.90
3              2.12              0.86                0.80
3.5            3.28              3.30                0.80
4              3.89              3.55                1.10

Note. “Low” shows the most probable score level at the low end of the scale. Score level 0 was present in the data but not included in the report due to a very small number (i.e., only six zeros found in topic development).

Discussion and Conclusion

The findings are discussed in four sections: (1) dimension scoring profiles in comparison to the ETS holistic scores, (2) functioning of the expanded score levels, (3) spoken features that distinguished test takers' performances in the delivery, language use, and topic development dimensions, and (4) how the evidence from these topics leads to an expansion of the TOEFL iBT's integrated speaking scale and the development of rating guides for the three dimensions.

Dimension Scoring Profiles and the ETS Holistic Scores

The analysis of the scoring profiles that emerged for each dimension shows interesting relationships between the dimension scoring profiles and the ETS holistic scores, and the extent to which these profiles contributed to each holistic score level. Using the ETS scoring guidelines as the criteria, different scoring profiles were found for each holistic score level. At every holistic score level except 3, the expected profiles accounted for a smaller percentage of the responses than the unexpected profiles. This indicates that when the raters attended to the individual dimensions separately, they were able to concentrate on each dimension freely; in other words, their judgments of one dimension were not masked by the other dimensions, as can happen when assigning holistic scores (Brown, Iwashita, & McNamara, 2005).

Although the expected scoring profiles were relatively few, dimension scoring demonstrates that the TOEFL iBT's spoken responses can be scored both analytically and holistically. Holistic scoring (as currently used with the TOEFL iBT's speaking scale) is appropriate in many situations, for example when there are large numbers of test takers and the scores are used for admission to higher education institutions. Analytic scoring, on the other hand, should be considered in low-stakes situations that focus on diagnosis or on test takers' improvement and feedback (e.g., in classroom settings). In this regard, the expanded scale with analytic scoring for each dimension could allow novice raters or teachers to use the scale more easily and effectively for indicating a student's language progress.

The differences between the dimension scores and the ETS holistic scores raise an important issue regarding scoring reliability. The score differences support the development of rating guides for all three dimensions to help improve inter-rater reliability. Once inter-rater reliability is improved, test users can be more confident that the scores are used appropriately and as intended. The clear and appropriate use of test scores in turn contributes to score utilization (e.g., how the scores inform what should be done in English curricula or language programs), which is an important component of building the TOEFL iBT validity argument.

Functioning of the Expanded Score Levels for the Three Dimensions

Because the purpose of this study was to expand the TOEFL iBT's integrated speaking rating scale from four score levels (i.e., 1, 2, 3, and 4) to seven levels (i.e., 1, 1.5, 2, 2.5, 3, 3.5, and 4), it was necessary to explore whether the use of the seven levels in the expanded scale is plausible. The FACETS analysis provided evidence that generally confirmed appropriate functioning of the seven score levels for delivery, language use, and topic development: the seven expanded score levels have the potential to distinguish test takers' performance in the three dimensions. This appropriate functioning, in combination with high inter-rater reliability in each dimension, indicates that the novice raters in this study were able to use the expanded scores consistently and to distinguish the score levels clearly. This not only supports the expansion of the speaking scale but also signals the potential for language teachers who are unfamiliar with scoring students' speaking performance, and for newly hired raters, to be trained to use the seven expanded score levels.

Spoken Features Distinguishing Test Takers' Performances

In reconciling the findings of the acoustic, linguistic, and verbal protocol analyses, it was found that a combination of features could predict examinees’ speaking performance on delivery, language use, and topic development across score levels and tasks. The following sections discuss the salient features in each dimension followed by how these features were used to inform expansion of the TOEFL iBT’s integrated speaking rating scale with the development of new rating guides for delivery, language use, and topic development.


Spoken Features for Delivery

Speech rate was found to be a significant predictor for Tasks 3 and 4, and content planning pauses were a significant predictor for Task 3. The finding for speech rate supports evidence from previous studies of fluency in spoken language showing that speech rate is one of the best variables for distinguishing between fluent and disfluent speakers (e.g., Derwing & Munro, 1997; Ejzenberg, 2000; Freed, 2000; Kang, 2008; Kormos & Dénes, 2004). The evidence for content planning pauses supports Fulcher's findings from the validation of a rating scale of oral language fluency (Fulcher, 1993, 1996a). The positive relationship between these features and the delivery scores indicates that speakers with higher scores tended to speak faster and to use more pauses to plan their content than those with lower scores.

Insights obtained from the raters' verbal protocols were also considered. The salient spoken features from the qualitative data reflect the features that the raters most often attended to at each score level. For delivery, at the lower end of the scale (i.e., scores 1 to 2), the raters commented on their difficulty in understanding the spoken responses. This difficulty was usually caused by the speakers' inability to sustain speech, poor pronunciation, limited intonation, a very choppy pace, frequent hesitations, and L1 influence. At the higher end (i.e., scores 2.5 to 4), the raters understood the responses better but still mentioned some problems, including repeated words, false starts, and fillers. The raters also attended to the speakers' ability to deliver the responses naturally.

Because the goal of the study was to expand the TOEFL iBT's integrated speaking scale through the development of new rating guides, the salient features derived from both the acoustic and the protocol analyses that were in line with the TOEFL iBT descriptors were considered for inclusion in the expanded scale and rating guide. It is particularly valuable when a feature drawn from one analysis is consistent with a feature from another analysis. In the delivery dimension, for example, content planning pauses (from the acoustic analysis) and pauses to recall information or content of the task (from the raters' protocol analysis) were consistent, so this feature was included in the rating guide to distinguish speakers' delivery performance. The following explanation focuses on how the salient delivery features were used in the development of a rating guide for delivery.

A Proposed Rating Guide for Delivery

Based on the quantitative and qualitative results of the study as well as evidence of the appropriate functioning of the expanded score levels in the delivery dimension, a new rating guide using seven binary questions in three levels is proposed (Figure 1). The development of the rating guide for delivery followed five steps, as suggested by Upshur & Turner (1995).

First, the salient delivery features from both the acoustic analysis and the verbal protocols for responses at score levels 2 and 2.5 were divided into the better and the poorer; here, speech rate and the raters' ability to understand the speech distinguished the better from the poorer. Second, a criterion question was formulated to classify performances as "upper-half" or "lower-half." This question was used as the first question in the rating guide (question 1), asking whether the speaker produces "understandable speech at a natural speed." Third, binary questions were developed, beginning with the upper-half. Working with the upper-half salient features, the features found at score levels 4 and 3.5, but not at 3 and 2.5, were selected. Here, clarity of speech and varied intonation stood out at the higher levels, whereas the opposite qualities were present at the lower ones. Thus, the criterion question asks whether the speaker has "clear pronunciation and varied intonation" to distinguish speakers at 4 and 3.5 from those at 3 and 2.5 (question 2a). Fourth, at the third level of the rating guide, two criterion questions were formulated to discriminate score level 4 from 3.5 and 3 from 2.5. Using the feature that is dominant in score 4 but not in 3.5, the question is "Does the speaker use pauses naturally (to recall information or plan content) before continuing his/her speech?" (question 3a). A similar method was applied to distinguish speakers at levels 3 and 2.5: the salient feature found in score 3 but not in 2.5 was used to generate the criterion question for these scores (question 3b). Because the speakers' use of fillers, repetition of words, and false starts was found in score 3 but not in 2.5, the question was posed as "Does the speaker show his/her hesitancy by occasionally using fillers but with few repetitions and false starts?"

Fifth, steps 3 and 4 were repeated for the lower-half performances. This time the lowest score level of the TOEFL scale (i.e., 0) was included to complete the scale. Question 2b was formulated using the salient feature that was found in score levels 2 and 1.5, but rare in 1 and 0. Thus, the question asks if some listener effort is required. Then criterion questions were formulated to distinguish score level 2 from 1.5 and score 1 from 0. Question 3c was based on the dominant feature in score 2 (i.e., problematic pronunciation with a little L1 interference) but not in 1.5. Question 3d addressed the feature that made score 1 more salient than score 0 (i.e., speaker’s attempt to respond with partial intelligibility).

Figure 1. A proposed rating guide for delivery. The guide is a binary decision tree with three levels of yes/no questions; a "yes" answer leads toward the higher score at each branch, and the eight leaves correspond to score levels 4, 3.5, 3, 2.5, 2, 1.5, 1, and 0.

Level 1
1) Does the speaker produce understandable speech at a natural speed? (Yes = upper-half; No = lower-half)

Level 2
2a) Upper-half: Does the speaker have clear pronunciation and varied intonation?
2b) Lower-half: Is only some listener effort required?

Level 3
3a) Does the speaker use pauses naturally (to recall information or plan content) before continuing his/her speech? (Yes = 4; No = 3.5)
3b) Does the speaker show his/her hesitancy by occasionally using fillers but with few repetitions or false starts? (Yes = 3; No = 2.5)
3c) Does the speaker have problematic pronunciation but with a little L1 influence? (Yes = 2; No = 1.5)
3d) Is the speaker's attempt to respond partially intelligible? (Yes = 1; No = 0)
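The branching logic of Figure 1 amounts to three yes/no decisions per response. A minimal sketch of how such a guide might be implemented follows; the function name and argument structure are hypothetical, and the yes/no answers would come from a rater working through the guide.

```python
def score_with_guide(q1, q2, q3):
    """Score a response with a three-level binary rating guide (cf. Figure 1).

    q1: answer to the level-1 question (True = upper-half, False = lower-half)
    q2: answer to question 2a (if q1 is True) or 2b (if q1 is False)
    q3: answer to the relevant level-3 question (3a, 3b, 3c, or 3d)
    Returns one of the eight score levels 0-4.
    """
    if q1:                            # upper-half
        if q2:                        # 2a: clear pronunciation, varied intonation
            return 4 if q3 else 3.5   # 3a distinguishes 4 from 3.5
        return 3 if q3 else 2.5       # 3b distinguishes 3 from 2.5
    if q2:                            # 2b: only some listener effort required
        return 2 if q3 else 1.5       # 3c distinguishes 2 from 1.5
    return 1 if q3 else 0             # 3d distinguishes 1 from 0

# Example: understandable speech (yes), clear pronunciation/intonation (yes),
# natural planning pauses (no) -> 3.5
print(score_with_guide(True, True, False))
```

The same tree structure, with different questions, underlies the language use and topic development guides proposed below.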


Spoken Features for Language Use

The analyses of theoretically motivated and data-driven linguistic features for Tasks 3 and 4 yielded different predictors of the examinees' oral performance in language use. For Task 3, type/token ratio, the proportion of low- and high-frequency words, relative clauses, and stance adverbs were significant predictors. Error-free c-units, prepositional phrases, relative clauses, and passives significantly predicted language use performance for Task 4. These results, similar to those of Brown, Iwashita, and McNamara (2005), indicate that task differences may affect the language used by speakers; in particular, different prompts or topics may result in the use of different linguistic features. Although speakers used different linguistic features on the different tasks, it is reasonable to combine the predictors from both tasks as one set of predictors for language use performance, since the goal of the study was to develop a single rating guide for the language use dimension.

Among the seven predictors of language use performance, three are word-level measures, three represent features at the phrasal and clausal levels, and one is related to grammatical accuracy. Type/token ratio and the proportion of low- and high-frequency words concern the use of words in the spoken responses, whereas stance adverbs reflect particular lexical features of linguistic complexity. Prepositional phrases, relative clauses, and passives represent the use of complex structures. Error-free c-units address the accuracy of language use.

The finding for type/token ratio contradicts the evidence found by Daller, Van Hout, and Treffers-Daller (2003). In this study, type/token ratio was larger for higher-level examinees than for lower-level examinees, whereas Daller, Van Hout, and Treffers-Daller found significantly larger ratios for lower-level speakers. Despite these different results, type/token ratio can be used to measure the lexical richness of texts (Vermeer, 2000). Another predictor related to vocabulary use, the proportion of low- and high-frequency words, also figured in Vermeer's (2000) study of lexical richness in spontaneous speech. Vermeer's findings suggest that analyzing lexical frequency at different levels, distinguishing basic (high-frequency) from advanced (low-frequency) lexical items used by speakers, can be used to measure the lexical richness of speakers of different ability levels (Laufer & Nation, 1995). As for particular lexical features, stance adverbs represent the data-driven lexico-grammatical structures for linguistic complexity in this study. Their emergence shows that the lexico-grammatical analysis allows the researcher to examine variability of language use at different levels, especially where language learners are likely to produce short stretches of language at the word or phrasal level (Poonpon, 2007; Rimmer, 2006).
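Two of the word-level predictors discussed here, type/token ratio and the proportion of high- versus low-frequency words, are straightforward to compute from a transcript. The sketch below is purely illustrative: the tiny high-frequency word set stands in for a proper frequency band list of the kind used in Laufer and Nation's (1995) lexical frequency profiling, and the tokenization is deliberately naive.

```python
import re

# Stand-in for a real high-frequency band list (e.g., the most frequent word families);
# an actual analysis would load a published frequency list instead.
HIGH_FREQUENCY = {"the", "a", "is", "are", "and", "to", "of", "in", "it", "that",
                  "he", "she", "they", "have", "do", "say", "go", "good", "very"}

def lexical_measures(transcript):
    """Compute type/token ratio and high-/low-frequency word proportions."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return {"tokens": 0, "type_token_ratio": 0.0,
                "prop_high_frequency": 0.0, "prop_low_frequency": 0.0}
    types = set(tokens)
    high = sum(1 for t in tokens if t in HIGH_FREQUENCY)
    return {
        "tokens": len(tokens),
        "type_token_ratio": round(len(types) / len(tokens), 3),
        "prop_high_frequency": round(high / len(tokens), 3),
        "prop_low_frequency": round(1 - high / len(tokens), 3),
    }

print(lexical_measures("The professor says that the experiment is very good and it is easy to do."))
```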

The analysis of grammatical complexity motivated by the literature revealed two significant predictors representing high-level complexity: prepositional phrases and relative clauses. These two structures usually occur more frequently in written language than in spoken language (Biber, Johansson, Leech, Conrad, & Finegan, 1999), and grammatical features used in written language are likely to be more complex (Bygate, 2002; Norrby & Håkansson, 2007). In particular, Biber, Gray, and Poonpon (2009) have argued that prepositional phrases and relative clauses function as constituents in noun phrases. Occurrences of these two structures thus signify complex noun phrases (i.e., non-clausal features embedded in noun phrases), and such embedded phrasal features represent "a considerably higher degree of production complexity" (Biber, Gray, & Poonpon, 2009, p. 22). When speakers used more of these complex structures that are common in written language, they showed their ability to construct higher-level complex grammatical features (Biber, Gray, & Poonpon, 2009). In other words, higher-level speakers are likely to use more complex structures in their oral language than lower-level speakers (Lennon, 1990). The finding of this study supports this theoretical position because the use of prepositional phrases and relative clauses distinguished between examinees of high and low proficiency. It also supports subordinated clauses (Biber, Gray, & Poonpon, 2009; Foster & Skehan, 1996; Norrby & Håkansson, 2007), particularly dependent clauses functioning as constituents in a noun phrase, and phrasal dependent structures functioning as constituents in noun phrases (Biber, Gray, & Poonpon, 2009; Rimmer, 2008), as plausible measures of grammatical complexity for oral discourse. These complexity measures allow researchers to capture complex structures at non-clausal levels.

The finding for error-free c-units agrees with studies on the scoring of the TOEFL iBT speaking test (Brown, Iwashita, & McNamara, 2005; Iwashita, Brown, McNamara, & O'Hagan, 2007), which used error-free c-units to measure the global accuracy of test takers' language use and recommended this global measure because it significantly predicted test takers' ability to use oral language. In addition to the findings from the linguistic analysis, the data from the raters' protocols were considered. At the lower end of the scale (i.e., scores 1 to 2), the raters attended to the speakers' lack of vocabulary and grammar, which led to the use of basic vocabulary and structures, and they mentioned many grammatical errors. These features were mentioned less often at the higher scores (i.e., scores 2.5 and 3). At scores 3.5 and 4, the raters noted more accurate use of vocabulary and grammar as well as a wider range and better control of advanced vocabulary and complex structures.

Some of the salient features corresponded to the significant predictors found in the prior analysis of grammatical complexity for language use. Different complex structures were mentioned at different score levels (i.e., relative clauses and prepositional phrases at level 2, that-complement clauses and if-then constructions at level 2.5, passive voice at level 3, adverbial clauses at level 3.5, and adverbs, that-complement clauses, noun clauses, and relative clauses at level 4). The raters' attention to the examinees' use of relative clauses agreed with the finding from the quantitative data. Another salient feature mentioned by the raters that concurred with the results of the linguistic analysis was the use of advanced vocabulary by speakers at score level 4, which was congruent with the finding for the proportion of low- and high-frequency words. The raters' comments on vocabulary use at the word level at score level 1.5 also matched another predictor, lexical verbs, found in the linguistic analysis. These consistent features were valuable for the development of the rating guide because they can distinguish examinees' performance on language use. The insights from the raters' verbal reports and the findings from the linguistic analysis were therefore used in developing the rating guide.

A Proposed Rating Guide for Language Use

A rating guide for language use (Figure 2) was proposed based on the quantitative and qualitative results as well as the appropriate functioning of the expanded score levels for language use. Following the suggestions for developing a binary-question scale (Upshur & Turner, 1995), the development involved the same five steps described for delivery. First, the salient language use features from both the linguistic analysis and the verbal protocols for responses at score levels 2 and 2.5 were divided into the better and the poorer. Second, a criterion question was formulated to classify performances as "upper-half" or "lower-half." Because complex structures and advanced vocabulary could distinguish the better from the poorer, this question asks, "Does the speaker use complex structures and advanced vocabulary?" (question 1).

Figure 2. A proposed rating guide for language use. Like the delivery guide, it is a binary decision tree whose three levels of yes/no questions lead to score levels 4, 3.5, 3, 2.5, 2, 1.5, 1, and 0.

Level 1
1) Does the speaker use complex structures and advanced vocabulary? (Yes = upper-half; No = lower-half)

Level 2
2a) Upper-half: Does the speaker use vocabulary and grammar with minor or no errors?
2b) Lower-half: Is the speaker able to construct grammar at the sentence level?

Level 3
3a) Does the speaker have strong control and a wide range of complex structures and advanced vocabulary? (Yes = 4; No = 3.5)
3b) Does the speaker have a limited range of complex structures and vocabulary, but it does not interfere with communication? (Yes = 3; No = 2.5)
3c) Do the speaker's range and control of grammar and vocabulary lead to some accuracy of language use? (Yes = 2; No = 1.5)
3d) Do the speaker's range and control of grammar and vocabulary help him/her to produce isolated words or short utterances? (Yes = 1; No = 0)

Third, binary questions were developed, beginning with the upper-half. The salient features that were present in score levels 4 and 3.5, but not in 3 and 2.5, were selected to distinguish the speakers at these scores. Here, the ability to use vocabulary and grammar with accuracy was prevalent at the higher levels. Thus, the criterion question was posed, "Does the speaker use vocabulary and grammar with minor or no errors?" (question 2a). Moving to the third level of the rating guide, two criterion questions were formulated to discriminate score level 4 from 3.5 and 3 from 2.5. Question 3a is based on a feature that is outstanding in score 4 but not in 3.5; thus, the question asks, "Does the speaker have strong control and a wide range of complex structures and advanced vocabulary?" Similarly, question 3b used the salient feature found in score 3 but not in 2.5 to generate a criterion question, which was posed as "Does the speaker have a limited range of complex structures and vocabulary, but it does not interfere with communication?"

Steps 3 and 4 were subsequently repeated for the lower-half performances, and the lowest score level of the TOEFL scale (i.e., 0) was included to complete the scale. Question 2b was formulated using the salient feature that was found in score levels 2 and 1.5 but rare in 1 and 0; it asks whether the speaker is able to construct grammar at the sentence level. Two more criterion questions were then formulated to distinguish the speakers at score level 2 from 1.5 and at score 1 from 0. Based on the dominant feature in score 2 but not in 1.5, question 3c asks whether the speaker's range and control of grammar and vocabulary show some accuracy of language use. Question 3d addresses the feature that made score 1 more salient than score 0 (i.e., the ability to produce speech at the word level), asking whether or not the speaker's range and control of grammar and vocabulary help him/her to produce isolated words or short utterances.

Spoken Features for Topic Development

The linguistic analysis of the spoken responses revealed that only linking adverbials were likely predictors of the examinees' performance on topic development, and only in Task 3, not in Task 4. This finding corresponds to corpus-based evidence about the common use of linking adverbials in spoken language (Biber, Johansson, Leech, Conrad, & Finegan, 1999). It also shares similarities with Ejzenberg's (2000) study, in which speakers who received high speaking scores used a higher proportion of the cohesive devices that link and organize monologic talk; that is, high-level speakers were better able to use coordinating conjunctions and adverbial conjunctions to provide continuity in their speech. Even so, linking adverbials accounted for only a small proportion (7%) of the total variance in topic development performance, and reference devices showed no significant relationship or predictive ability in either task. This evidence indicates that the examinees rarely used cohesive devices to connect ideas in their spoken texts. The scarce use of cohesive devices found in the study might reflect the view among some linguists that coherence does not reside in the text but in the meaning of a message conveyed between the text and its listener or reader (e.g., Brown & Yule, 1983; Stoddard, 1991).

Because linking adverbials contributed little to the score variance for topic development, the raters' protocols were relied on most heavily. The information obtained from the raters showed that the outstanding features at the lower end (i.e., scores 1 to 2) included a lack of the relevant information required by the task, undeveloped ideas, poor connection of ideas and text coherence, and inaccuracy and vagueness of information. The salient features at the higher end (i.e., scores 2.5 to 4) involved better topic development in terms of the inclusion of more relevant information, clear connection and organization of ideas, and accuracy of information.
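Because linking adverbials form a closed class, counting them in a transcript is simple once a target list is fixed. The sketch below is illustrative only: the short list of items is an assumed subset of common linking adverbials, not the study's actual coding scheme, and a real analysis would normalize counts per c-unit or per 100 words.

```python
import re
from collections import Counter

# Assumed, illustrative subset of linking adverbials; a full scheme would be larger.
LINKING_ADVERBIALS = ["however", "therefore", "thus", "in addition", "moreover",
                      "furthermore", "for example", "on the other hand", "as a result"]

def count_linking_adverbials(transcript):
    """Count occurrences of each listed linking adverbial in a transcript."""
    text = re.sub(r"\s+", " ", transcript.lower())
    counts = Counter()
    for item in LINKING_ADVERBIALS:
        pattern = r"(?<!\w)" + re.escape(item) + r"(?!\w)"
        counts[item] = len(re.findall(pattern, text))
    return {item: n for item, n in counts.items() if n}

sample = ("The theory is abstract. For example, the professor gives two examples. "
          "Therefore, it is clearer.")
print(count_linking_adverbials(sample))   # -> {'therefore': 1, 'for example': 1}
```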

A few salient features were not explicitly included in the original TOEFL scale: speakers' comprehension of the stimulus or prompt at score 2, the inclusion of an introduction and conclusion at score 3.5, and speakers' ability to synthesize the prompt at score 4. The raters' attention to features beyond those in the scale signified either different interpretations of the scale description or attention drawn to other features related to the explicitly stated description. In particular, some raters might interpret "a clear progression of ideas" as the inclusion of introduction and conclusion sections in the spoken responses. It also might be unavoidable for some raters to refer to the examinees' ability to understand the prompt when they heard incorrect information in the spoken responses.

A Proposed Rating Guide for Topic Development

The proposed rating guide for topic development (Figure 3) was informed by the evidence from the qualitative data and the appropriate functioning of the expanded score levels. As with the rating guides for delivery and language use, the binary-question format and development steps of Upshur and Turner (1995) were applied, as follows.

Figure 3. A proposed rating guide for topic development. As in Figures 1 and 2, three levels of yes/no questions lead to score levels 4, 3.5, 3, 2.5, 2, 1.5, 1, and 0.

Level 1
1) Does the speaker produce most key ideas and relevant information required by the task? (Yes = upper-half; No = lower-half)

Level 2
2a) Upper-half: Does the speaker show clear connection and progression of ideas with introduction and conclusion?
2b) Lower-half: Does the speaker make some connections of ideas, though it is poor?

Level 3
3a) Is the response "fully" developed with complete ideas and enough detail? (Yes = 4; No = 3.5)
3b) Does the speaker produce some accurate ideas? (Yes = 3; No = 2.5)
3c) Does the speaker give some detailed information? (Yes = 2; No = 1.5)
3d) Does the speaker produce at least one idea although it is inaccurate, vague, irrelevant, or repetitive to the prompt? (Yes = 1; No = 0)


To create the first criterion question, the salient features from verbal protocols for responses at score levels 2 and 2.5 were selected to classify performances as “upper-half” or “lower-half.” The features that could generally distinguish between the better and poorer involved inclusion of key ideas and relevant information in the responses. The first criterion question was formulated to ask “Does the speaker produce key ideas and relevant information required by the task?” (question 1).

Subsequently, binary questions were developed, beginning with the upper-half. The salient features that were present in score levels 4 and 3.5, but not in 3 and 2.5, were selected to distinguish speakers at scores 4 and 3.5 from those at scores 3 and 2.5. At scores 3.5 and 4, the outstanding features were related to a clear connection of ideas, including the use of an introduction and conclusion in the responses. Question 2a therefore asks, "Does the speaker show clear connection and progression of ideas with introduction and conclusion?" At the third level of the rating guide, two criterion questions were formulated to discriminate score level 4 from 3.5 and 3 from 2.5. Question 3a was related to the feature that was outstanding in score 4 but not in 3.5: "Is the response 'fully' developed with complete ideas and enough detail?" Question 3b was generated using the salient feature found in score 3 but not in 2.5 and was posed as "Does the speaker produce some accurate ideas?"

Working with the lower-half performances, question 2b was formulated using the salient feature that was present in score levels 2 and 1.5 but rare in 1 and 0; it asks whether the speaker is able to produce a response with some connection of ideas despite its poor quality. Then two more criterion questions were formulated to distinguish the speakers at score level 2 from 1.5 and at score 1 from 0. Based on the dominant feature in score 2 but not in 1.5, question 3c asks, "Does the speaker give some detailed information?" Question 3d addresses the feature that differentiates the speakers at score 1 from those at score 0; it asks whether or not the speaker produces at least one idea, although it may be inaccurate, vague, irrelevant, or repetitive to the prompt.

To conclude, the expanded scale and its rating guides provide evidence for an empirically based approach to scale development for oral language assessment, one that can enhance the TOEFL iBT's goal of strengthening the link between the test and test takers' test preparation in contexts such as classrooms or school systems. Applying the rating guides for instructional and assessment purposes can become a springboard for future research on the washback of the TOEFL in language instruction worldwide. It also supports one of the TOEFL iBT's assumptions: promoting a positive influence on English language instruction.

Acknowledgements

I would like to express my gratitude to the English Language Institute at the University of Michigan for providing the funding for this study. I am also grateful for the funding and a TOEFL iBT data set provided by the Educational Testing Service (ETS).


References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Bachman, L. F., & Savignon, S. J. (1986). The evaluation of communicative language proficiency: A critique of the ACTFL oral interview. The Modern Language Journal, 70(3), 380–390.

Biber, D. (2001). Codes for counts in ETS program "tag count." Unpublished document.

Biber, D., Gray, B., & Poonpon, K. (2009). Measuring grammatical complexity in L2 writing development: Not so simple? Manuscript submitted for publication.

Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Essex: Pearson Education Limited.

Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-Academic-Purposes speaking tasks (TOEFL Monograph Series MS-29). Princeton, NJ: Educational Testing Service.

Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.

Brown, G., & Yule, G. (1983). Discourse analysis. Cambridge: Cambridge University Press.

Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working paper (TOEFL Monograph Series MS-20). Princeton, NJ: Educational Testing Service.

Bygate, M. (2002). Speaking. In R. B. Kaplan (Ed.), The Oxford handbook of applied linguistics (pp. 27–38). Oxford: Oxford University Press.

Clark, J. L. D., & Clifford, R. T. (1988). The FSI/ACTFL proficiency scales and testing techniques: Development, current status, and needed research. Studies in Second Language Acquisition, 10(2), 129–147.

Daller, H., Van Hout, R., & Treffers-Daller, J. (2003). Lexical richness in the spontaneous speech of bilinguals. Applied Linguistics, 24(2), 197–222.

Dandonoli, P., & Henning, G. (1990). An investigation of the construct validity of the ACTFL proficiency guidelines and oral interview procedure. Foreign Language Annals, 23, 131–151.

Derwing, T. M., & Munro, M. J. (1997). Accent, intelligibility, and comprehensibility. Studies in Second Language Acquisition, 20(1), 1–16.

Educational Testing Service. (2008). Online scoring network. Princeton, NJ: Author. Retrieved June 3, 2008, from http://learnosn.ets.org/

Ejzenberg, R. (2000). The juggling act of oral proficiency: A psycho-sociolinguistic metaphor. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 287–313). Ann Arbor: The University of Michigan Press.

Foster, P., & Skehan, P. (1996). The influence of planning and task type on second language performance. Studies in Second Language Acquisition, 18(3), 299–323.

Freed, B. F. (2000). Is fluency, like beauty, in the eyes (and ears) of the beholder? In H. Riggenbach (Ed.), Perspectives on fluency (pp. 243–265). Ann Arbor: The University of Michigan Press.

Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT Journal, 41, 287–291.

Fulcher, G. (1993). The construct validation of rating scales for oral tests in English as a foreign language. Unpublished PhD thesis, University of Lancaster, United Kingdom.

Fulcher, G. (1996a). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing, 13(2), 208–238.

Fulcher, G. (1996b). Invalidating validity claims for the ACTFL oral rating scales. System, 24, 163–172.

Fulcher, G. (1997). The testing of speaking in a second language. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education (pp. 75–85). Norwell, MA: Kluwer Academic Publishers.

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.

Iwashita, N., Brown, A., McNamara, T., & O'Hagan. (2007). Assessed levels of second language speaking proficiency: How distinct? Abstract retrieved January 10, 2008, from http://applij.oxfordjournals.org/cgi/content/abstract/amm017v1

Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000 framework: A working paper (TOEFL Monograph Series MS-16). Princeton, NJ: Educational Testing Service.

Kang, O. (2008). The effect of rater background characteristics on the rating of international teaching assistants' speaking proficiency. Spaan Fellow Working Papers, 6, 181–205.

Kormos, J., & Dénes, M. (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System, 32(2), 145–164.

Laufer, B., & Nation, I. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322.

Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language Learning, 40, 387–412.

Lowe, P., Jr. (1986). Proficiency: Panacea, framework, process? A reply to Kramsch, Schulz, and, particularly, to Bachman and Savignon. The Modern Language Journal, 70(3), 391–397.

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded source book. Thousand Oaks, CA: Sage.

Norrby, C., & Håkansson, G. (2007). The interaction of complexity and grammatical processability: The case of Swedish as a foreign language. International Review of Applied Linguistics, 45(1), 45–68.

North, B. (2000). Scale for rating language proficiency: Descriptive models, formulation styles, and presentation formats (TOEFL Research Paper). Princeton, NJ: Educational Testing Service.

Poonpon, K. (2007). FACETS and corpus-based analyses of a TOEFL-like speaking test to inform speaking scale revision. Unpublished manuscript.

Rimmer, W. (2006). Measuring grammatical complexity: The Gordian knot. Language Testing, 23(4), 497–519.

Rimmer, W. (2008). Putting grammatical complexity in context. Literacy, 42, 29–35.

Stansfield, C. W., & Kenyon, D. M. (1992). The development and validation of a simulated oral proficiency interview. The Modern Language Journal, 76(2), 130–141.

Stoddard, S. (1991). Text and texture: Patterns of cohesion. New Jersey: Ablex.

Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49, 3–12.

Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–116.

Vermeer, A. (2000). Coming to grips with lexical richness in spontaneous speech data. Language Testing, 17(1), 65–83.

Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 1, the baseline study (TOEFL Monograph Series MS-34). Princeton, NJ: Educational Testing Service.

Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, coping with change (TOEFL iBT Research Series RR-08-37). Princeton, NJ: Educational Testing Service.

Xi, X., & Mollaun, P. (2006). Investigating the utility of analytic scoring for the TOEFL Academic Speaking Test (TAST) (TOEFL iBT Research Report TOEFLiBT-01). Princeton, NJ: Educational Testing Service.


Spaan Fellow Working Papers in Second or Foreign Language Assessment, Volume 8: 95–116
Copyright © 2010, English Language Institute, University of Michigan
www.lsa.umich.edu/eli/research/spaan

Investigating Prompt Effects in Writing Performance Assessment

Gad S. Lim
Ateneo de Manila University
University of Michigan

ABSTRACT

Performance assessments have become the norm for evaluating language learners' writing abilities in international examinations of English proficiency. In these assessments, prompts are systematically varied for different test takers, raising the possibility of a prompt effect and affecting the validity, reliability, and fairness of these tests. This study uses data from the Michigan English Language Assessment Battery (MELAB), covering a period of over four years (n ratings = 29,831), to examine this issue. It uses the multi-facet extension of Rasch methodology to investigate the comparability of prompts that differ on topic domain, rhetorical task, prompt length, task constraint, expected grammatical person of response, and number of tasks. It also considers whether prompts are differentially difficult for test takers of different genders, language backgrounds, and proficiency levels. The results show that, on the whole, test takers' scores reflect ability in the construct being measured and are generally not affected by a range of prompt dimensions or test taker characteristics. It can be concluded that scores on this test and others whose particulars are like it have score validity.

Introduction

In international examinations of English language proficiency, performance assessment has become the norm in assessing the productive skills. Performance assessments require test takers to perform actual tasks that are similar or relevant to the knowledge, skill, or ability being measured, and success or failure is typically judged by human raters (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Kane, Crooks, & Cohen, 1999; McNamara, 1996). In assessments of second language writing, performance assessment has taken the modal form of the timed, impromptu writing test (Weigle, 2002). The use of performance assessment is in keeping with communicative approaches and conceptions of language ability, and compared to discrete item and indirect tests, these tests are seen as possessing greater theoretical and construct validity (Kane et al., 1999; Linn, Baker, & Dunbar, 1991; Moss, 1992). In addition, they are thought to have the added value of providing positive washback (Miller & Legg, 1993).


However, there are also challenges associated with the use of performance assessments. Because performance assessments tend to require more time, examinees are typically tested on one or two tasks and evaluated on the basis of these limited samples. It is unclear whether performance on a small number of tasks is sufficient for representing a domain as apparently complex and multi-faceted as writing ability. That is, there is the risk of construct underrepresentation (Messick, 1989, 1994, 1996). Additionally, test takers are usually given one or two prompts from a larger pool of prompts. It is difficult to imagine that any two prompts will be completely comparable in every way, whether in and of themselves, or in interaction with different test-taker background characteristics. How comparable are the performances of a test taker who responds to one prompt and another test taker who responds to another prompt? In other words, there is also the risk of construct-irrelevant variance (Messick, 1989, 1994, 1996) or what Jennings, Fox, Graves, and Shohamy (1992) have called a “prompt effect.” These issues do not just raise questions about validity and reliability; perhaps more importantly, they raise questions of fairness (Kunnan, 2000), which examination providers must address.

This study aims to address those questions to a certain extent. Looking at one examination of English language proficiency, the Michigan English Language Assessment Battery (MELAB), it investigates the characteristics that might cause prompts not to be comparable and determines whether prompt effects indeed exist.

Literature Review

As in all language use, responding to prompts requires topic knowledge. Where prompts are concerned, the usual approach of language proficiency exams is to use topics that all test takers are expected to know, and perhaps to give them a small selection of such topics (Bachman & Palmer, 1996). However, the question of relative prompt difficulty remains, and what makes a prompt easy or difficult still eludes test takers and test makers alike (cf. Chiste & O'Shea, 1988; Dobson, Spaan, & Yamashiro, 2003; Freedman, 1983; Hamp-Lyons & Mathias, 1994; Powers & Fowles, 1998). A number of features (e.g., subject matter, rhetorical specification) have been identified that may contribute to prompts being easier or more difficult, and test taker characteristics such as gender and language background may interact with these features. These are now discussed.

Subject Matter

First is subject matter or topic domain. While the topics used in exams are presumed to be familiar to all test takers, it remains that some test takers may have more expertise in a particular subject (e.g. medical professionals asked to talk about doctors) and thus have an advantage over other test takers. In Polio and Glew’s (1996) study on how students choose writing topics, the most often-cited reason was having background knowledge and perceived familiarity with the topic. These were also the reasons cited for choosing a topic in Powers and Fowles (1998).

However, that test takers are more familiar with a topic does not necessarily mean that they will perform better on it. Test takers in Powers and Fowles (1998) did no better on topics they preferred. When the English Language Testing System was being revised, the plan to divide test takers into six discipline areas was abandoned when it was found that there were no systematic differences in test takers' performances when responding to general and field-specific prompts (Hamp-Lyons, 1990). On the other hand, Tedick (1990) reports that ESL graduate students did better on topics specific to their field than on general topics. The prompts used in that study might be worth looking into, however. The general prompt is provided first, followed by the field-specific prompt:

In a recent news magazine, a famous educator argued that progress makes us lazy. Do you agree or disagree with this point of view? Explain why you believe that progress does or does not cause people to become more lazy or passive. Support your answer with specific reasons and examples.

Every field of study has controversial issues. Debate over these issues often occurs among professionals in the field and leads them to conduct research in order to look for evidence to support one position on the issue over another or others. Choose a current controversial issue in your [italics in original] field of study. Discuss the controversy and explain your position on the issue, being sure to provide examples to support your opinion. (p. 127)

The general prompt is on a subject people can probably write about even if they have not necessarily thought about it; in that way, it appears to fairly represent prompts such as are found in standardized writing assessments. However, the topic is constrained in that one can only write about progress and laziness and nothing else. The "specific" prompt, ironically, is the more general prompt. The field-specific prompt is virtually unconstrained, leaving respondents plenty of leeway in choosing what to write about. That the topic is controversial means that there are already two or more fairly well-sketched-out positions on the matter. It is not difficult to imagine that people will have more to say about the latter than the former. Add the fact that the subjects in this study are graduate students, who are steeped in their particular fields, and significant findings are clearly not a surprise. From this study, a possible prompt factor emerges: prompts that allow one to respond only in a specific way (e.g., Do you agree or disagree regarding x?) versus those that allow multiple possibilities (e.g., Give an example of y.). These can perhaps be called constrained and unconstrained prompts. Two possible prompt-related factors have thus been identified here: one is topic domain; the other, task constraint.

Rhetorical Task

Studies on the type of writing called for in a prompt have by and large compared personal versus impersonal writing, or narrative versus argumentative writing. A number of studies have investigated performance on prompts that invited a personal, first person response versus those that called for impersonal, third person responses (Brossell & Ash, 1984; Greenberg, 1981; Hoetker & Brossell, 1989). These studies found no significant differences, though this lack of finding can perhaps be attributed to the cues being so subtle that test takers were not likely to pick up on them. Here, for example, are the sample prompts for personal and impersonal from Greenberg (1981, p. 94-95):

In most American colleges, students must pass required courses in English, math, and science before they are allowed to take courses in their major areas of study. Instead of making all students attend all of their required courses, colleges should offer more independent study programs in which students could complete some of their courses on their own, working at their own pace. Do you agree or disagree with this statement? In an essay of about 300 words, explain and illustrate your answer in detail.

In most American colleges, students must pass required courses in English, math, and science before they are allowed to take courses in their major area of study. Instead of making all of you attend all your required courses, colleges should offer you more independent study programs in which you could complete some of these courses on your own, working at your own pace. Do you agree or disagree with this statement? In an essay of about 300 words, explain and illustrate your answer in detail.

It can be seen from the above that the differences between the two prompts are difficult to spot. However, in the case of Hoetker and Brossell (1989), while there was no difference in the scores of compositions written in response to personal and impersonal prompts, the prompt did influence whether test takers wrote in the first or third person, and a separate ANOVA showed that raters gave significantly higher scores to first-person essays than to third-person essays.

Other studies have focused on rhetorical task (Hamp-Lyons & Mathias, 1994; Hinkel, 2002; Quellmalz, Capell, & Chou, 1982; Spaan, 1993; Wiseman, 2009). These studies have found, contrary to the expectations of experts, that test takers did better on argumentative tasks than on narrative tasks. Quellmalz et al. (1982), in a well-controlled multi-trait, multi-method study of eleventh- and twelfth-grade writers, found that students received significantly lower scores on narrative prompts than on expository prompts. Wiseman (2009) looked at a college writing placement test and had the same findings. Similarly, Hamp-Lyons and Mathias (1994) found that argumentative/public compositions were scored higher than expository (narrative/descriptive)/private compositions in their sample of MELAB test takers. The one exception is Spaan (1993), who found that test takers performed better on narrative/personal prompts, though she offers that this might have been brought about by one of the argumentative/impersonal prompts being inaccessible to test takers: "What is your opinion of mercenary soldiers (those who are hired to fight for a country other than their own)? Discuss." (p. 101). It should also be noted that performing "better" in this case meant a difference so small on average that individual test-takers' final scores would have been the same.

Task Specification

The way prompts are specified has received some amount of attention. A number of studies have looked into the amount of information provided in the prompt. Kroll and Reid (1994) divide prompts into three categories: bare prompts, framed prompts, and text-based or reading-based prompts. The first is stated in relatively direct and simple terms (e.g., Do you favor or oppose x? Why?); the second presents a situation or circumstance, and the task is in reference to this; and the third has test takers read texts of some length and then interpret, react to, or apply the information in those readings. For his part, Brossell (1983) divides the first two categories into prompts that have low, moderate, and high information load. Brossell found that a medium level of specification resulted in longer essays and higher scores, though differences were not significant overall. In O'Loughlin and Wigglesworth (2007), tasks with less information elicited more complex language, but this difference in production did not affect scores.

Test takers do consider the generality and specificity of prompts in their decision-making when allowed to choose (Polio & Glew, 1996; Powers & Fowles, 1998), and have also been shown to prefer shorter prompts (Chiste & O'Shea, 1988). This has not been to their advantage, though:

Shorter, simple declarative sentences may appeal in their brevity but ultimately offer less insight into an essay's development and structure. Longer topic sentences… provide more direction even as they frighten away the less able student. (Gee, 1985, p. 84, qtd. in Chiste & O'Shea, 1988)

The consensus appears to be that a medium level of specification is ideal. Underspecified prompts require time and effort to narrow down, whereas very long prompts cause test takers to rely heavily on language and ideas in the prompt. A medium level of specification helps test takers focus without overloading them with information (Brossell, 1983; Lewkowicz, 1997).

Another approach to classifying prompt specification is by the number of tasks the test taker is asked to complete. Kroll and Reid (1994) provide this example prompt which, by their reckoning, asks the test taker to do 13 different things:

Some students believe that schools should only offer academic courses. Other students think that schools should offer classes in cultural enrichment and opportunities for sports activities as well as academic courses. Compare and contrast the advantages and disadvantages of attending a school that provides every type of class for students. Which of these types of school do you prefer? Give reasons and examples to support your choice. (p. 238)

The 13 tasks in the prompt are identified as follows: identify the advantages and disadvantages of (1, 2) each choice (3, 4); compare and contrast (5, 6) the advantages and disadvantages of (7, 8) each choice (9, 10); choose one of the choices (11) and give reasons and examples for the choice (12, 13). The claim here is that the larger the number of tasks required, the more difficult a prompt would be. However, this might not in fact be the case, as there is some evidence that both examinees and raters do not pay very much attention to whether all tasks in a given prompt are fulfilled, thereby rendering it a non-factor (Connor & Carrell, 1993).

Test-Taker Characteristics

Investigations of test-taker characteristics that could interact with prompt-related factors have focused on gender, language background, and proficiency level. Where test-taker gender is concerned, Breland, Bridgeman, and Fowles (1999), Breland, Lee, Najarian, and Muraki (2004), and Broer, Lee, Rizavi, and Powers (2005) have found instances of differential item functioning (DIF) in favor of female test takers in six different performance writing tests, to a magnitude of up to 0.2 of a standard deviation. The authors caution, though, that the direction and size of the differences are highly sensitive to sample selection, and that the findings should not be generalized beyond the exams studied.


Studies have also considered the different production of writers from different language backgrounds on different tasks (Park, 1988; Reid, 1990). Reid, for example, studied the performance of writers whose first languages were Arabic, Chinese, English, or Spanish on a comparison-and-contrast task and on a graph/data commentary task. She found that writers from three of the four language backgrounds, the exception being the Spanish group, showed greater production on the graph task. There was also greater use of the passive voice in the comparison-and-contrast task by Arabic and Chinese writers, but not by English and Spanish writers. In Park's study, differences in production were found according to language background and area of academic specialization.

A number of studies have also investigated the relationship between prompt and language background (Breland, et al., 1999; Broer, et al., 2005). The study by Breland, et al. compared ESL Hispanics and Asian Americans to White Americans, and found that the prompts favored the latter by 0.72 to 0.76 standard deviation units. The Broer, et al. study found a moderate-sized difference in favor of those whose strongest language was English. Finally, Lee, Breland, and Muraki (2004) compared test takers with Indo-European and East Asian first languages. That is, where the comparison groups in other studies have included people for whom English is a first language, this study compared two groups of non-native English writers. Small uniform and non-uniform DIF was found for a minority of prompts, but on the whole, the differences between the two groups were largely attributable to differences in English language ability, which is to say that the prompts show not item bias but item impact (Clauser & Mazor, 1998; Penfield & Lam, 2000; Zumbo, 1999); differential probabilities of success are likely because test takers actually differ in the ability of interest. In general, taking language background as a factor, there is a notable difference in findings depending on the comparison group; DIF is more likely to show up when test takers for whom English is a first language are included.

A test taker's language ability might also partially determine whether prompts are or are not a factor in writing assessment. Studies that have considered this interaction are unanimous in showing that prompts are more of a factor among test takers at lower proficiency levels. In Spaan's (1993) study, subjects were divided into beginning, intermediate, and advanced levels according to their reading and listening scores on the MELAB. While tests for significance were not conducted, beginners' scores on the narrative/personal and argumentative/impersonal prompts differed by 1.71 points, a difference that narrowed to 0.78 among intermediate-level test takers and was further reduced to 0.03 for the advanced group. (It might also be worth noting that the former two groups received higher scores on the narrative/personal prompts, whereas the opposite was true for advanced learners.) Lee, et al. (2004), who compared test takers from Indo-European and East Asian language backgrounds, found that, where non-uniform DIF existed, language group membership had effects at low levels of language proficiency but not at higher levels. They attribute this finding to lower-level test takers being more likely to resort to their first languages, which of course differ from English to different degrees.


Research Questions

In light of the literature, the following research questions can be asked:

1. To what extent can it be shown that there is no prompt effect related to topic domain, rhetorical task, prompt length, task constraint, expected grammatical person of response, or number of tasks?

2. To what extent are writing prompts not differentially difficult for test takers of different genders, language backgrounds, and proficiency levels?

Method

The Test

The MELAB is an advanced-level English proficiency test for adults who use English as a second or foreign language, and who use the scores for various academic and professional purposes. The test includes sections assessing each of the four language skill areas. In the writing section, examinees are given 30 minutes to compose a handwritten composition on one of two prompts, which test takers do not see in advance. Each composition is scored using a holistic, 10-point scale by at least two raters. If the two ratings differ by more than one scale point, a third rater adjudicates. The final score is the average of the ratings that are either equal or different by one scale point (English Language Institute, 2005). Examinees are allowed to request a rescore if they feel that the score they received is inaccurate; thus, there are potentially up to six ratings for each composition.
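To make the scoring rule concrete, the following is a minimal sketch of the adjudication logic just described. It works with scale-point indices rather than reported scores, the tie-breaking when a third rating falls within one scale point of both initial ratings is an assumption, and it is not the ELI's operational scoring code.

```python
def final_writing_score(ratings):
    """Sketch of the MELAB writing adjudication rule described above.

    `ratings` holds scale-point indices (0-9) in the order assigned: two
    initial ratings, plus a third only if the first two differ by more than
    one scale point. Returns the average of a pair of ratings that are equal
    or differ by one scale point, or None if no such pair exists.
    """
    r1, r2 = ratings[0], ratings[1]
    if abs(r1 - r2) <= 1:
        return (r1 + r2) / 2
    # A third rating adjudicates; pair it with whichever initial rating it
    # falls within one scale point of (the tie-breaking here is an assumption).
    r3 = ratings[2]
    for r in (r1, r2):
        if abs(r3 - r) <= 1:
            return (r3 + r) / 2
    return None  # no agreeing pair; such cases would need further review
```

For example, final_writing_score([7, 6]) returns 6.5, while final_writing_score([7, 4, 5]) pairs the third rating with the initial rating it agrees with and returns 4.5.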

The Prompts

The study's data include 60 different prompts. They range in length from 12 to 82 words, with a mean of 38.47 (Table 1), and from one to five sentences.

Table 1. Length of MELAB Writing Prompts

            Mean    SD      Min   Max
Words       38.47   14.72   12    82
Sentences   3.17    0.98    1     5

Unlike length, the other prompt dimensions that the study is concerned with—topic domain, rhetorical task, task constraint, expected grammatical person of response, and number of tasks—cannot be arrived at by mere counting. These dimensions were independently coded according to the categories in Table 2 by two testing professionals with expertise in writing assessment. The categories for topic domain are those used internally by the ELI, while the categories for the other dimensions came out of the literature.


Table 2. Prompt Coding Categories

Dimension                         Categories
Topic Domain                      Business, Education, Personal, Social
Rhetorical Task                   Argumentative, Expository, Narrative
Task Constraint                   Constrained, Unconstrained
Grammatical Person of Response    First Person, Third Person
Number of Tasks                   1, 2, 3, ..., n

After initial coding, the two coders met for a reconciliation meeting to agree on a common code in instances where they disagreed. They also chose to leave certain "disagreements" as they were, rather than force an agreement that might misrepresent the nature of those prompts. Their agreement rates before and after the meeting are given in Table 3.

Table 3. Prompt Coding Agreement Rates, Percentages

                 Topic     Rhetorical   Task         Grammatical   Number
                 Domain    Task         Constraint   Person        of Tasks
Initial          92        83           75           85            85
After meeting    95        95           87           95            95

The Test Takers

The study's data include all test takers who took the MELAB between October 2003 and February 2008, and all the ratings assigned to their compositions, minus those with missing data. The resulting sample included 29,831 ratings for 10,536 test takers. Those who took the MELAB in this time period were between 14 and 80 years old, with an average age of just under 29 years (SD = 11.1). Female test takers accounted for 57.29% of the sample. The test takers came from more than 115 different first-language backgrounds. However, languages represented by fewer than 10 test takers were recoded under "other" categories by region, leaving 59 first languages. Those languages and language groups accounting for at least one percent of the total sample size are given in Table 4. (Language group refers to languages that have multiple dialects; e.g., Amoy, Cantonese, Hakka, and Mandarin were all coded under "Chinese.")


Table 4. Well-Represented First-Language Backgrounds

Language      Number
Chinese       2248
Filipino      1259
Arabic        714
Farsi         670
Korean        542
English       438
Spanish       434
Punjabi       394
Russian       388
Urdu          372
Hindi         268
Romanian      222
Malayalam     173
Somali        164
Japanese      153
Gujarati      139
Bengali       120
Vietnamese    120
Portuguese    113
German        110

It should be noted that there are a number of test takers whose first language is English, and for whom the test is not designed. Johnson and Lim (2009) showed that the only effect of including these test takers is an underestimation of English first-language test-takers' abilities. Estimates for all others are not significantly affected. Given those findings, the study chose to include English first-language test takers, with the caveat that findings related to those test takers be interpreted with appropriate caution.

Data Analysis

To analyze the data, this study employed multi-facet Rasch measurement (Linacre, 1989, 2006), which models different elements of interest and puts them on a common, interval scale, thus facilitating meaningful comparisons among elements. The model can account for rater effects, thereby providing accurate estimates for prompts. In addition, bias analysis can also be performed, making the approach ideal for this study's purposes.
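For reference, the many-facet Rasch model that FACETS implements can be written in its basic rating-scale form as follows; the notation is the conventional one (following Linacre, 1989) rather than anything specific to this paper, and the study's full model simply adds gender, first-language, and proficiency-level facets as further additive terms:

\[
\log\!\left(\frac{P_{npjk}}{P_{npj(k-1)}}\right) = B_n - D_p - C_j - F_k
\]

where \(B_n\) is the ability of test taker \(n\), \(D_p\) the difficulty of prompt \(p\), \(C_j\) the severity of rater \(j\), \(F_k\) the difficulty of rating-scale category \(k\) relative to category \(k-1\), and \(P_{npjk}\) the probability of test taker \(n\) receiving category \(k\) from rater \(j\) in response to prompt \(p\).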

In doing multi-facet Rasch analysis, it is important that the data be connected and that there be no "disjoint subsets." Earlier it was noted that the MELAB writing test asks test takers to choose between two prompts and to respond to just one. This creates a problem with connectedness. If each person responds to only one prompt, it is impossible to tell whether any differences observed are due to the prompt or to some characteristic of the persons who were assigned, or who chose, that particular prompt.
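The connectedness requirement can also be checked mechanically before estimation: treat test takers, prompts, and raters as nodes, treat each rating as an edge linking them, and count connected components. The sketch below does this with a simple union-find; the tuple layout of the ratings is an assumption made purely for illustration.

```python
def count_disjoint_subsets(ratings):
    """Count disjoint subsets in the rating data.

    `ratings` is an iterable of (test_taker_id, prompt_id, rater_id) tuples,
    one per rating (an assumed layout). A count greater than 1 means the
    facets cannot all be placed on a single common scale.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for person, prompt, rater in ratings:
        union(("person", person), ("prompt", prompt))
        union(("person", person), ("rater", rater))

    return len({find(node) for node in parent})
```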

The approach taken by other studies to solving this problem is by creating matching variables—usually some overall language ability variable based on test-takers' scores in other skill areas—and then matching different test takers according to their similarity in that regard (e.g., Breland, Lee, Najarian, & Muraki, 2004; Broer, Lee, Rizavi, & Powers, 2005; Lee, Breland, & Muraki, 2004). This is arguably an imperfect solution, as it requires making certain assumptions regarding the relationship between writing and other skills. Additionally, identical overall scores can mask differing skill profiles.

The data used in this study permitted an approach that did not have to make such strong assumptions. The data include a large number of test takers who took the MELAB more than once. Thus, in this study, test takers were matched according to similarities in test scores and the fact that those being matched were in fact the same person. Elapsed time between test sittings provided an additional control; the less time between sittings, the less likely a person's ability has changed. Taken together, these controls provide greater confidence that the matches made are warranted. A procedure was followed that maximized stringency while minimizing the number of matches required. In the end, only 214 matches were required for data connection to be achieved. Full details of the matching procedure can be found in Lim (2009).
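Purely to illustrate the general idea described above, a matching pass over repeat test takers might look like the sketch below; the field names, the 90-day window, and the one-scale-point score tolerance are assumptions for illustration, not the study's actual criteria, which are documented in Lim (2009).

```python
from datetime import timedelta

def candidate_matches(sittings, max_gap=timedelta(days=90), max_score_diff=1):
    """Illustrative sketch: link consecutive sittings by the same test taker
    that are close in time, similar in writing score, and answered different
    prompts, so that the prompts involved become connected in the design.
    Thresholds and field names are assumed for illustration only.
    """
    by_person = {}
    for s in sittings:  # each s: {"person_id", "date", "prompt_id", "writing_score"}
        by_person.setdefault(s["person_id"], []).append(s)

    matches = []
    for rows in by_person.values():
        rows.sort(key=lambda r: r["date"])
        for first, second in zip(rows, rows[1:]):
            if (second["date"] - first["date"] <= max_gap
                    and abs(second["writing_score"] - first["writing_score"]) <= max_score_diff
                    and first["prompt_id"] != second["prompt_id"]):
                matches.append((first, second))
    return matches
```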

The software FACETS (Linacre, 2006) was used to perform the multi-facet Rasch analysis. To fit the requirements of the software, the ratings—which on the original ten-point scale ranged from 53 to 97—were converted to a 0 to 9 scale, where 0 = 53 and 9 = 97. A model was specified which included the following facets: test taker, gender, first language, proficiency level, prompt, and rater. Proficiency level was a dummy variable anchored to zero. Bias analyses were also requested for prompt and gender, prompt and language background, and prompt and proficiency level.
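The recoding of the reported scale onto the 0 to 9 categories that FACETS expects is a straightforward lookup. Rather than hard-coding the intermediate scale values (only the endpoints, 53 and 97, are given here), the sketch below ranks the distinct scale points observed in the data:

```python
def to_facets_categories(ratings):
    """Map reported MELAB writing scale points onto 0-9 for FACETS.

    Ranks the distinct scale points actually present in `ratings`, so that
    the lowest (53) becomes 0 and the highest (97) becomes 9.
    """
    scale_points = sorted(set(ratings))
    if len(scale_points) != 10:
        raise ValueError("expected a ten-point rating scale")
    recode = {value: index for index, value in enumerate(scale_points)}
    return [recode[r] for r in ratings]
```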

To answer the first research question, the comparability of prompts was evaluated, prima facie, by looking at the prompt measurement report, which provides a variety of statistics regarding the prompts, individually and as a whole. Then, the fair measure averages for all 60 prompts were entered into SPSS 16.0 for Windows, along with their codes for the six prompt dimensions being investigated. Cases where coders chose not to agree were excluded. Separate analyses of variance (ANOVA) were then conducted for each of the six prompt dimensions. For each ANOVA, the categories within a dimension were the independent variables, and the fair measure averages were the dependent variables. The results of the F-test and the associated p-values for each ANOVA were examined for significant outcomes. Where significant outcomes were found, Levene's test for homogeneity of variances and Tukey's HSD post hoc test were used to see which categories were significantly different from each other.
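The same sequence of analyses can be reproduced outside SPSS with standard statistical libraries. The sketch below uses SciPy and statsmodels; the data-frame column names are assumptions made for illustration.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def prompt_dimension_anova(prompts: pd.DataFrame, dimension: str):
    """One-way ANOVA of prompt fair averages across the categories of one
    prompt dimension, mirroring the analyses described above. Assumes a
    'fair_average' column and a coded dimension column (e.g. 'topic_domain').
    """
    groups = [g["fair_average"].values for _, g in prompts.groupby(dimension)]
    f_stat, p_value = stats.f_oneway(*groups)
    # Levene's test for homogeneity of variances, then Tukey's HSD post hoc
    # comparisons; both are only of interest when the omnibus F is significant.
    _, levene_p = stats.levene(*groups)
    tukey = pairwise_tukeyhsd(prompts["fair_average"], prompts[dimension])
    return f_stat, p_value, levene_p, tukey
```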

The bias analyses from FACETS were examined to answer the second research question. In the output, the chi-square test examines the null hypothesis that all the combinations (e.g., of a particular prompt and a particular gender) are equal in difficulty. If the null hypothesis had to be rejected, and interaction effects were indeed present, the results were examined for appropriately measured values that were also significant – that is, those with absolute z-scores greater than 1.96 and infit mean square values within the acceptable range. The differences between observed and expected scale point averages for those combinations were then examined to find out the direction and magnitude of the bias.
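Screening the FACETS bias output for terms that are both significant and acceptably measured amounts to a simple filter. In the sketch below, the dictionary keys are assumptions about how the report has been parsed, and the 0.5 to 1.5 infit bounds are a commonly used rule of thumb rather than the study's stated criterion.

```python
def flag_bias_terms(bias_terms, z_crit=1.96, infit_bounds=(0.5, 1.5)):
    """Return bias/interaction terms that are significant and well measured.

    `bias_terms` is assumed to be an iterable of dicts parsed from the FACETS
    bias report, each with 'z_score' and 'infit_mnsq' keys.
    """
    low, high = infit_bounds
    return [term for term in bias_terms
            if abs(term["z_score"]) > z_crit and low <= term["infit_mnsq"] <= high]
```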

Results and Discussion

To gauge the comparability of prompts, the difficulty parameters for the prompts are considered. The estimates are provided in Table 5, arranged in order of difficulty from the easiest to the most difficult. The separation index for this set of prompts was 5.85, with a reliability of .97, indicating that the prompts can be reliably separated into at least five different levels of difficulty. The fixed chi-square test had a p-value of .00; that is to say, the null hypothesis that the prompts are equal in difficulty must be rejected. The prompts ranged in difficulty from -0.96 to 1.82, or a range of 2.78 logits. In terms of the original scale, the fair average score for the most difficult prompt was 4.36, and 5.13 for the easiest prompt.
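For readers unfamiliar with these indices, separation and reliability in the FACETS output are related by the usual Rasch formulas, so the two reported values can be checked against each other:

\[
G = \frac{\text{Adj. S.D.}}{\text{RMSE}}, \qquad R = \frac{G^2}{1 + G^2}
\]

With the values at the bottom of Table 5 (Adj. S.D. = .56, RMSE = .10), G is roughly 5.9, and R = 5.9^2 / (1 + 5.9^2) ≈ .97, consistent with the reported reliability.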

Figure 1. Range of Prompt Estimates, Arranged According to Severity

While the prompts significantly differ in difficulty, the real question is whether these significant differences are also meaningful. Figure 1 shows the difficulty measures of the prompts, accounting for standard error. It can be seen that Prompt 34 is a clear outlier, more than three standard deviations from the mean. The difficulty parameter of this prompt, allowing for standard error, is somewhere in the range of 1.74 to 2.00, whereas the range for the next most difficult prompt, Prompt 25, is between 0.96 and 1.10. In Figure 1, it is clearly seen that there is no overlap between the possible true parameter estimates for these two prompts, and thus they can unambiguously be separated into different difficulty levels. If just this one outlier prompt were removed, the number of levels into which the prompts can be divided would immediately be reduced from five to four. In terms of logits, the range between the easiest and most difficult prompt would be reduced by almost a third, from 2.78 to 1.99. If the next most difficult prompts were also excluded—say, Prompts 25 and 33—the range between the easiest and most difficult prompts would be further reduced to just 1.83 logits.

[Figure 1 plots each prompt's difficulty estimate in logits (vertical axis) by prompt number (horizontal axis), arranged from the easiest to the most difficult prompt.]


Table 5. Prompt Measurement Report

-----------------------------------------------------

|Prompt| n Obsvd Fair |Measure S.E.| Infit |

| | Ave. Ave. | | MnSq ZStd |

|---------------------------------------------------|

| 9 | 321 5.3 5.13 | -.96 .10 | .8 -3 |

| 5 | 403 5.2 5.12 | -.95 .09 | .9 -1 |

| 6 | 359 5.3 5.12 | -.94 .10 | .9 -1 |

| 12 | 475 5.1 5.12 | -.94 .09 | .9 0 |

| 20 | 546 5.3 5.12 | -.92 .08 | .8 -3 |

| 26 | 277 5.3 5.11 | -.91 .11 | 1.0 0 |

| 11 | 620 5.1 5.07 | -.73 .08 | .7 -4 |

| 54 | 326 5.0 5.04 | -.63 .11 | .7 -4 |

| 49 | 682 5.0 5.03 | -.58 .07 | .8 -4 |

| 4 | 333 4.9 5.02 | -.56 .11 | .7 -4 |

| 57 | 201 5.4 5.01 | -.52 .13 | .9 -1 |

| 7 | 292 5.2 5.01 | -.49 .11 | .9 -1 |

| 3 | 343 4.9 4.99 | -.44 .10 | .7 -3 |

| 19 | 601 5.0 4.99 | -.42 .08 | .7 -4 |

| 1 | 149 5.0 4.98 | -.39 .16 | .8 -1 |

| 59 | 516 5.0 4.97 | -.34 .08 | .9 -1 |

| 2 | 243 5.0 4.96 | -.33 .12 | 1.0 0 |

| 55 | 1002 4.9 4.96 | -.32 .06 | .9 -2 |

| 43 | 330 4.8 4.95 | -.29 .10 | 1.2 2 |

| 42 | 148 5.2 4.95 | -.27 .15 | 1.0 0 |

| 8 | 729 4.9 4.95 | -.26 .07 | 1.0 0 |

| 53 | 495 4.9 4.93 | -.21 .09 | .6 -6 |

| 52 | 846 4.9 4.93 | -.19 .07 | .9 -2 |

| 40 | 427 4.8 4.92 | -.16 .09 | .8 -3 |

| 14 | 265 5.0 4.91 | -.14 .12 | 1.0 0 |

| 39 | 459 4.9 4.91 | -.13 .09 | .9 -1 |

| 17 | 691 5.0 4.91 | -.13 .07 | .8 -4 |

| 45 | 318 5.0 4.90 | -.10 .11 | 1.0 0 |

| 27 | 438 4.8 4.90 | -.07 .09 | .9 -1 |

| 36 | 812 4.8 4.89 | -.06 .07 | .7 -5 |

| 41 | 302 4.7 4.89 | -.03 .11 | .7 -4 |

| 56 | 260 4.8 4.89 | -.03 .12 | .7 -3 |

| 48 | 975 4.9 4.88 | -.02 .06 | .8 -4 |

| 50 | 797 4.9 4.87 | .03 .07 | .8 -4 |

-----------------------------------------------------

-----------------------------------------------------

|Prompt| n Obsvd Fair |Measure S.E.| Infit |

| | Ave. Ave. | | MnSq ZStd |

|---------------------------------------------------|

| 44 | 623 4.8 4.87 | .03 .08 | .8 -4 |

| 30 | 518 4.9 4.86 | .05 .08 | .9 -1 |

| 46 | 578 4.8 4.86 | .09 .08 | .8 -4 |

| 22 | 493 4.8 4.85 | .09 .09 | 1.0 0 |

| 10 | 264 4.8 4.85 | .11 .12 | .9 0 |

| 47 | 529 4.7 4.85 | .12 .08 | .8 -3 |

| 23 | 695 4.8 4.84 | .15 .07 | 1.0 0 |

| 51 | 427 4.8 4.84 | .16 .09 | 1.0 0 |

| 21 | 330 4.7 4.79 | .34 .11 | 1.0 0 |

| 60 | 105 5.2 4.79 | .34 .19 | 1.2 1 |

| 15 | 732 4.6 4.77 | .39 .07 | .8 -3 |

| 18 | 833 4.6 4.77 | .40 .07 | .9 -1 |

| 29 | 700 4.8 4.77 | .41 .07 | .9 -2 |

| 16 | 328 4.7 4.74 | .52 .11 | .6 -5 |

| 38 | 508 4.6 4.72 | .59 .09 | .9 0 |

| 28 | 232 4.6 4.72 | .59 .13 | .8 -1 |

| 24 | 486 4.8 4.71 | .62 .09 | 1.0 0 |

| 37 | 764 4.5 4.70 | .66 .07 | .8 -5 |

| 58 | 448 4.5 4.67 | .76 .09 | .9 -1 |

| 32 | 679 4.4 4.67 | .76 .07 | .9 -2 |

| 35 | 453 4.5 4.66 | .79 .09 | .7 -4 |

| 31 | 433 4.6 4.66 | .80 .09 | 1.0 0 |

| 13 | 404 4.5 4.64 | .87 .10 | .8 -3 |

| 33 | 633 4.4 4.61 | .96 .08 | 1.0 0 |

| 25 | 863 4.4 4.59 | 1.03 .07 | .7 -5 |

| 34 | 620 4.0 4.36 | 1.82 .08 | .8 -2 |

|---------------------------------------------------|

| Mean | 494.3 4.9 4.87 | .00 .09 | .9 -2.4 |

| S.D. | 211.8 .3 .15 | .57 .02 | .1 2.1 |

|---------------------------------------------------|

| RMSE (Model) .10 Adj S.D. .56 |

| Separation 5.87 Reliability .97 |

| Fixed chi-square: 2674.0 d.f.: 59 |

| significance: .00 |

-----------------------------------------------------


Assuming that these three prompts (25, 33, and 34) were excluded, what is the practical effect of the easiest and most difficult prompt differing by 1.83 logits? As was previously mentioned, multi-facet Rasch makes meaningful comparisons between different facets possible, as the rating scale has also been expressed in terms of the same logit scale. In the case of this analysis, the average range covered by each scale point is 3.88 logits. On average, an advantage of 1.94 logits (50% of a scale point) would be necessary for a composition to be rounded up to the next higher score. Thus, if the three outlier prompts were excluded from the pool, even if the remaining prompts represent four different levels of difficulty, on average, the difference between the easiest and most difficult prompt—1.83 logits—would have no practical effect on the score a person receives.
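The rounding argument can be made explicit with the figures just given:

\[
\frac{3.88}{2} = 1.94 \text{ logits} > 1.83 \text{ logits},
\]

so the spread in prompt difficulty that remains after removing the three outlier prompts (1.83 logits) falls short of the average advantage (half a scale point, or 1.94 logits) needed to push a composition up to the next scale point.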

The above discussion can be restated in terms of the original scale. Including all 60 prompts, the difference between the easiest and the most difficult prompt is 5.13 - 4.36 = 0.77 points, or about three-quarters of a scale point. However, if the three prompts were excluded, the difference between the remaining easiest and most difficult prompts would be 0.5, exactly half a scale point. Reducing the pool of prompts to 57 would thus, on average, ensure that scores are not unduly affected by prompt assignment.

That is, of course, only on average. For example, the decision point for most MELAB users is between scale points 4 and 5. Scale point 4 is wider than the average, spanning a logit range of 4.24. Thus, at the critical decision point, prompt difficulty would have to differ by 2.14 logits to have an effect. On the other hand, scale point 7 only covers a range of 2.94 logits, and differences in prompt difficulty would be more likely to have an effect on actual scores at that scale point. To ensure that there is no prompt-related effect in the test at any point along the scale, the difference between the easiest and most difficult prompt would have to be no larger than 1.47 logits. Approximately 14 of the easiest and most difficult prompts would need to be removed from the pool for this to happen.

Research Question 1

The previous section showed that differences in prompt difficulty do exist. It can be asked whether these differences are random, or if there are particular characteristics and qualities of prompts that make some of them systematically more difficult than others. Table 6 shows the average fair measure scores for different categories within each of the six prompt dimensions, arranged from the easiest to the most difficult. It can be seen that the largest spread between categories can be found within topic domain, about 0.15 of a scale point difference between prompts on education topics and prompts on social topics. For rhetorical task and prompt length, the spread was approximately 0.12 and 0.11, respectively. The spread was less than 0.05 for task constraint, grammatical person, and number of tasks.

Table 6. Fair Averages for Categories within Prompt Dimensions

Topic Domain          n    Fair Ave.
  Education           6    4.98
  Business            10   4.97
  Personal            12   4.86
  Social              29   4.83

Rhetorical Task       n    Fair Ave.
  Expository          30   4.90
  Argumentative       22   4.86
  Narrative           5    4.78

Prompt Length         n    Fair Ave.
  2 sentences         14   4.92
  1 sentence          2    4.89
  3 sentences         20   4.87
  4 sentences         20   4.86
  5 sentences         4    4.81

Task Constraint       n    Fair Ave.
  Unconstrained       12   4.88
  Constrained         40   4.87

Grammatical Person    n    Fair Ave.
  Third Person        32   4.87
  First Person        25   4.87

Number of Tasks       n    Fair Ave.
  1 task              8    4.90
  3 tasks             21   4.89
  4 tasks             6    4.87
  2 tasks             22   4.86

Whether the above differences are significant or not can be determined by examining the results of the ANOVAs, which are reported in Table 7. Of the six prompt dimensions tested, only topic domain showed significant differences, F(3,53) = 3.386, p = .025. Differences in all other dimensions failed to reach statistical significance.

Table 7. Prompt Dimensions Analyses of Variance

                     df (Between)   df (Within)   F       Sig.
Topic Domain         3              53            3.386   .025*
Rhetorical Task      2              54            1.406   .254
Prompt Length        4              55            0.516   .724
Task Constraint      1              50            0.014   .905
Grammatical Person   1              55            0.017   .897
Number of Tasks      3              53            0.120   .948

For topic domain, a test for equality of variance (Levene's statistic) showed that the assumption of equal variances is valid. Thus, a post hoc test using Tukey's HSD was appropriate and was conducted to see where the significant difference or differences resided. The post hoc test, contrary to the ANOVA, did not show any significant differences among the different topic domains (Table 8). However, an inspection of the p-values indicated that the difference between business prompts and social prompts, 0.14 of a scale point, was approaching significance.

Table 8. Mean Differences and p-values for Post-Hoc Test

Col–Row (Sig.)   Business   Education      Personal      Social
Business         .000       -.013 (.998)   .104 (.362)   .140 (.057)
Education                   .000           .117 (.394)   .153 (.106)
Personal                                   .000          .036 (.888)
Social                                                   .000


Significance aside, the difference between the two topic domains that may or may not be significant amounted to 0.14 of a scale point—not likely to make a difference in the final score in most situations. (It might also be worth noting that the outlier prompt identified earlier, Prompt 34, as well as 8 of the 12 most difficult prompts, relate to the social domain. Thus, the same process of excluding a few outlier prompts can likely take care of this problem without much difficulty.) The relatively small differences in scores obtained mean that, no matter the topic domain assigned, test takers are generally able to produce compositions of comparable quality.

The general lack of findings here conforms to much of the literature. It has been noted, for example, that the expected grammatical person of response is not usually very salient to test takers (Greenberg, 1981), and fulfillment of tasks given in a prompt is not usually an important consideration for raters (Connor & Carrell, 1993). Besides, tasks can differ in the length and complexity of response required, from one word (e.g., "Do you agree or disagree?") to several paragraphs (e.g., "Discuss."). Because of this, the number of tasks simply does not capture the complexity or difficulty of a prompt very well. For its part, task constraint was intended to capture the number of ways a test taker could respond to a prompt. It appears that having different ways of responding to a prompt was not all that important, given that (1) one only really needs to give one response, (2) the prompts are apparently generally accessible anyway, and (3) if one prompt was not accessible, test takers could choose to write on the other prompt. There was an apparent pattern where length of prompts is concerned: the longer the prompt, the lower the average score (Table 6), the only exception being one-sentence prompts. However, this relationship was not significant. It would appear, then, that reading a longer prompt might take somewhat more time, but not all that much, which accords with the findings of Polio and Glew (1996).

The one dimension that yielded significant differences was topic domain. Interestingly, in previous studies (Polio & Glew, 1996; Powers & Fowles, 1998), when asked what factors they considered in choosing prompts, test takers overwhelmingly cited background knowledge and topic familiarity. Their intuition about what topic to choose is apparently correct, as, in this test at least, topic domain seems to be the only dimension of prompts that might have an effect on scores.

Research Question 2

The second research question concerns the relationship between prompts and test-taker characteristics. Results of the bias/interaction analyses between prompt and gender, language background, and test-taker proficiency level are given in Tables 9, 10, and 11, respectively. Provided in the tables are the global measures, as well as the individual interaction measures that are significant (|z-score| > 1.96). It can be seen that for all three analyses, the significance of the chi-square tests was 1.00; that is, the null hypothesis that there is no differential effect cannot be rejected. In all three analyses, the average difference between observed score and expected score for the different interaction terms was 0.01 of a scale point. In the case of prompt and language background, however, three combinations yielded significant results, two involving Sinhalese speakers and one involving Spanish speakers. The significant results included bias in both directions, for and against the first-language groups indicated.


Table 9. Bias/Interaction Analysis: Prompt and Gender

-----------------------------------------------------------------------
| Prompt x            | Obs-Exp |  Bias+   Model         |Infit Outfit|
| Gender              | Average | Measure   S.E. Z-Score | MnSq  MnSq |
|---------------------------------------------------------------------|
| Mean (Count: 120)   |   .01   |   -.04    .14    -.29  |  .9    .8  |
| S.D.                |   .01   |    .03    .04     .20  |  .2    .2  |
|---------------------------------------------------------------------|
| Fixed chi-square: 15.4   d.f.: 120   significance: 1.00             |
-----------------------------------------------------------------------

Table 10. Bias/Interaction Analysis: Prompt and Language Background

-----------------------------------------------------------------------
| Prompt x              | Obs-Exp |  Bias+   Model         |Infit Outfit|
| Language Background   | Average | Measure   S.E. Z-Score | MnSq  MnSq |
|---------------------------------------------------------------------|
| 13 x Sinhalese (2)    |  -.83   |   3.56   1.33    2.69  |  .9    .9  |
| 43 x Sinhalese (2)    |   .84   |  -3.03   1.26   -2.41  |  .7    .7  |
| 60 x Spanish (4)      |  -.55   |   2.44   1.02    2.40  | 2.0   2.1  |
|---------------------------------------------------------------------|
| Mean (Count: 2103)    |   .01   |   -.04    .84    -.06  |  .7    .7  |
| S.D.                  |   .05   |    .19    .40     .21  |  .8    .8  |
|---------------------------------------------------------------------|
| Fixed chi-square: 102.6   d.f.: 2103   significance: 1.00           |
-----------------------------------------------------------------------

Table 11. Bias/Interaction Analysis: Prompt and Proficiency Level

-----------------------------------------------------------------------
| Prompt x            | Obs-Exp |  Bias+   Model         |Infit Outfit|
| Proficiency Level   | Average | Measure   S.E. Z-Score | MnSq  MnSq |
|---------------------------------------------------------------------|
| Mean (Count: 358)   |   .01   |   -.02    .31    -.12  |  .8    .8  |
| S.D.                |   .02   |    .06    .21     .26  |  .4    .4  |
|---------------------------------------------------------------------|
| Fixed chi-square: 29.1   d.f.: 358   significance: 1.00             |
-----------------------------------------------------------------------

The results of the bias/interaction analyses for prompt and gender and for prompt and proficiency level are straightforward. They unequivocally show that prompts are not differentially difficult for test takers according to those two characteristics. The results for prompt and language background, however, do require some further discussion. In that analysis, the chi-square test indicates that, overall, bias does not exist. However, in the results for individual combinations, three out of 2,103 bias terms had z-scores that were significant. The bias term for the combination of Spanish and Prompt 60 had high infit and outfit measures associated with it, indicating that the observations do not fit the model very well and that other things were affecting the estimate. As such, this particular finding should be discounted. The two "meaningfully" significant bias terms both involve test takers who speak Sinhalese as a first language. According to the analysis, Prompt 13 was more difficult than expected, as indicated by the negative observed-minus-expected value, whereas Prompt 43 was easier than expected. These measurements, however, are each based on two ratings; because compositions are always double rated, that means one test taker each.

There are two ways of interpreting the findings. One way of interpreting them would be that the two test-takers’ abilities are typical of their language group, and that the prompts are indeed easier and more difficult, respectively, for Sinhalese speakers. The biases would then apply to all other Sinhalese test-takers in the study. The other way of interpreting the findings would be that the two test-takers’ abilities are not typical of their language group, but as the bias/interaction analysis was conducted based on the measure for their group rather than on their individual measures, apparently significant but spurious results were found. It is difficult to think that the first interpretation is the correct one. If there is something about prompts that makes them biased, what accounts for the observed biases? Why are the observed biases in different directions? And why are the biases not reflected in any of the other 58 prompts? Or among those whose language background and culture are similar to the Sinhalese? The second interpretation is more plausible. Given the results of the chi-square test, given the absence of significant findings in over 2,000 bias terms, and given that the only two significant findings are each based on n-sizes of one, it is more likely that the significant findings are artifacts of estimation based on inadequate samples, and are in fact false. Thus, it would be appropriate to conclude that where prompt and language background is concerned, as with the other two background factors, there is in fact no interaction effect.

In the literature, an interaction is sometimes observed between prompt and the three test-taker background characteristics discussed here (e.g., Breland, et al., 2004; Broer, et al., 2005; Gabrielson, et al., 1995; Lee, et al., 2004). Significant findings usually involved only a few prompts from within their respective pools, and effect sizes were usually small. (On the other hand, there are also studies that show no interaction effect, e.g., Park, 2006.) In general, there are a few differences between those studies and the current one, which might contribute to the difference in findings. First, those studies were generally based on stronger assumptions, in that all test takers were matched according to an English language-ability variable. The current study matched a smaller number of test takers under more stringent matching conditions, allowing other test-takers' abilities to be statistically estimated rather than assumed a priori. Second, the other studies' interaction analyses were based on residuals after accounting for ability and the variable of interest. The current study's bias/interaction analyses were conducted on residuals after multiple explanatory variables had been accounted for in the main estimation. There is thus presumably less unexplained variance left for other variables to explain. Finally, the other studies employed logistic regression and, as a result of making stronger assumptions, could compare test-taker background characteristics directly. The current study employed multi-facet Rasch, and as people cannot belong to more than one category for each background characteristic, interaction analysis was done indirectly. That is, the comparison is between the expected score and observed score of, say, a male test taker on a prompt, rather than a comparison between the scores of male and female test takers. Since the differences between the observed and expected scores of male and female test takers are not added together, the bias presumably appears smaller, and perhaps for that reason goes undetected. Of the three differences between this study and other studies, the first two are reasons for thinking the results of the present study are more dependable, whereas the third is a reason for thinking that the present study underestimated and failed to detect real differences. In any case, on the whole, the present study agrees with others in concluding that much of the differences observed, when they are observed, are not examples of item bias but rather of item impact (Clauser & Mazor, 1998; Penfield & Lam, 2000; Zumbo, 1999). That is, differential probabilities of success are attributable to actual differences in the ability of interest.
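For comparison, the logistic-regression approach used in those studies (in the spirit of Zumbo, 1999) conditions on an ability measure and tests group and interaction terms directly. The sketch below binarizes the essay score at a cut point purely for simplicity, and the column names are assumptions; ordinal variants of the same model are also possible.

```python
import pandas as pd
import statsmodels.formula.api as smf

def dif_logistic(df: pd.DataFrame, cut: float):
    """Sketch of logistic-regression DIF screening for a single prompt.

    `df` is assumed to hold one row per test taker, with columns
    'essay_score', 'ability' (a conditioning variable such as scores on the
    other sections), and 'group' (coded 0/1). A significant 'group' term
    suggests uniform DIF; a significant 'ability:group' interaction suggests
    non-uniform DIF.
    """
    data = df.assign(passed=(df["essay_score"] >= cut).astype(int))
    model = smf.logit("passed ~ ability + group + ability:group", data=data)
    return model.fit(disp=False)
```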

Conclusion

The questions investigated by this study have to do with the fairness, validity, and reliability of second language writing performance assessments. The possible threat identified by the study is the systematic variation typically built into performance writing tests—in particular, that different test takers have to respond to different prompts, which may or may not be comparable in difficulty. There is also a problem when any identifiable group's scores are affected by factors that have nothing to do with the construct being measured, as this would indicate the presence of test bias.

The results of the study suggest that in second language writing performance assessments such as the MELAB, assigning different prompts to different test takers does not pose a threat to the validity of scores, and that the tests are valid, reliable, and fair in that regard. The study found that differences in prompt difficulty did not generally have an effect on scores. Of the many prompt dimensions and test-taker characteristics investigated, only prompts on social topics appeared to be more difficult to a degree that possibly made a significant difference in scores, and even then by less than 0.15 of a scale point. Excluding a few outlier prompts was suggested to ensure that, in every case, scores are not unduly affected by prompt variation. The study demonstrated that it is possible to vary prompts and still have tests that yield valid scores.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Breland, H., Bridgeman, B., & Fowles, M. (1999). Writing assessment in admission to higher education: Review and framework. College Board Report, 99-03. Princeton, NJ: Educational Testing Service.

Breland, H., Lee, Y. W., Najarian, M., & Muraki, E. (2004). An analysis of TOEFL CBT writing prompt difficulty and comparability for different gender groups. TOEFL Research Reports, RR-04-05. Princeton, NJ: Educational Testing Service.

Broer, M., Lee, Y. W., Rizavi, S., & Powers, D. (2005). Ensuring the fairness of GRE writing prompts: Assessing differential difficulty. ETS Research Report, RR 05-11. Princeton, NJ: Educational Testing Service.

Brossell, G. (1983). Rhetorical specification in essay examination topics. College English, 45(2), pp. 165–173.

Brossell, G., & Ash, B. H. (1984). An experiment with the wording of essay topics. College Composition and Communication, 35(4), pp. 423–425.

Chiste, K. B., & O’Shea, J. (1988). Patterns of question selection and writing performance of ESL students. TESOL Quarterly, 22, pp. 681–684.

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), pp. 31–44.


Connor, U., & Carrell, P. L. (1993). The interpretation of tasks by writers and readers in holistically rated direct assessment of writing. In J. G. Carson & I. Leki (Eds.), Reading in the composition classroom: Second language perspectives (pp. 141–160). Boston, MA: Heinle and Heinle.

Dobson, B. K., Spaan, M. C., & Yamashiro, A. D. (2003, July). What’s so hard about that? Investigating item/task difficulty across two examinations. Poster presented at the Language Testing Research Colloquium, Reading, United Kingdom.

English Language Institute, University of Michigan. (2005). Michigan English language assessment battery: Technical manual 2003. Ann Arbor, MI: English Language Institute, University of Michigan.

Freedman, S. W. (1983). Student characteristics and essay test writing performance. Research in the Teaching of English, 17(4), pp. 313–325.

Gabrielson, S., Gordon, B., & Englehard, G. (1995). The effects of task choice on the quality of writing obtained in a statewide assessment. Applied Measurement in Education, 8(4), pp. 273–290.

Greenberg, K. (1981). The effects of variations in essay questions on the writing performance of CUNY freshmen. New York: The City University of New York Instructional Resource Center.

Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge: Cambridge University Press.

Hamp-Lyons, L., & Mathias, S. P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing, 3(1), pp. 49–68.

Hinkel, E. (2002). Second language writers’ text: Linguistic and rhetorical features. Mahwah, NJ: Lawrence Erlbaum.

Hoetker, J., & Brossell, G. (1989). The effects of systematic variations in essay topics on the writing performance of college freshmen. College Composition and Communication, 40(4), pp. 414–421.

Jennings, M., Fox, J., Graves, B., & Shohamy, E. (1999). The test-takers’ choice: An investigation of the effect of topic on language-test performance. Language Testing, 16(4), pp. 426–456.

Johnson, J. S., & Lim, G. S. (2009). The influence of rater language background on writing performance assessment. Language Testing, 26(4), pp. 485–505.

Kane, M. T., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), pp. 5–17.

Kroll, B., & Reid, J. (1994). Guidelines for designing writing prompts: Clarifications, caveats, and cautions. Journal of Second Language Writing, 3(3), pp. 231–255.

Kunnan, A. J. (Ed.) (2000). Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida. Cambridge: Cambridge University Press.

Lee, Y. W., Breland, H., & Muraki, E. (2004). Comparability of TOEFL CBT prompts for different native language groups. TOEFL Research Reports, RR-04-24. Princeton, NJ: Educational Testing Service.

Lewkowicz, J. (1997). Investigating authenticity in language testing. Unpublished doctoral dissertation, University of Lancaster.


Lim, G. S. (2009). Prompt and rater effects in second language writing performance assessment. Unpublished doctoral dissertation, University of Michigan.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (2002). What do infit, outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, p. 878.

Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago: Winsteps.com.

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(2), pp. 15–21.

McNamara, T. F. (1996). Measuring second language performance. London: Longman.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement. New York: Macmillan.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), pp. 13–23.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), pp. 241–256.

Miller, M. D., & Legg, S. M. (1993). Alternative assessment in a high-stakes environment. Educational Researcher, 12(2), pp. 9–15.

Moss, P. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62(3), pp. 229–258.

O'Loughlin, K., & Wigglesworth, G. (2007). Investigating task design in academic writing prompts. In L. Taylor & P. Falvey (Eds.), IELTS collected papers: Research in speaking and writing performance (pp. 379–421). Cambridge: Cambridge University Press.

Park, T. J. (2006). Detecting DIF across different language and gender groups in the MELAB essay test using the logistic regression method. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 4, pp. 81–94.

Park, Y. M. (1988). Academic and ethnic background as factors affecting writing performance. In A. C. Purves (Ed.), Writing across languages and cultures: Issues in contrastive rhetoric (pp. 261–272). Newbury Park, CA: SAGE Publications.

Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19(3), pp. 5–15.

Polio, C., & Glew, M. (1996). ESL writing assessment prompts: How students choose. Journal of Second Language Writing, 5(1), pp. 35–49.

Powers, D. E., & Fowles, M. E. (1998). Test takers’ judgments about GRE writing test prompts. ETS Research Report 98–36. Princeton, NJ: Educational Testing Service.

Quellmalz, E. S., Capell, F. J., & Chou, C. P. (1982). Effects of discourse and response mode on the measurement of writing competence. Journal of Educational Measurement, 19(4), pp. 241–258.

Reid, J. (1990). Responding to different topic types: A quantitative analysis from a contrastive rhetoric perspective. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 191-209). Cambridge: Cambridge University Press.

Spaan, M. (1993). The effect of prompt in essay examinations. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing research (pp. 98–122). Alexandria, VA: TESOL.

Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9(2), pp. 123–143.


Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

Wiseman, C. S. (2009, March). Rater decision-making behaviors in measuring second language writing ability using holistic and analytic scoring methods. Paper presented at the annual meeting of the American Association for Applied Linguistics, Denver, Colorado.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Ontario, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.
