At the interface between language testing and second language acquisition: Language ability and context of learning

Lin Gu

Language Testing, 2014, Vol. 31(1), 111–133. DOI: 10.1177/0265532212469177. Originally published online 21 March 2013 (OnlineFirst Version of Record); Version of Record 5 January 2014. Published by SAGE Publications (http://www.sagepublications.com). The online version of this article can be found at http://ltj.sagepub.com/content/31/1/111. Downloaded from ltj.sagepub.com at National Dong Hwa University on April 4, 2014.




Language Testing, 2014, Vol. 31(1) 111–133
© The Author(s) 2012
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532212469177
ltj.sagepub.com

At the interface between language testing and second language acquisition: Language ability and context of learning

Lin Gu
Educational Testing Service, USA

Abstract
This study investigated the relationship between latent components of academic English language ability and test takers’ study-abroad and classroom learning experiences through a structural equation modeling approach in the context of TOEFL iBT® testing. Data from the TOEFL iBT public dataset were used. The results showed that test takers’ performance on the test’s four skill sections, namely listening, reading, writing, and speaking, could be accounted for by two correlated latent components: the ability to listen, read, and write, and the ability to speak English. This two-factor model held equivalently across two groups of test takers, with one group having been exposed to an English-speaking environment and the other without such experience. Imposing a mean structure on the factor model led to the finding that the groups did not differ in terms of their standings on the factor means. The relationship between learning contexts and the latent ability components was further examined in structural regression models. The results of this study suggested an alternative characterization of the ability construct of the TOEFL test-taking population, and supported the comparability of the language ability developed in the home-country and the study-abroad groups. The results also shed light on the impact of studying abroad and home-country learning on language ability development.

Keywords
Language ability, latent mean comparison, multi-group invariance analysis, structural equation modeling, target language contact

Corresponding author:
Lin Gu, Educational Testing Service, 660 Rosedale Road, MS 04-R, Princeton, NJ 08541, USA.
Email: [email protected]

In a review of language testing and assessment, Alderson and Banerjee (2002) stated that ‘an understanding of what language is, and what it takes to learn and use language’ is central to language testing (p. 80). Bachman (2000) asserted that ‘current thinking in applied linguistics about the nature of language ability and language use’ has guided the development and refinement of new tests (p. 2). Together these researchers highlighted the importance of examining and understanding the construct of language ability. Such inquiries entail inspecting whether an empirically derived representation of the construct based on test performance is compatible with the proposed theoretical configuration of the construct the test intends to measure. The testing field’s understanding of language ability can be advanced as results of research attempts of this kind, over multiple tests and under different testing situations, become available.

As studying the nature of language ability is an important area of research in language testing, the field has witnessed a profusion of language ability models of diverse natures. A unitary competence model, which views language ability as indivisible, was endorsed by Oller, among others (e.g. Oller, 1979). Carroll (1965), by contrast, proposed a four-skills approach to conceptualize language ability based on the assumption that the four skills of listening, reading, speaking, and writing are distinguishable areas of performance. Empirical researchers have shown that the nature of language ability consists of multiple components. Speaking and reading were found to be two distinct factors in Bachman and Palmer’s (1983) study. Sang, Schmitz, Vollmer, Baumert, and Roeder (1986) found three component factors of language ability: basic elements of knowledge, integration of basic knowledge elements, and interactive use of language. A higher-order model with three first-order factors, oral–aural, structure–reading, and discourse, represented the nature of language ability in Fouly, Bachman, and Cziko (1990). Bachman, Davidson, Ryan, and Choi (1995) identified a higher-order model with speaking, listening, and test-specific writing skills as distinct first-order ability factors. Both Buck (1992) and Bae and Bachman (1998) demonstrated that the two receptive skills, listening and reading, were factorially different. The components of language ability found in Sasaki (1993) included writing, comprehending short context, and comprehending long context.

The concept of a multidimensional language ability has been well received based on both theoretical and empirical grounds (Kunnan, 1998), although the research community has not yet reached an agreement regarding the nature of the constituents, or on the manner in which they interact (Chalhoub-Deville, 1997). Wolf et al. (2008), observing the field’s uncertainty, stated that no consensus has been found across the tests in the definition of language ability. Sawaki, Stricker, and Oranje (2008) also pointed out that multidimensional competence models come in different forms, varying in terms of the exact factor structures identified. In summary, researchers have reached a consensus that language ability is a complex construct with multiple dimensions. However, it is still unclear what this ability consists of or what the relations among the constituent parts are.

The generality (or the relativity) of language ability has also been a focus of investigation in language testing. Messick (1989) warned against taking the generalizability of a construct meaning across various contexts (e.g. population groups, situations or settings, times, tasks, etc.) for granted. He proposed that context effects, especially different population groups, in score interpretation be systematically appraised in educational measurement. The idea of a universally applicable construct framework seems especially questionable in language testing, considering the usually heterogeneous nature of the test-taking population. Test taker characteristics are recognized as one of the four influences on test scores in Bachman’s (1990) Communicative Language Ability model, and in Bachman and Palmer’s (1996) description of language use in language tests. Earlier researchers called for interpreting the nature of language ability in light of learner variability (Harley, Cummins, Swain & Allen, 1990; Kunnan, 1998). In a more recent review of English language testing and assessment, Alderson and Banerjee (2002) restated the importance of understanding the characteristics of test takers and how these characteristics interact with their abilities measured by a test.

In response to this growing interest, the field has observed a surge of empirical studies that investigated whether the nature of language ability varies as a function of different background characteristics, such as proficiency level (Kunnan, 1992; Römhild, 2008; Shin, 2005), native language background (Shin, 2005; Stricker & Rock, 2008; Stricker, Rock & Lee, 2005; Swinton & Powers, 1980), cognitive skill (Sang, Schmitz, Vollmer, Baumert & Roeder, 1986), gender (Wang, 2006), and length of formal instruction (Stricker & Rock, 2008). The general view is that language ability can be interpreted more meaningfully if relevant test taker characteristics are taken into consideration. However, there are still characteristics that have not attracted enough attention, and therefore their relationships with test performance remain under-researched. One such characteristic is target language contact. Language contact is a concept developed by study-abroad researchers. It specifies the nature and intensity of language learners’ out-of-class contact with the target language. Only a handful of studies investigated the potential impact of target language contact on test performance (Bae & Bachman, 1998; Ginther & Stevens, 1998; Morgan & Mazzeo, 1988; Stricker & Rock, 2008). These studies and their results are discussed below.

In the context of the Advanced Placement French Language Examination, Morgan and Mazzeo (1988) identified four groups of examinees with varying out-of-school contact experience with the French language. The first group, the standard group, had little or no out-of-school French language experience. The second group had spent at least one month in a French-speaking country. The third group regularly spoke or listened to French at home. The fourth group was college students who had no out-of-school experience. A correlated four-factor model consisting of listening and writing, language structure, reading, and speaking provided the best overall fit to the data. Invariance analyses were conducted between the standard group and each of the other three groups. The results showed that the more target language contact experience a group had, the more that group’s performance deviated from that of the standard group. The factor structure based on the standard group was most similar to that of the college group, with both groups lacking significant out-of-school French language experience as a common characteristic. The results indicated that out-of-school target language experience was associated with the factorial structure of French language ability measured by the test.

Target language contact in an at-home environment was examined in relation to listening and reading performance on a Korean language test in Bae and Bachman (1998). Two groups of learners were included: the heritage learners who were likely immersed in Korean at home, and the non-heritage learners who would use a language other than Korean at home. Although a two-factor structure with listening and reading was accepted for both groups in terms of global model fit, differences in individual parameter estimates indicated that some parts of the test functioned differently for learners in different groups. Compared to the non-heritage learners, the heritage learners performed more uniformly on the listening tasks. By contrast, the non-heritage learners demonstrated less variance in reading than the heritage learners. In sum, the study demonstrated that language ability developed by test takers of different heritage language backgrounds could display different underlying structures.

Ginther and Stevens (1998) explored how out-of-classroom target language contact, as indicated by ethnicity and preferred language, interacted with test performance on the Advanced Placement Spanish Language Examination. Group comparisons were conducted between the reference group, Latin Spanish-speakers, and each of the four comparison groups: Mexican Spanish-speakers, Mexican bilingual-speakers, White English-speakers, and Black English-speakers. The study found that the more a group’s background in ethnicity and preferred language deviated from that of the reference group, the less equivalent the factor structures were. This finding indicated the possible influence of language exposure outside of formal instructional settings on differences in factor structure underlying test performance.

Results from the above studies suggested that target language contact moderated test performance. In other words, language abilities developed in groups with different target language contact experiences differed in terms of latent structure. On the contrary, invariance of the factorial structure of language ability was confirmed across subgroups of test takers in the context of TOEFL iBT testing (Stricker & Rock, 2008). In this study, test takers were identified by, among other criteria, degree of exposure to English use in educational and business contexts in an English as a Foreign Language (EFL) setting. Kachru’s (1984) classification was adopted: inner-circle countries, where English is the primary language; outer-circle countries, where English has special administrative status; and expanding-circle countries, where English is considered important but has no special administrative status. Focusing on English as a foreign language, the study did not include test takers from the inner-circle countries. The remaining test takers were divided into either the outer-circle country group or the expanding-circle country group. A higher-order model with four first-order factors was found to be the best-fitting model. The same structure was also identified across the two groups. The results supported the comparability of language ability developed in test taker groups with different degrees of English language exposure in an EFL context.

Among the studies cited above, Morgan and Mazzeo (1988) was the only one that investigated language contact in a study-abroad context. With growing opportunities for studying abroad, language learning has expanded from traditional classroom settings to community-embedded settings. It has become imperative to understand how different context-of-learning variables, including both classroom-based learning and community-embedded learning, interact with test takers’ ability profiles. Researchers in second language acquisition have long been interested in how language performance differs in relation to different learning environments. After reviewing studies comparing learning outcomes from study-abroad and home-country classroom learning, Collentine and Freed (2004) found no convincing evidence that one learning context was of absolute superiority compared to the other. Depending on the aspects of linguistic development and levels of proficiency, one learning context might produce more gains than the other. Building upon what has been found in both fields of language testing and second language acquisition, this study applies a structural equation modeling (SEM) approach to investigate how differences in learning contexts and experiences may interact with the nature and development of language ability. In a SEM paradigm, relationships can be modeled and tested among not only observed but also latent variables. Furthermore, mean differences among groups at the latent construct level, rather than the observed variable level, can be examined. Since SEM permits a consideration of measurement errors by means of estimating error variances in the model, estimates of group differences at the latent construct level are more accurate than the ones estimated at the observed variable level. None of the studies reviewed incorporated a mean structure. Bae and Bachman (1998) called for using a mean structure analysis approach for examining latent group mean differences as a suggestion for future researchers.

By adopting the SEM approach, this study intended to investigate the relationships between language ability and context-of-learning experiences. Situated in the context of the TOEFL iBT testing, this study focused on the academic English ability the test claims to measure, and on this ability’s relationships with test takers’ target language contact and classroom learning experiences. Three research questions were put forward as follows:

1. Does the academic language ability developed in two groups of English language learners, either having lived in an English-speaking environment (the study-abroad group) or not having done so (the home-country group), have equivalent factorial representations?

2. Do the two groups of test takers, the study-abroad group and the home-country group, differ in terms of means on the latent components of the academic English ability?

3. Is there any association between the length of study-abroad, when examined together with classroom learning experiences, and the latent components of the academic English ability?

Method

Study sample

TOEFL iBT public dataset Form A and its associated test performance from 1000 test takers were used for this study. The 1000 test takers were asked to provide information about, among others, the amount of time they spent studying English, the amount of time they spent in content classes taught in English, and the amount of time they spent living in an English-speaking country (see Appendix A for a full list of test taker background variables). Among the sample of 1000, 370 test takers answered all three aforementioned questions presented in a multiple-choice format. Regarding the time spent studying English, the majority of the 370 test takers (about 64%) reported that they had studied English for at least 5 years by the time they took the test. A third of them had studied English for 10 years or more at the time of testing. In terms of the length of taking content classes taught in English, about a third of the 370 test takers indicated that they had never had such experience. Close to 60% of them reported that they had at least one year of such experience. With regard to the time living in an English-speaking country, about two-thirds of the 370 test takers indicated that they had lived in an English-speaking country for varying periods of time before taking the test. These 370 test takers constituted the study sample based on which the study’s research questions were addressed.1

Measures and scoring

The test form had four skill sections: listening, reading, speaking, and writing. The TOEFL iBT test reports scaled scores that range from 0 to 30 for each section, and a total score that is the sum of the scaled scores for the four sections. The raw scores were used for the analysis in this study.

The listening section had six tasks. Each listening task had a prompt followed by five or six selected response questions. There were 34 dichotomously scored items in the listening section. The total possible raw score points for the listening section was 34. The reading section had three tasks. Each reading task had a prompt followed by 12 to 14 selected response questions. There were 41 items in total in the reading section. Thirty-eight of them, worth one point each, were dichotomously scored. Three items were polytomously scored, worth either two or three points. The total possible raw score points for the reading section was 45. The speaking section contained six tasks. The first two tasks asked test takers to provide oral responses to a written prompt. These tasks were considered to be independent because required responses were not dependent on any information provided through other channels during the test. The other four were integrated speaking tasks. These tasks required test takers to provide oral responses based on the information they received through listening or reading or both channels. Each task was rated on a 0–4 holistic scale at a one-point interval. The total possible raw score points for the speaking section was 24. The writing section consisted of two tasks. The first task was an integrated task that required test takers to provide written responses based on the information they received through listening and reading. The second one, an independent task, asked test takers to write in response to a written prompt. Each task was rated on a 0–5 holistic scale at half-point intervals. The total possible raw score points for the writing section was 10.
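As a reader’s cross-check, the section maxima above follow directly from the task and item counts. The 2 + 2 + 3 split for the three polytomous reading items is an assumption for illustration; the text states only that each is worth two or three points and that the section maximum is 45.

```python
# Illustrative cross-check of the maximum raw score per section; this is
# reader-side arithmetic, not part of the study's analysis.

listening_max = 34 * 1              # 34 dichotomous items, 1 point each
reading_max = 38 * 1 + (2 + 2 + 3)  # 38 one-point items + 3 polytomous items
speaking_max = 6 * 4                # 6 tasks rated on a 0-4 holistic scale
writing_max = 2 * 5                 # 2 tasks rated on a 0-5 holistic scale

print(listening_max, reading_max, speaking_max, writing_max)  # 34 45 24 10
```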

Analysis procedures

Analyses were performed by using Mplus (Muthén & Muthén, 2010). Task scores were the level of measure. For the listening and reading sections, a task score was the total score summed across a set of items based on a common prompt. Six observed listening variables (L1–L6) and three observed reading variables (R1–R3) were therefore obtained. A task score in the writing and speaking sections was simply the score assigned for a task. There were six observed speaking variables (S1–S6) and two observed writing variables (W1–W2). Each task score was treated as a continuous variable. There were 17 observed variables in total in the study. Using task scores instead of item scores allowed all variables to be treated as continuous. Using this level of measure, as Stricker and Rock (2008) suggested, would also help alleviate the problem caused by the dependence among items associated with a common prompt.

The parceling technique of aggregating items based on a common prompt has been used in several factor analytic studies in language testing (Kunnan, 1992; Stricker & Rock, 2008; Stricker et al., 2005). A common prerequisite for parceling is to establish unidimensionality of the items to be parceled (Bandalos, 2002; Bandalos & Finney, 2001; Little, Cunningham, Shahar, & Widaman, 2002). Results from exploratory factor analyses performed on items within the listening and reading sections, respectively, indicated that unidimensionality held within each modality. By satisfying this prerequisite, parceling was warranted in this study.
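The parceling step itself reduces to a row-sum over the items that share a prompt. The scores below are simulated stand-ins, not the study’s data:

```python
import numpy as np

# Simulated dichotomous item scores for 8 test takers on one listening task
# with 5 items (rows = test takers, columns = items sharing one prompt).
rng = np.random.default_rng(0)
items_task1 = rng.integers(0, 2, size=(8, 5))

# The parcel (task-level) score is the sum across the task's items, giving
# one observed variable (e.g. L1) per prompt instead of five item scores.
parcel_L1 = items_task1.sum(axis=1)

print(parcel_L1.shape)  # one task score per test taker
```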

Assumptions regarding univariate and multivariate normality were inspected. Univariate normality was checked by examining the skewness and kurtosis indices, and by examining the plots of score distributions. Multivariate normality was evaluated based on the results of univariate normality inspection, as Kline (2005) suggested. The distribution of the values was examined so that an informed decision could be made regarding the choice of an appropriate estimation method.
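The univariate screen described above amounts to computing sample skewness and excess kurtosis for each observed variable. A minimal sketch follows; the simulated scores are illustrative, not from the study:

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of the standardized scores cubed."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

def excess_kurtosis(x):
    """Sample excess kurtosis: mean of standardized scores^4, minus 3."""
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

# Simulated task scores; for an approximately normal variable both indices
# should be near zero, and large absolute values would prompt a closer look
# at the histogram.
rng = np.random.default_rng(1)
scores = rng.normal(loc=3.2, scale=1.2, size=1000)

print(skewness(scores), excess_kurtosis(scores))
```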

In all models the unstandardized loading of the first indicator of a factor was fixed to one to identify the model. Three competing models were fitted to the data to determine which model would better represent the internal structure of the test. The operational TOEFL iBT test uses a four-skills approach to test design, and it reports a separate score on each of the skills (listening, reading, writing, and speaking), as well as a total score. Results from previous factor-analytic studies (Sawaki et al., 2008; Stricker & Rock, 2008; Stricker et al., 2005) showed that the relationships among the four skills could be captured by three competing models. A higher-order model (Figure 1) consists of four independent skill factors, namely Listening (L), Reading (R), Speaking (S), and Writing (W), conditional on a general language ability factor (G). In other words, the correlations among the four skills are explained by G. This model corresponds to the test’s scoring scheme, which reports four skill scores and a total score. In a four-factor model (Figure 2), the four skill factors correlate with one another. This model corresponds to the section structure of the test. A correlated two-factor model (Figure 3) has a speaking factor and a factor associated with listening, reading, and writing (L/R/W). This model, although not compatible with the four-skill section design or the score reporting method, was confirmed and chosen over other competing models to account for test performance on the LanguEdge™ test, a prototype of the TOEFL iBT test, by Stricker et al. (2005). The authors attributed this finding to the lack of emphasis on the development of speaking in English language instruction. All three competing models were specified a priori and tested for fit based on the performance of 1000 test takers. The best-fitting model was adopted as the factor model for the entire group.

The adequacy and appropriateness of models were evaluated based on three criteria: (1) values of selected global model fit indices; (2) individual parameter estimates; and (3) the principle of parsimony. The selection of model fit indices used in this study was based on Kline’s (2005) suggestions. To assess global model fit, the following indices were used: the chi-square test of model fit (χ2), the comparative fit index (CFI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR). A significant χ2 value indicates poor model fit, although this value should be interpreted with caution because it is highly sensitive to sample size. As Hu and Bentler (1999) suggested, a CFI value larger than 0.90 shows that the specified model has a reasonably good fit. An RMSEA value smaller than 0.05 can be interpreted as a sign of good model fit, while values between 0.05 and 0.08 indicate a reasonable error of approximation (Browne & Cudeck, 1993). An SRMR value of 0.08 or below is commonly considered a sign of acceptable fit (Hu & Bentler, 1999). Individual parameter estimates were also examined for appropriateness and significance. Previous researchers (Sawaki et al., 2008; Stricker & Rock, 2008; Stricker et al., 2005) used a correlation of 0.90 to detect extremely high correlations among factors. This criterion was adopted to screen out models with extremely high latent factor correlations. The principle of parsimony favors a simpler model over a more saturated one if the two models fit equivalently. This principle was implemented when choosing between competing models with similar fits.
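The CFI and RMSEA values referred to above can be computed directly from the model and baseline (independence) chi-square statistics using the standard formulas. The numbers plugged in below are invented for illustration only:

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index from model (m) and baseline (b) chi-squares."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, 0.0)
    denom = max(d_m, d_b)
    return 1.0 if denom == 0 else 1.0 - d_m / denom

def rmsea(chi2_m, df_m, n):
    """Root mean square error of approximation for sample size n."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

# Hypothetical fit results: model chi2 = 250 on 115 df, baseline chi2 = 9000
# on 136 df, N = 1000. CFI > 0.90 and RMSEA < 0.05 would both read as good fit.
print(round(cfi(250.0, 115, 9000.0, 136), 3), round(rmsea(250.0, 115, 1000), 3))
```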

Among the 370 test takers who responded to the relevant background questions, two groups of test takers with different target language contact experiences were identified. Group I, consisting of 124 test takers, had never lived in an English language environment (the home-country group). Group II, the study-abroad group, comprising 246 test takers, had lived in an English-speaking country for various lengths of time, ranging from less than six months to more than one year.

Figure 1. Higher-order factor model.


To address the first research question, measurement invariance analyses were conducted. The best-fitting factor model for the entire sample was imposed on both groups simultaneously to test configural invariance first. It was followed by imposing equality control on factor loadings to evaluate metric invariance. The measurement component of a model defines the meanings of the factors by specifying what their indicators are and how they are related to their respective indicators. Establishing measurement invariance is a prerequisite for making any meaningful mean comparisons across groups (Vandenberg & Lance, 2000).

To address the second research question, invariance analyses with a mean structure were carried out. Additional equality control across the groups was imposed first on indicator intercepts to test scalar invariance, and then on factor means to examine whether any differences in latent means could be considered negligible.

Simultaneous multi-group invariance analyses were carried out with parameters constrained to equality across the groups in the hierarchical fashion explained above. Nested models were compared by evaluating chi-square differences (Δχ2) and changes in CFI (ΔCFI). To supplement the Δχ2 test, which is an exact test of fit, evaluating changes in CFI, which assess approximate fit, was recommended (Steenkamp & Baumgartner, 1998). A significant Δχ2 result would indicate that choosing a simpler model over a more saturated model could not be justified. A non-significant test result would suggest choosing the more constrained model based on the principle of parsimony. A ΔCFI less than or equal to 0.01 indicates that invariance is supported (Cheung & Rensvold, 2002).

Figure 2. Correlated four-factor model.
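The nested-model decision rule can be sketched as follows. The chi-square survival function below uses the closed form available for even degrees of freedom, which suffices for this illustration; note that with the MLM (Satorra–Bentler) estimator used in this study, the chi-square difference would additionally require a scaling correction, which this sketch omits. All fit values are invented:

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square variable with even df."""
    assert df % 2 == 0 and df > 0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def compare_nested(chi2_free, df_free, cfi_free,
                   chi2_constr, df_constr, cfi_constr, alpha=0.05):
    """Delta-chi-square test plus the delta-CFI <= 0.01 heuristic."""
    d_chi2 = chi2_constr - chi2_free
    d_df = df_constr - df_free
    p = chi2_sf_even_df(d_chi2, d_df)
    d_cfi = cfi_free - cfi_constr
    # Invariance is retained when the constraints neither fit significantly
    # worse (p > alpha) nor drop CFI by more than 0.01.
    return p, d_cfi, (p > alpha and d_cfi <= 0.01)

# Hypothetical configural vs. metric-invariance models (12 constrained loadings).
p, d_cfi, invariant = compare_nested(260.0, 115, 0.985, 274.0, 127, 0.982)
print(round(p, 2), round(d_cfi, 3), invariant)
```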

To respond to the third research question, a structural regression model was built for the home-country group and the study-abroad group, respectively. For the home-country group, two independent variables – the time spent studying English (‘Study’) and the time spent in content classes taught in English (‘Content’) – were modeled to have direct effects on language ability. Although both types of learning most likely occur in classroom settings, the latter might be more conducive to the development of academic English skills. For the study-abroad group, three independent variables were modeled to have direct effects on language ability: the ‘Study’ variable, the ‘Content’ variable, and the length of living in an English-speaking country (‘Live’).

Figure 3. Correlated two-factor model.


Table 1. Descriptive statistics for the observed variables (N = 1000).

Variable                   Range  Mean  Std. Dev.  Kurtosis  Skewness

Listening Task One (L1)    0−5    3.23  1.22       −0.40     −0.38
Listening Task Two (L2)    0−6    3.46  1.39       −0.54     −0.13
Listening Task Three (L3)  0−6    2.85  1.57       −0.70      0.15
Listening Task Four (L4)   0−5    4.32  1.02        2.61     −1.69
Listening Task Five (L5)   0−6    4.21  1.37       −0.27     −0.59
Listening Task Six (L6)    0−6    4.61  1.50        0.40     −1.07
Reading Task One (R1)      0−15   6.71  2.72       −0.23      0.33
Reading Task Two (R2)      0−15   9.82  3.23       −0.73     −0.37
Reading Task Three (R3)    0−15   9.70  3.26       −0.86     −0.17
Speaking Task One (S1)     0−4    2.50  0.76       −0.29      0.02
Speaking Task Two (S2)     0−4    2.58  0.78       −0.47      0.11
Speaking Task Three (S3)   0−4    2.50  0.77       −0.10     −0.02
Speaking Task Four (S4)    0−4    2.42  0.84       −0.08     −0.10
Speaking Task Five (S5)    0−4    2.55  0.79        0.08     −0.10
Speaking Task Six (S6)     0−4    2.53  0.85       −0.07     −0.24
Writing Task One (W1)      0−5    3.11  1.22       −0.99     −0.20
Writing Task Two (W2)      0−5    3.41  0.86       −0.34     −0.04

Results

Preliminary analyses

Table 1 summarizes the descriptive statistics for the observed variables, including possible score range, mean, standard deviation, kurtosis, and skewness. Variable L4 had a kurtosis value larger than two. The histograms of all the variables revealed that the distributions of Variables L4 and L6 exhibited a ceiling effect. Univariate normality could not be held in these two cases, indicating that the distribution of this set of variables could deviate from multivariate normality. A corrected normal theory estimation method, the Satorra–Bentler estimation (Satorra & Bentler, 1994), was therefore employed, using the MLM estimator in Mplus (Muthén & Muthén, 2010), to correct global fit indices and standard errors for non-normality.
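A univariate screening of this kind can be reproduced with standard tools. The sketch below uses hypothetical data, not the study sample, and the cutoff of two mirrors the screening criterion mentioned above; note that SciPy's `kurtosis` returns excess kurtosis (a normal distribution scores 0).

```python
from scipy.stats import skew, kurtosis

def normality_screen(values, skew_cutoff=2.0, kurt_cutoff=2.0):
    """Flag a variable whose skewness or excess kurtosis is extreme."""
    s = skew(values)
    k = kurtosis(values)  # Fisher definition: excess kurtosis, normal = 0
    return {"skewness": s,
            "kurtosis": k,
            "flagged": abs(s) > skew_cutoff or abs(k) > kurt_cutoff}

# A symmetric toy variable: skewness 0, mildly platykurtic, not flagged
report = normality_screen([1, 2, 3, 4, 5])
```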

No violation of linearity was found by examining scatter plots of all possible pairs of the variables. Pairwise multicollinearity was checked by inspecting the correlation matrix of the variables. As shown in Table 2, dependence among all pairs of variables was moderate (0.32–0.69). No extremely high value of correlation coefficient was found.
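The pairwise multicollinearity check can be sketched as follows. This is illustrative only; the 0.90 flagging threshold is a common heuristic assumed here, not a value taken from the article.

```python
import numpy as np

def collinear_pairs(corr, labels, threshold=0.90):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    corr = np.asarray(corr)
    flagged = []
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if abs(corr[i, j]) > threshold:
                flagged.append((labels[i], labels[j], float(corr[i, j])))
    return flagged

# Toy 3x3 correlation matrix: only the A-B pair is near-collinear
toy = [[1.00, 0.95, 0.40],
       [0.95, 1.00, 0.35],
       [0.40, 0.35, 1.00]]
pairs = collinear_pairs(toy, ["A", "B", "C"])
```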

To establish the best-fitting model, all three previously confirmed models were fitted to the data for comparison. As summarized in Table 3, the selected global fit indices for the three models were all satisfactory except that the chi-square values were all statistically significant (p = 0.00). At the global level all three competing models demonstrated a reasonable fit to the data.

Next, individual parameter estimates were examined. The results of testing the higher-order model showed that the loadings of the L and W factors on the G factor were as high as 0.98 and 1.00 respectively. For the correlated four-factor model, the correlation between the L factor and the W factor was estimated at 0.97. Such high levels of correlation suggested that the factors were not distinct enough to be considered separate factors. These two models with extremely high latent factor correlations were therefore considered inadmissible.

Standardized parameter estimates for the two-factor model are shown in Figure 4. A factor loading represents the extent to which an indicator reflects, or measures, the latent factor. All factor loadings were significant (p < 0.01). A standardized measurement error represents the proportion of the variance of an indicator that is not explained by its factor. The estimated correlation between the two factors did not exceed 0.90. The results from model estimation satisfied all criteria. According to this model, the language ability construct had two latent components. Tasks from the listening, reading, and writing sections all loaded on a common factor (F1), the non-speaking factor. In other words, these tasks were the indicators of the presumed latent ability to listen, read, and write. The six speaking variables all loaded on a second common factor (F2), the speaking factor. Put differently, the six speaking tasks were the indicators of the presumed speaking ability.
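For a simple-structure model in which each task loads on exactly one factor, the standardized loading and its measurement error are linked arithmetically: the squared loading is the variance the factor explains, and the standardized error is its complement. A minimal illustration (the loading value of 0.80 is hypothetical, not one of the estimates in Figure 4):

```python
def decompose_indicator(loading):
    """Split an indicator's standardized variance into explained and error parts."""
    communality = loading ** 2   # variance explained by the factor
    error = 1.0 - communality    # standardized measurement error
    return communality, error

communality, error = decompose_indicator(0.80)
```

A standardized loading of 0.80 thus implies the factor accounts for 64% of the indicator's variance, leaving a standardized error of 0.36.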

Table 3. Fit indices for the three competing models (N = 1000).

                              χ²(S−B)  df   CFI   RMSEA  SRMR

Correlated two-factor model   530.73   118  0.96  0.06   0.04
Higher-order factor model     363.24   115  0.97  0.05   0.03
Correlated four-factor model  282.16   113  0.98  0.04   0.03

Table 2. Correlations of the observed variables (N = 1000).

L1 L2 L3 L4 L5 L6 R1 R2 R3 S1 S2 S3 S4 S5 S6 W1 W2

L1  1.00
L2  0.36 1.00
L3  0.42 0.48 1.00
L4  0.41 0.34 0.35 1.00
L5  0.43 0.43 0.48 0.46 1.00
L6  0.47 0.42 0.48 0.54 0.55 1.00
R1  0.35 0.43 0.49 0.34 0.43 0.42 1.00
R2  0.39 0.49 0.51 0.43 0.52 0.56 0.56 1.00
R3  0.42 0.47 0.53 0.42 0.50 0.58 0.58 0.69 1.00
S1  0.41 0.39 0.39 0.39 0.41 0.45 0.36 0.35 0.39 1.00
S2  0.36 0.36 0.36 0.41 0.35 0.41 0.32 0.35 0.36 0.55 1.00
S3  0.43 0.37 0.39 0.45 0.42 0.45 0.34 0.37 0.38 0.55 0.57 1.00
S4  0.41 0.37 0.44 0.44 0.42 0.51 0.34 0.40 0.45 0.56 0.55 0.59 1.00
S5  0.43 0.39 0.41 0.44 0.41 0.44 0.37 0.40 0.42 0.54 0.56 0.56 0.59 1.00
S6  0.44 0.38 0.43 0.48 0.46 0.52 0.36 0.39 0.41 0.57 0.59 0.60 0.61 0.63 1.00
W1  0.51 0.47 0.54 0.50 0.55 0.62 0.53 0.62 0.62 0.47 0.46 0.51 0.51 0.53 0.55 1.00
W2  0.49 0.48 0.47 0.48 0.49 0.55 0.48 0.54 0.57 0.52 0.55 0.53 0.55 0.55 0.55 0.64 1.00


Figure 4. Correlated two-factor model with standardized estimates.

Accordingly, the correlated two-factor model was adopted as the factor model for the entire group. Both components were skill-based, and together they adequately accounted for performance on the test for the whole group.

Multi-group invariance analysis

To examine whether the academic language ability measured by the test had equivalent factorial representations across the home-country (N = 124) and study-abroad (N = 246) groups, both configural and metric invariance were tested. The two-factor structure obtained from the sample of 1000 test takers was first imposed on the test performance of the 370 test takers combined across the two groups to test model fit. The resulting model fit indices (χ²(S−B) = 268.49, df = 118; CFI = 0.95; RMSEA = 0.06; SRMR = 0.04) indicated that the model fit the subsample reasonably well.

In testing configural invariance, the obtained two-factor model was imposed on both groups simultaneously, but parameter estimates were allowed to differ across the groups. The model fit the data well at the global level, as shown in Table 4. An examination of the parameter estimates showed that factor loadings were all significant, and no factor correlation exceeded 0.90. The results showed that configural invariance held, indicating that the performance in both groups could be accounted for by the same two factors, and that the same task variables were associated with the same latent ability components.

The next step in the sequence involved constraining the unstandardized factor loadings of the same observed variables to be equal across the groups to test metric invariance. Other parameters were allowed to differ across the groups. As displayed in Table 4, the model fit well at the global level. Parameter estimates were all appropriate.

The metric invariance model was nested within the configural invariance model. The Satorra–Bentler chi-square difference (Δχ²(S−B)) test was conducted to find out whether the more constrained model fit as well as the less constrained model. The result was not significant (Δχ²(S−B) = 13.87, Δdf = 15, p = 0.54). The change in CFI was less than 0.01 (ΔCFI = 0.00). The results indicated that imposing additional constraints on the factor loadings did not make the model fit deteriorate badly enough to justify adopting the more saturated model. The model with equal factor loadings across the groups was therefore chosen based on the principle of parsimony. According to this model, the strength of the relationships between indicators and their underlying constructs was the same across the groups. In other words, the indicators measured their respective latent factors in an equivalent way. This outcome implied that the latent ability components were manifested in the same way, and that the ability construct measured by the test had equivalent factorial representations in both groups.
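Because MLM chi-squares are scaled, the simple difference between two nested models' chi-squares is not itself chi-square distributed; the scaled difference statistic of Satorra and Bentler (2001) is used instead. The sketch below takes the reported (scaled) chi-squares together with their scaling correction factors; the numeric inputs are invented for illustration, since the article reports only the resulting Δχ²(S−B) values.

```python
def sb_scaled_diff(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled chi-square difference for nested models.

    t0/df0/c0: scaled chi-square, df, and scaling correction factor of the
    more restricted model; t1/df1/c1: the same for the less restricted model.
    Returns the scaled difference statistic and its degrees of freedom.
    """
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)  # difference-test scaling factor
    return (t0 * c0 - t1 * c1) / cd, df0 - df1

# Hypothetical values: restricted model (100.0, 50 df, c = 1.2) vs.
# less restricted model (90.0, 45 df, c = 1.1)
trd, ddf = sb_scaled_diff(t0=100.0, df0=50, c0=1.2,
                          t1=90.0, df1=45, c1=1.1)
```

The resulting statistic is referred to a chi-square distribution with Δdf degrees of freedom, exactly as in the comparisons reported in Table 4.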

In testing the configural and metric invariances the mean structure was ignored by fixing all factor means to be zero. To examine whether the two groups of test takers, the home-country group and the study-abroad group, differed in terms of means on the latent ability components, invariance analyses proceeded with a mean structure imposed.

More constraints were imposed on the metric invariance model. First, the unstandardized indicator intercepts were held equal across the groups to test scalar invariance. An intercept represents the value of an observed variable for a test taker with a score of zero on the latent construct. Table 4 shows that the model fit reasonably well at the global level. All parameter estimates were appropriate. The Δχ²(S−B) between the scalar and the metric invariance models was significant (Δχ²(S−B) = 38.80, Δdf = 15, p = 0.00), indicating that not all intercepts could be held equal. However, the change in CFI was no larger than 0.01 (ΔCFI = 0.01), which supported intercept invariance. No change in RMSEA or SRMR was found, signaling that there was no substantial deterioration in model fit. An examination of the residuals showed that the residuals associated with the constrained intercepts were all small against the metric of the variables. Small residuals meant that any misfit due to the constraints was negligible. These results indicated that intercept invariance could be held, meaning that individuals at the same latent trait level of zero, but belonging to different groups, had similar means on the observed variables. The outcome of the scalar invariance analysis implied that any differences in the means of the observed variables were due to differences in the means of the latent factors.

Table 4. Multi-group measurement invariance analysis.

Invariance  χ²(S−B)  df   CFI   RMSEA  SRMR  Δχ²(S−B)  Δdf  p     ΔCFI

Configural  404.01   236  0.95  0.06   0.05
Metric      417.69   251  0.95  0.06   0.06  13.87     15   0.54  0.00
Scalar      456.49   266  0.94  0.06   0.06  38.80     15   0.00  0.01
Mean        460.65   268  0.94  0.06   0.06   4.27      2   0.12  0.00

When testing scalar invariance, the latent factor means in Group I were fixed to zero, and the means in Group II were freely estimated. The estimated means in Group II were thus the mean differences between the groups at the latent construct level. The estimated unstandardized mean difference for the first factor was −0.09 (p = 0.23), indicating that the study-abroad group performed worse, but not significantly worse, than the home-country group on the items associated with the ability to listen, read, and write English. The estimated unstandardized mean difference between the groups for the second factor was 0.02 (p = 0.77), showing that the study-abroad group performed better, but not significantly better, than the home-country group on the items associated with the ability to speak. In a statistical sense, the two groups did not differ in terms of means on either latent factor.
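The significance of such a latent mean difference is conventionally judged by dividing the unstandardized estimate by its standard error and referring the ratio to the standard normal distribution. A sketch of that computation follows; the article reports the estimate (−0.09) and p-value (0.23) but not the standard error, so the value 0.075 below is invented for illustration.

```python
from scipy.stats import norm

def latent_mean_z_test(estimate, std_error):
    """Two-tailed z test for an unstandardized latent mean difference."""
    z = estimate / std_error
    p = 2 * norm.sf(abs(z))  # two-tailed p-value
    return z, p

# Hypothetical standard error of 0.075 for the reported estimate of -0.09
z, p = latent_mean_z_test(-0.09, 0.075)
```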

After establishing scalar invariance between the home-country and study-abroad groups, an analysis of the invariance of group means at the latent construct level was conducted. Equality constraints were imposed on the latent factor means by fixing all factor means in both groups to zero. Except for the model chi-square (Table 4), all model fit indices were satisfactory. All parameter estimates were appropriate. The results indicated that factor mean invariance could be held. The Δχ²(S−B) test between this model and the preceding model was not significant (Δχ²(S−B) = 4.27, Δdf = 2, p = 0.12). The change in CFI was less than 0.01 (ΔCFI = 0.00). The model with constraints on factor means was then adopted, implying that the two groups were equivalent in terms of latent construct means.

The outcome of the multi-group invariance analyses showed that the ability construct measured by the test had equivalent factorial representations across the two groups with different target language contact experiences. It further demonstrated that the groups did not differ in terms of their standings on the latent construct either. In conclusion, the moderating effect of the group membership on test performance was minimal.

Structural regression models

To investigate the third research question, a unique structural regression model was built and estimated for each group. Both models were found to fit the data well, as the global model fit indices in Table 5 indicate.

For the home-country group, the ‘Study’ and ‘Content’ variables were modeled to have direct effects on the latent ability components. The standardized parameter estimates displayed in Figure 5 showed that the path coefficients from ‘Content’ to both factors were significantly different from zero (p < 0.01). The path coefficient between ‘Content’ and the non-speaking factor was 0.23, indicating that one standard deviation of change in the length of taking content classes taught in English was associated with 0.23 standard deviations of change in the non-speaking latent factor. The ‘Study’ variable was significantly associated with the speaking factor (p < 0.01) but not with the non-speaking factor, meaning that change in latent speaking ability was associated with the length of studying English, whereas change in non-speaking ability was not.

Figure 5. Structural regression model for the home-country group with standardized estimates.

Table 5. Structural regression models.

          χ²(S−B)  df   CFI   RMSEA  SRMR

Group I   256.62   148  0.92  0.08   0.06
Group II  285.10   163  0.95  0.06   0.05

The standardized regression residual variances of the two factors, represented in the figures as D1 and D2, were 0.92 and 0.77 respectively, which meant that 92% of the variance of the non-speaking factor and 77% of the variance of the speaking factor could not be explained by the two independent variables. The two residuals were correlated at 0.79. Both standardized residuals, especially the one for the non-speaking factor, were very high, indicating that large portions of the factor variances were not explained by the independent variables on which the latent factors were regressed.
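The link between a standardized residual variance and the variance explained is simple arithmetic: R² for each latent factor is one minus its standardized residual. A quick illustration using the home-country values reported above:

```python
def variance_explained(std_residual_variance):
    """R-squared for a latent factor given its standardized residual variance."""
    return 1.0 - std_residual_variance

# Home-country model: residual variances of 0.92 (non-speaking) and 0.77 (speaking)
r2_nonspeaking = variance_explained(0.92)  # 0.08, i.e. 8% explained
r2_speaking = variance_explained(0.77)     # 0.23, i.e. 23% explained
```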

For the study-abroad group, three independent variables, ‘Study’, ‘Content’, and ‘Live’, were modeled to have direct effects on the latent factors. The standardized parameter estimates shown in Figure 6 indicated that the ‘Study’ variable was significantly associated with both ability components (p < 0.01). The ‘Content’ variable had no significant relationship with either of the factors. The ‘Live’ variable had significant associations with the speaking factor (p < 0.01) and the non-speaking factor (p < 0.05), although the associations were not as strong as those between the ‘Study’ variable and the latent ability components.

The standardized residual variances for the factors were high, 0.82 for the non-speaking factor and 0.80 for the speaking factor. They were correlated at 0.77. This indicated that large portions of the factor variances remained unexplained in the models.

Discussion and implications

This study found that the academic English ability as measured by the TOEFL iBT test had two latent components: the speaking ability, and the ability to listen, read, and write. In other words, listening, reading, and writing share a common latent factor statistically. The finding of a two-factor model was consistent with the consensus on the multi-component nature of language ability reached by previous researchers. This two-factor structure, however, differed from the factor solution suggested by the TOEFL iBT test design or the scoring scheme. One possible explanation for the distinctiveness between a speaking factor and a non-speaking factor could be instruction or lack thereof. The speaking section became mandatory with the introduction of the TOEFL iBT test, whereas listening, reading, and writing had long been required as part of the TOEFL testing routine before the TOEFL iBT test. The emergence of the two-factor structure could be, as Stricker et al. (2005) suggested, a reflection of English language instruction practice which places more emphasis on the training of the listening, reading, and writing skills than on the development of oral proficiency. In sum, a lack of test preparation and language training for speaking could have contributed to the identification of a speaking ability that was different from listening, reading, and writing combined. This two-factor model suggested an alternative characterization of the ability construct of the TOEFL test-taking population. However, it should be pointed out that the two-factor structure based on the sample used in this study was only one of the plausible factorial solutions that could be used to account for performance on the TOEFL iBT test.


The study did not find convincing evidence to claim that test takers with study-abroad experience as a whole group performed differently on the test from the ones without such experience, in terms of factorial representation or latent construct means. The language ability developed in the home-country group appeared to be factorially similar to the one in the study-abroad group. Both groups possessed a distinct speaking ability indicated by their responses to the speaking tasks. They also exhibited an ability that could be captured by their responses to the listening, reading, and writing tasks.

This outcome did not agree with what some previous studies found (Bae & Bachman, 1998; Ginther & Stevens, 1998; Morgan & Mazzeo, 1988). Results from these studies suggested that language abilities developed in groups with different target language contact experiences differed in terms of latent factorial structure. The factorial invariance found in this study, however, did resonate with what Stricker and Rock (2008) observed in the context of TOEFL testing: they found the same higher-order structure across subgroups defined by degree of exposure to English in an EFL setting. By using language contact in an English-speaking environment as a test taker characteristic to define group membership, the current study provided another piece of evidence in support of the factorial invariance of the language ability construct measured by the TOEFL iBT test.

Figure 6. Structural regression model for the study-abroad group with standardized estimates.

Furthermore, imposing a mean structure led to the finding that the study-abroad group did not turn out to be better at English than the group who had never lived in an English-speaking environment. Both groups achieved equivalent standings on the latent construct means. The results resonated with the observation by Collentine and Freed (2004) that one learning context was not absolutely better than the other. A preference for studying abroad as opposed to classroom learning, a common belief as captured by Freed, Segalowitz and Dewey (2004), could not be upheld based on this study.

The SEM approach used in this study provided an opportunity to understand how different context-of-learning experiences were associated with latent aspects of language ability. For the home-country group, learning was captured by the length of time spent studying English and taking content classes taught in English. For the study-abroad group, learning was captured by the length of time spent living in an English-speaking country as well as studying English and taking content classes taught in English. This study found significant relationships between various learning variables and the latent ability components. This finding was compatible with results from previous studies showing that both classroom language instruction and studying abroad could lead to language gains (Collentine, 2004; Díaz-Campos, 2004; Lafford, 2004; Sasaki, 2007). However, it also needs to be pointed out that large portions of the factor variances remained unexplained in the models, suggesting that variables other than the ones specified in the models might have been associated with the ability construct being investigated.

Although the study-abroad group did not perform differently from the home-country group as indicated by the results of the mean invariance analysis, the length of living in an English-speaking country was found to be significantly associated with both latent ability components for test takers in the study-abroad group. This finding suggested that for study-abroad learners, the longer they had contact with the target language community, the better their English skills became. However, even within the study-abroad group, test takers’ performance was associated more strongly with the length of studying English than with the length of living in an English-speaking country. This again countered the commonly held belief that studying abroad is superior to classroom learning. In practical terms, the outcome of the study suggested that spending time abroad might not be the only way to prepare for the test, or to improve one’s language ability. Both approaches, receiving training in a classroom setting and studying abroad in the target language environment, could be conducive to language learning or to aspects of it.


Limitations

The analysis was based on data from only one TOEFL iBT administration. The selection of the 370 test takers relied on their self-reported data. When divided into groups in the multi-group analyses, the subgroup samples became relatively small. Therefore, the results should be interpreted and generalized to the whole TOEFL iBT test-taking population with caution.

The design of the multi-group analyses separated the test takers into two groups, either having or not having studied abroad. However, it should be noted that in this study exposure to an English-speaking environment varied from less than six months to more than one year. Examinees who had lived in English-speaking countries for less than six months may not have benefited from their target language contact experience as much as those who had lived abroad for longer periods of time. Grouping test takers with different lengths of time abroad together might have diluted the impact of language contact. A study by Davidson (as cited in Dewey, 2004, p. 322) also pointed out that it might take a full year of target language contact for the linguistic benefits to become evident. Future research could use different grouping methods to compare test takers with varying target language contact experiences. Whether there are thresholds for length of time abroad in relation to proficiency gains could also be explored by future researchers.
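One such alternative grouping method would bin test takers by reported length of residence rather than dichotomizing. A minimal sketch follows; the band boundaries (six months and one year) echo the ranges mentioned above but are otherwise hypothetical choices.

```python
def residence_group(years):
    """Assign a test taker to a length-of-residence band (hypothetical cutoffs)."""
    if years == 0:
        return "home-country"
    if years < 0.5:
        return "abroad < 6 months"
    if years < 1.0:
        return "abroad 6-12 months"
    return "abroad >= 1 year"

# Example: four test takers with 0, 0.3, 0.8, and 1.5 years abroad
groups = [residence_group(y) for y in (0, 0.3, 0.8, 1.5)]
```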

All learning variables in this study, studying English, taking content classes in English, and living in an English-speaking country, were defined by length in years. The richness of these learning experiences was not fully captured in the models tested, which probably explains why a large portion of the factor variances could not be accounted for in the study. Variables other than the ones specified in the models were not investigated, owing to a lack of such information. It is recommended that future researchers who share the same interest collect detailed information on the nature and intensity of learners’ contact with the target language community through well-developed instruments. Such an endeavor would encourage collaborative research efforts joined by both language testing and language acquisition researchers. Through a proper method, such as SEM, this line of research would inform not only what constitutes language ability but also how aspects of this ability are associated with or affected by learning and acquisition factors.

Acknowledgements

I would like to thank Dr Liskin-Gasparro and Dr Ansley for their guidance, help, and encouragement in every phase of this study. I also wish to express my deep gratitude to Walker for his support and love.

Funding

This work was supported by the Ballard Seashore Dissertation Year Fellowship from the Graduate College of the University of Iowa; and the Small Grant for Doctoral Research in Second or Foreign Language Assessment from Educational Testing Service.

Notes

1. Since the study sample of 370 test takers was not randomly generated, two steps were taken to ensure that this sample was comparable to the total sample of 1000 test takers. First, the study sample (N = 370) was compared to the total sample (N = 1000) on all available background variables collected for the test takers. Second, a series of one-sample t-tests was conducted to compare the test performance of the study sample to that of the total sample. The results suggested that the study sample did not deviate significantly from the total sample with regard to either the background variables or the test performance.

References

Alderson, J. C., & Banerjee, J. (2002). Language testing and assessment (Part 2). Language Teaching, 35, 79–113.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42.

Bachman, L. F., Davidson, F., Ryan, K., & Choi, I.-C. (1995). An investigation into the comparability of two tests of English as a foreign language: The Cambridge-TOEFL comparability study. New York: Cambridge University Press.

Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of communicative proficiency. TESOL Quarterly, 16(4), 449–465.

Bachman, L. F., & Palmer, A. S. (1983). The construct validation of the FSI oral interview. In J. W. Oller, Jr (Ed.), Issues in language testing research (pp. 154–169). Rowley, MA: Newbury House.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Bae, J., & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testing factorial invariance across two groups of children in the Korean/English two-way immersion program. Language Testing, 15(3), 380–414.

Bandalos, D. L. (2002). The effects of item parceling on goodness-of-fit and parameter estimate bias in structural equation modeling. Structural Equation Modeling, 9(1), 78–102.

Bandalos, D. L., & Finney, S. J. (2001). Item parceling issues in structural equation modeling. In G. A. Marcoulides & R. E. Schumacker (Eds.), New developments and techniques in structural equation modeling (pp. 269–296). Mahwah, NJ: Lawrence Erlbaum.

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.

Buck, G. (1992). Listening comprehension: Construct validity and trait characteristics. Language Learning, 42(3), 313–357.

Carroll, J. B. (1965). Fundamental consideration in testing for English language proficiency of foreign students. In H. B. Allen (Ed.), Teaching English as a second language: A book of readings (pp. 364–372). New York: McGraw-Hill.

Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test construction. Language Testing, 14(1), 3–22.

Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9(2), 233–255.

Collentine, J. (2004). The effects of learning contexts on morphosyntactic and lexical development. Studies in Second Language Acquisition, 26, 227–248.

Collentine, J., & Freed, B. (2004). Learning context and its effects on second language acquisition. Studies in Second Language Acquisition, 26, 153–171.

Dewey, D. P. (2004). A comparison of reading development by learners of Japanese in intensive domestic immersion and study abroad contexts. Studies in Second Language Acquisition, 26, 303–327.


Díaz-Campos, M. (2004). Context of learning in the acquisition of Spanish second language phonology. Studies in Second Language Acquisition, 26, 249–273.

Fouly, K. A., Bachman, L. F., & Cziko, G. A. (1990). The divisibility of language competence: A confirmatory approach. Language Learning, 40(1), 1–21.

Freed, B. F., Segalowitz, N., & Dewey, D. P. (2004). Context of learning and second language fluency in French: Comparing regular classroom, study abroad, and intensive domestic immersion programs. Studies in Second Language Acquisition, 26, 275–301.

Ginther, A., & Stevens, J. (1998). Language background, ethnicity, and the internal construct validity of the Advanced Placement Spanish Language Examination. In A. J. Kunnan (Ed.), Validation in language assessment (pp. 169–194). Mahwah, NJ: Lawrence Erlbaum.

Harley, B., Cummins, J., Swain, M., & Allen, P. (1990). The nature of language proficiency. In B. Harley, P. Allen, J. Cummins, & M. Swain (Eds.), The development of second language proficiency (pp. 7–25). New York: Cambridge University Press.

Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.

Kachru, B. B. (1984). World Englishes and the teaching of English to non-native speakers: Contexts, attitudes, and concerns. TESOL Newsletter, 18, 25–26.

Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.

Kunnan, A. J. (1992). An investigation of a criterion-referenced test using G-theory, and factor and cluster analyses. Language Testing, 9, 30–49.

Kunnan, A. J. (1998). Approaches to validation in language assessment. In A. J. Kunnan (Ed.), Validation in language assessment (pp. 1–16). Mahwah, NJ: Lawrence Erlbaum.

Lafford, B. A. (2004). The effect of the context of learning on the use of communication strategies by learners of Spanish as a second language. Studies in Second Language Acquisition, 26, 201–225.

Little, T. D., Cunningham, W. A., Shahar, G., & Widaman, K. F. (2002). To parcel or not to parcel: Exploring the question, weighing the merits. Structural Equation Modeling, 9(2), 151–173.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13–103). New York: American Council on Education and Macmillan.

Morgan, R., & Mazzeo, J. (1988). A comparison of the structural relationships among reading, listening, writing, and speaking components of the AP French Language Examination for AP candidates and college students. (ETS Research Report No. 88–59). Princeton, NJ: Educational Testing Service.

Muthén, L. K., & Muthén, B. O. (2010). Mplus User’s Guide (6th ed.). Los Angeles, CA: Muthén & Muthén.

Oller, J. W., Jr. (1979). The factorial structure of language proficiency: Divisible or not? In J. W. Oller, Jr., Language tests at school: A pragmatic approach (pp. 423–458). London: Longman.

Römhild, A. (2008). Investigating the invariance of the ECPE factor structure across different proficiency levels. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6, 29–55. Ann Arbor, MI: University of Michigan English Language Institute. www.lsa.umich.edu/eli/research/spaan

Sang, F., Schmitz, B., Vollmer, H. J., Baumert, J., & Roeder, P. M. (1986). Models of second language competence: A structural equation approach. Language Testing, 3(1), 54–79.

Sasaki, M. (1993). Relationships among second language proficiency, foreign language aptitude, and intelligence: A structural equation modeling approach. Language Learning, 43(3), 313–344.

Sasaki, M. (2007). Effects of study-abroad experiences on EFL writers: A multiple-data analysis. The Modern Language Journal, 91(4), 602–620.

Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis (pp. 399–419). Thousand Oaks, CA: Sage.

Sawaki, Y., Stricker, L., & Oranje, A. (2008). Factor structure of the TOEFL Internet-Based Test (iBT): Exploration in a field trial sample. (TOEFL iBT Research Report No. 04; ETS Research Report No. 08–09). Princeton, NJ: Educational Testing Service.

Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and the structure of language tests. Language Testing, 22(1), 31–57.

Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–107.

Stricker, L. J., & Rock, D. A. (2008). Factor structure of the TOEFL Internet-Based Test across subgroups. (TOEFL iBT Research Report No. 07; ETS Research Report No. 08–66). Princeton, NJ: Educational Testing Service.

Stricker, L. J., Rock, D. A., & Lee, Y.-W. (2005). Factor structure of the LanguEdge™ Test across language groups. (TOEFL Monograph Series Report No. 32). Princeton, NJ: Educational Testing Service.

Swinton, S. S., & Powers, D. E. (1980). Factor analysis of the Test of English as a Foreign Language for several language groups. (TOEFL Research Report No. 06; ETS Research Report No. 80–32). Princeton, NJ: Educational Testing Service.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.

Wang, S. D. (2006). Validation and invariance of factor structure of the ECPE and MELAB across gender. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 4, 41–56. Ann Arbor, MI: University of Michigan English Language Institute. www.lsa.umich.edu/eli/research/spaan

Wolf, M. K., Kao, J., Herman, J., Bachman, L. F., Bailey, A., Bachman, P. L., Farnsworth, T., & Chang, S. M. (2008). Issues in assessing English language learners: English language proficiency measures and accommodation uses—Literature review (Part 1 of 3). (CRESST Report No. 731). Los Angeles, CA: CRESST/UCLA.

Appendix A: A list of test taker background variables

1. Test-taking location
2. Age
3. Gender
4. Native country
5. Native language
6. Reason for taking the test
7. Type of institution interested
8. Amount of financial support expected
9. Amount of time spent studying English
10. Amount of time spent in content classes taught in English
11. Amount of time spent living in an English-speaking country
