An Argument-Based Validity Inquiry into the Empirically-Derived Descriptor-
Based Diagnostic (EDD) Assessment in ESL Academic Writing
by
Youn-Hee Kim
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Department of Curriculum, Teaching and Learning
Ontario Institute for Studies in Education
University of Toronto
© Copyright by Youn-Hee Kim (2010)
An Argument-Based Validity Inquiry into the Empirically-Derived Descriptor-
Based Diagnostic (EDD) Assessment in ESL Academic Writing
Doctor of Philosophy (2010)
Youn-Hee Kim
Department of Curriculum, Teaching and Learning
University of Toronto
Abstract
This study built and supported arguments for the use of diagnostic assessment in
English as a second language (ESL) academic writing. In the two-phase study, a new
diagnostic assessment scheme, called the Empirically-derived Descriptor-based
Diagnostic (EDD) checklist, was developed and validated for use in small-scale
classroom assessment. The checklist assesses ESL academic writing ability using
empirically-derived evaluation criteria and estimates skill parameters in a way that
overcomes the problems associated with the number of items in diagnostic models.
Interpretations of and uses for the EDD checklist were validated using five assumptions:
(a) that the empirically-derived diagnostic descriptors that make up the EDD checklist
are relevant to the construct of ESL academic writing; (b) that the scores derived from
the EDD checklist are generalizable across different teachers and essay prompts; (c) that
performance on the EDD checklist is related to performance on other measures of ESL
academic writing; (d) that the EDD checklist provides a useful diagnostic skill profile for
ESL academic writing; and (e) that the EDD checklist helps teachers make appropriate
diagnostic decisions and has the potential to positively impact teaching and learning ESL
academic writing.
Using a mixed-methods research design, four ESL writing experts created the
EDD checklist from 35 descriptors of ESL academic writing. These descriptors had been
elicited from nine ESL teachers’ think-aloud verbal protocols, in which they provided
diagnostic feedback on ESL essays. Ten ESL teachers utilized the checklist to assess 480
ESL essays and were interviewed about its usefulness. Content reviews from ESL writing
experts and statistical dimensionality analyses determined that the underlying structure of
the EDD checklist consists of five distinct writing skills: content fulfillment,
organizational effectiveness, grammatical knowledge, vocabulary use, and mechanics.
The Reduced Reparameterized Unified Model (Hartz, Roussos, & Stout, 2002) then
demonstrated the diagnostic quality of the checklist and produced fine-grained writing
skill profiles for individual students. Overall teacher evaluation further justified the
validity claims for the use of the checklist. The pedagogical implications of the use of
diagnostic assessment in ESL academic writing were discussed, as were the contributions
that it would make to the theory and practice of second language writing instruction and
assessment.
Acknowledgements
Looking back on my life as a graduate student at OISE/UT, I feel that I was most
fortunate to have had the opportunity to grow and develop as an academic. My PhD
program not only provided me with intellectual knowledge regarding second language
education and educational measurement, but also transformed, shifted, and nurtured my
beliefs, thoughts, and values. This invaluable learning experience was made possible by
the constant guidance of many people.
First of all, I would like to express my deep gratitude to my dissertation
supervisor, Dr. Eunice Jang, for her enormous support and encouragement. Dr. Eunice
Jang was a wonderful academic advisor and the strongest supporter of my research. She
was always available when I needed her and would sit for hours, patiently listening to me,
inspiring me to think deeper, and guiding me in the most advantageous direction. It was
also a privilege to work with her on numerous language assessment projects. Without her,
my four-year PhD journey would not have been quite so fulfilling and rewarding.
My appreciation also goes to Dr. Ruth Childs. Her expertise in educational
measurement was an invaluable resource, and her advice, insight, and enthusiasm
inspired me to complete this research project. Of the numerous statistics and
educational measurement courses I took with her, I especially miss her Test Theory course. She
demonstrated that statistical concepts are not necessarily complex and can be easily
applied to other research inquiries. I would also like to thank her for inviting me to her
Datahost laboratory meetings, where I was able to meet outstanding psychometrician
colleagues.
Words cannot sufficiently express my gratitude to Dr. Sharon Lapkin. She was
always there when I needed her desperately, giving me unselfish support and
encouragement. Her commitment to, and enthusiasm for, second language acquisition and
learning research were also inspirational and something that I wish to emulate forever. I
was most fortunate to be around her, as she was an excellent role model to many graduate
students. I will never forget her unwavering support and encouragement.
This research would not have been possible without the generous and kind
assistance of many people. I am especially thankful to the ESL teachers who participated
in the study. They spent many hours marking essays and proposing ways to develop a
more effective assessment scheme. I would also like to thank Mohammed Al-Alawi,
Seung Won Jun, Robert Kohls, and Jennifer Wilson for their time and insightful
suggestions for my study. Thanks are also due to my friends and colleagues in the Modern
Language Centre at OISE/UT for their friendship. I am sincerely thankful to Khaled
Barkaoui, Seung Won Jun, Eun-Yong Kim, Robert Kohls, Geoff Lawrence, Hyunjung
Shin, Wataru Suzuki, Yasuyo Tomita, and Jennifer Wilson.
I would also like to acknowledge that this research project was fully supported
by The International Research Foundation for English Language Education (TIRF)
Doctoral Dissertation Grant and TOEFL Small Grants for Doctoral Research in Second
or Foreign Language Assessment. I am also deeply grateful for the financial support from
OISE/UT, which enabled me to continue my PhD program in Toronto for several years.
My appreciation also goes to Mr. Jaewoon Choi, the former principal of Daegu
Foreign Language High School in Korea. My memory of him dates back ten years, to when
I worked as an English teacher at the school. One day, he took me out for dinner and
asked me what it was like to be an English teacher. The conversation that we had
reestablished my vision as an educator and motivated me to pursue higher education.
Without that thought-provoking moment, I would never have dreamed of pursuing a
graduate degree. I miss his intellect and insights and hope that our paths will cross
someday.
I am also deeply indebted to Daegu Foreign Language High School and Daegu
Metropolitan Office of Education for their unwavering support during my leave of
absence. I would like to sincerely thank Principal Sung-Whan Choi, Vice Principal
Sang-Ho Soh, former Vice Principal Young-Ok Noh, Mr. Jaehan Bae, and many other
English teachers at the school. I am also grateful to Mr. Young-Mok Nam at Daegu
Metropolitan Office of Education.
My warmest thanks go out to my family in Korea. It would not have been
possible for me to complete my PhD program without their love, patience, and
understanding. My uncle and aunt in Chicago also deserve my deepest gratitude. I cannot
forget the summer of 2006 when they crossed the border with their car packed with my
many belongings. Thanks to their help, I was able to settle in Toronto without difficulty. I
am also sincerely thankful to my mother, who made long-distance calls every morning to
remind me that she stood by me and loved me. Our daily dialogues meant much more to
me than mere words and they remain a happy memory. I dedicate this dissertation to her.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION…………………………………………………….........1
Overview of the Research Problem………………………………………………1
Argument-Based Approaches to Validity………………………………………...5
Overarching Research Framework……………………………………………10
Research Questions……………………………………………………………..12
Significance of the Study……………………………………………………….13
Chapter Overview………………………………………………………………14
CHAPTER 2 REVIEW OF LITERATURE……………………………………………..15
Approaches to L2 Writing Assessment…………………………………………15
Approaches to Diagnostic Assessment………………………………………….47
CHAPTER 3 METHODOLOGY………………………………………………………..57
Research Questions……………………………………………………………..57
Research Design Overview……………………………………………………..57
Participants……………………………………………………………………...62
Instruments……………………………………………………………………...64
Data Collection and Analysis Procedures………………………………………67
Summary………………………………………………………………………..85
CHAPTER 4 DEVELOPMENT OF THE EDD CHECKLIST…………………………87
Introduction……………………………………………………………………..87
Identification of EDD Descriptors……………………………………………...87
Characteristics of EDD Descriptors……………………………………….......122
Refinement of EDD Descriptors………………………………………………125
Summary………………………………………………………………………129
CHAPTER 5 PRELIMINARY EVALUATION OF THE EDD CHECKLIST………...130
Introduction……………………………………………………………………130
Teacher and Essay Prompt Effects…………………………………………….130
Correlation between EDD and TOEFL Scores………………………………..141
Teacher Perceptions and Evaluations………………………………………….141
Summary………………………………………………………………………148
CHAPTER 6 PRIMARY EVALUATION OF THE EDD CHECKLIST………………149
Introduction……………………………………………………………………149
Characteristics of the Diagnostic ESL Academic Writing Skill Profiles……149
Correlation between EDD and TOEFL Scores………………………………..176
Teacher Perceptions and Evaluations………………………………………….177
Summary………………………………………………………………………195
CHAPTER 7 SYNTHESIS…………………………………………………………….196
Introduction……………………………………………………………………196
Validity Assumptions Revisited……………………………………………….196
Implications……………………………………………………………………208
Suggestions for Future Research………………………………………………216
REFERENCES…………………………………………………………………………219
APPENDIX A Definitions of Key Terms………………………………………………239
APPENDIX B ESL Teacher Profile……………………………………………………241
APPENDIX C Guidelines for a Think-aloud Session………………………….............242
APPENDIX D Teacher Questionnaire…………………………………………………247
APPENDIX E Guiding Interview Questions for Teachers…………………..................258
APPENDIX F Textual Characteristics of the Three Essay Sets………..........................260
APPENDIX G Order of Essays in Each Set……………………………………………263
APPENDIX H Excerpts from Teacher Think-aloud Transcripts………………………264
APPENDIX I The EDD Checklist……………………………………………………267
APPENDIX J The EDD Checklist With Confidence Level……………………............269
APPENDIX K Assessment Guidelines I………………………………………………271
APPENDIX L Assessment Guidelines II………………………………………………274
APPENDIX M Correlations Between ETS Scores and Teacher Scores………………279
APPENDIX N Descriptor Measure Statistics…………………………………………280
APPENDIX O The Initial Q-Matrix……………………………………………………281
LIST OF TABLES
Table Page
1 Synthesis of Writing Construct Elements……………………………………….28
2 Research Design Summary……………………………………………………59
3 The Four Largest First Language Groups………………………………………62
4 Distribution of Test-Takers by Language Groups………………………………63
5 Profile of ESL Writing Experts…………………………………………………64
6 Score Distribution of the TOEFL iBT Independent Essays……………………65
7 Score Distribution of the Three Essay Sets……………………………………67
8 Volume of the Teachers’ Think-aloud Transcripts………………………………70
9 Distribution of Essay Batches in the Pilot Study………………………………74
10 Distribution of Essay Batches in the Main Study………………………………78
11 39 Descriptors of ESL Academic Writing Skills………………………………89
12 Inter-Coder Reliability for the 39 Descriptors…………………………………..91
13 Frequency of Descriptors by Teachers and Essay Sets…………………………93
14 Refined 35 EDD Descriptors…………………………………………………127
15 FACETS Data Summary………………………………………………………131
16 Distribution of Unexpected Responses across Teachers………………………131
17 Teacher Measure Statistics……………………………………………………135
18 Teacher Effect…………………………………………………………………137
19 Teacher Agreement on Descriptors……………………………………………139
20 Interactions between Teachers and Descriptors………………………………140
21 Teacher Confidence (%) on the Subject Prompt………………………………143
22 Teacher Confidence (%) on the Cooperation Prompt…………………………144
23 Experts’ Descriptor Classification……………………………………………152
24 Descriptor Clusters Identified by DETECT……………………………………154
25 Confirmatory DIMTEST Results………………………………………………156
26 Initial Descriptor Parameter Estimates…………………………………………158
27 The Final Descriptor Parameter Estimates……………………………………160
28 Descriptors with Poor Diagnostic Power………………………………………164
29 Consistency Indices of Skill Classification…………………………………….168
30 Proportion of Incorrect Patterns Classified by the Number of Skills…………168
31 Case Profiles……………………………………………………………………175
LIST OF FIGURES
Figure Page
1 A general procedure for EBB scale development……………………………41
2 FACET variable map………………………………………………………134
3 The scatter plot for teacher agreement and confidence………………………145
4 CCPROX/HCA results………………………………………………………155
5 Density, time-series, and autocorrelation plots for pMCH………………………..157
6 Density, time-series, and autocorrelation plots for r ………………………..157
7 Proportion of skill masters (pk)………………………………………………161
8 Observed and predicted score distributions…………………………………162
9 The relationship between the number of mastered skills and the total
scores…………………………………………………………………………163
10 Performance difference between descriptor masters and non-masters………163
11 Classification of skill mastery………………………………………………165
12 Distribution of the number of mastered skills………………………………166
13 The most common skill mastery pattern in each number of skill mastery
categories……………………………………………………………………….167
14 Proportion of masters for the subject and cooperation prompts……………169
15 The most common skill mastery patterns for the subject prompt……………170
16 The most common skill mastery patterns for the cooperation prompt………170
17 Number of mastered skills for the subject and cooperation prompts………171
18 Proportion of masters across different proficiency levels……………………172
19 Proportion of masters across different proficiency levels for the subject
prompt……………………………………………………………………..……173
20 Proportion of masters across different proficiency levels for the cooperation
prompt…………………………………………………………………………174
21 Number of mastered skills across different proficiency levels………………175
22 An example of the diagnostic ESL writing profile…………………………213
To My Mother
CHAPTER 1
INTRODUCTION
Overview of the Research Problem
Responding to students’ writing is an important aspect of second language (L2)
writing programs that is fundamentally concerned with the successful development of
their L2 writing skills. Teachers spend a substantial amount of time providing appropriate
feedback about students’ strengths and weaknesses, which students incorporate into their
studies going forward. The significance of feedback has been emphasized in process-
oriented writing instruction, where students have the freedom to revise and resubmit
multiple drafts of their work (Ferris, 2003). As interest in the effect of teacher feedback
on L2 writing has increased, a great deal of recent research has been devoted to exploring
this aspect of second language education. Particular focus has been placed upon whether
feedback makes a difference in students’ writing; what role it plays in enhancing students’
writing; effective ways of delivering feedback; and how students react to it (Hyland &
Hyland, 2006). A number of studies have also investigated the nature and effect of
different types of feedback: written, oral, content, form-focused, teacher, peer, computer-
mediated, one-to-one teacher-student conferences, and so on. This proliferation of
research demonstrates the increasing importance of feedback in all writing programs, and
illustrates how teachers and students alike have striven for much finer-grained diagnostic
information about specific writing skills in an L2 context.
Along the same lines, researchers in educational assessment and measurement
have recently shown increased interest in diagnostic approaches that assess and monitor
students’ progress in particular academic domains. According to Kunnan and Jang (2009),
The main vision in using diagnostic assessment in large-scale and classroom
assessment contexts is to help assess students’ abilities and understanding with
feedback not only about what students know, but about how they think and learn
in content domains, to help teachers have resources of a variety of research-
based classroom assessment tools, to help recognize and support students’
strengths and create more optimal learning environments, and to help students
become critical evaluators of their own learning (Pellegrino, Chudowsky, &
Glaser, 2001).
Partly in response to the limitations of outcome-based assessments from proficiency or
achievement tests, researchers turned to diagnostic assessments that can maximize
pedagogical gains by integrating assessments with instruction and curriculum (Nichols,
1994; Nichols, Chipman, & Brennan, 1995; Pellegrino & Chudowsky, 2003). Current
trends towards higher-quality education and greater accountability have also resulted in
an increasing demand for diagnostic information about individual students’ strengths and
weaknesses in classroom-based and large-scale assessments (Leighton & Gierl, 2007).
The current No Child Left Behind (NCLB) legislation in the United States tracks
students’ academic achievement, providing useful information to students, parents,
teachers, principals, and school district administrators (DiBello & Stout, 2007).
Standardized large-scale assessment is also moving toward diagnostic assessment; the
College Board’s Score Report Plus™ provides detailed information about the test
performance of students who have taken the Preliminary Scholastic Achievement Test
(PSAT) and the National Merit Scholarship Qualifying Test (NMSQT) (DiBello & Stout,
2007).
In L2 assessment and testing, the pressing need for diagnostic assessment is
illustrated by the advent of DIALANG. DIALANG is a European-funded project that
develops computer-based diagnostic tests assessing five aspects of language knowledge
(reading, listening, writing, grammar and vocabulary) in 14 European languages
(Alderson, 2005). It provides detailed information about test results to learners based on
the guiding principles of the Common European Framework of Reference for languages
(CEFR). Diagnostic scores are reported separately on each subskill in each aspect of
language knowledge, and students can review their assessment profiles and determine
which subskills they need to improve.
The importance of diagnostic information has also been emphasized in
constructing a rating scale. Acknowledging the limitations of behavior-based rating
scales, Brindley (1998) called for research into diagnosis-oriented rating scales:
Rather than continuing to proliferate scales which use generalized and
empirically unsubstantiated descriptors, therefore, it would perhaps be more
profitable to draw on SLA [Second Language Acquisition] and LT [Language
Testing] research to develop more specific empirically derived and
diagnostically oriented scales [italics added] of task performance which are
relevant to particular purposes of language use in particular contexts and to
investigate the extent to which performance on these tasks taps common
components of competence. (p. 134)
Recognizing the problems associated with intuitive or a priori methods in most rating
scales, he placed particular emphasis on empirical sources, as well as the diagnostic
functions that rating scales must have. In a similar vein, Pollitt and Murray (1996)
suggested diagnosis-oriented rating scales, pointing out the limited view of Alderson’s
(1991) trichotomous classification of rating scales (i.e., user-oriented, constructor-
oriented, and assessor-oriented scales). The area of English for Specific Purposes (ESP)
was no exception; Grove and Brown (2001) proposed a diagnostic assessment scheme
that can help to assess medical students‟ oral communicative skills. Although she did not
empirically develop or validate it, Luoma (2004) also put forward the idea of a diagnostic
rating checklist for assessing L2 oral proficiency.
Despite the increasing interest in and need for a diagnostic approach to
educational assessment and L2 assessment and testing, very little research has been
devoted to it. The literature in this area is scant and the concepts are confusing, with no
theoretical foundation (Alderson, 2007). In addition, few principles exist that guide the
development of a diagnostic test, as we know little about what underlying constructs
should be identified, operationalized, and measured (Alderson, 2007). The technical
knowledge that frames diagnostic assessment is also in its early stages (Jang, 2008). In
particular, psychometric measurement models that operationalize diagnostic assessment
are relatively new and therefore little-explored methods in L2 assessment and testing.
Apart from Buck and Tatsuoka’s (1988) pioneering introduction of the Rule-Space
procedure to L2 assessment and testing, only a handful of studies have attempted to
empirically explore the potential applications of psychometric diagnostic models to L2
assessment and testing. Jang (2005, 2009a) applied the Reduced Reparameterized
Unified Model ([Reduced RUM], Hartz, Roussos, & Stout, 2002) to two forms of the
reading subtest in the LanguEdge English Language Learning Assessment in order to
evaluate the effect of skills diagnosis on teaching and learning, while Lee and Sawaki
(2009a) investigated the comparability of different diagnostic assessment models
(General Diagnostic Model [von Davier, 2005], Fusion Model [Hartz et al., 2002], and
Latent Class Analysis [Yamamoto & Gitomer, 1993]) on the operational TOEFL iBT
reading and listening subtests. The lack of other significant research warrants further
investigation into how diagnostic models might be incorporated into L2 assessment and
testing.
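To illustrate the skill-diagnosis logic behind such models, the Reduced RUM expresses the probability of success on an item as a baseline probability π*, earned when all skills the item requires are mastered, discounted by a penalty r* for each required but unmastered skill. The sketch below is a simplified illustration of that item response function, not an implementation of any operational system, and all parameter values are invented for the example:

```python
def reduced_rum_prob(pi_star, r_star, q, alpha):
    """Probability of a correct response under a simplified Reduced RUM.

    pi_star : probability of success for an examinee who has mastered
              every skill the item requires
    r_star  : per-skill penalties in (0, 1); applied only when a skill
              is required but not mastered
    q       : 0/1 list, a Q-matrix row (which skills the item requires)
    alpha   : 0/1 list, the examinee's skill-mastery pattern
    """
    p = pi_star
    for rk, qk, ak in zip(r_star, q, alpha):
        if qk == 1 and ak == 0:  # required skill not mastered
            p *= rk
    return p

# Hypothetical item requiring skills 1 and 3 of five (e.g., content,
# organization, grammar, vocabulary, mechanics); values are made up.
q = [1, 0, 1, 0, 0]
r_star = [0.4, 1.0, 0.6, 1.0, 1.0]

full_master = reduced_rum_prob(0.9, r_star, q, [1, 1, 1, 1, 1])  # 0.9
non_master = reduced_rum_prob(0.9, r_star, q, [0, 1, 0, 1, 1])   # 0.9 * 0.4 * 0.6
```

Fitting such a model estimates π* and r* for every descriptor and a mastery pattern for every examinee, which is what yields the fine-grained skill profiles discussed below.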
Considering that students want substantial diagnostic feedback from their
teachers in L2 writing programs (Cohen & Cavalcanti, 1990; Ferris, 1995, 2003; Ferris &
Roberts, 2001; Hedgcock & Lefkowitz, 1994; Hyland, 1998; Lee, 2004; Leki, 1991,
2006; Zhang, 1995), it is conceivable that a diagnostic approach might be an appropriate
means of helping to guide and monitor students’ L2 writing progress. While researchers
agree on the potential of such a method, a diagnostic approach has not been encouraged
in a direct L2 writing assessment context. For example, Alderson (2005) argued that “…
in the case of diagnostic tests, which seek to identify relevant components of writing
ability, and assess writers’ strengths and weaknesses in terms of such components, the
justification for indirect tests of writing is more compelling” (pp. 155-156). He further
expressed reservations about the use of diagnostic tests in assessing higher-order
integrated language skills:
Indeed, diagnosis is more likely to use relatively discrete-point methods than
integrative ones, for the simple reason that it is harder to interpret performance
on integrated or more global tasks. Thus, for the purpose of identification of
weaknesses, diagnostic tests are more likely to focus on specific elements than
on global abilities, on language rather than on language use skills, and on ‘low-
level’ language abilities (for example, phoneme discrimination in listening tests)
than ‘higher-order’ integrated skills. (p. 257)
Alderson’s arguments suggest that L2 writing is a multi-faceted and complicated mental
process, and that it is difficult to deconstruct L2 writing into separate elements that
contain tangible diagnostic information; however, his claims have yet to be empirically
investigated and warrant further supporting evidence.
One significant challenge to providing appropriate diagnostic feedback on direct
L2 writing skill assessments is the limited number of writing items that students
complete in testing situations. Students in most large-scale foreign language assessment
programs are required to complete just a single writing item in a given amount of time,
and their L2 writing ability is judged holistically rather than analytically. A single
aggregate score is awarded for the overall quality of writing, but little information is
single score is awarded for the overall quality of writing, but little information is
provided to students about their strengths or weaknesses with regard to specific L2
writing skills. Even when an analytic rating scheme is used, it is difficult to gain a
detailed descriptive evaluation beyond the separate subscore on each writing subskill.
The problem becomes more serious when a small number of writing items are subject to
psychometric measurement models, greatly increasing the likelihood of measurement
errors. This is particularly true in item response theory (IRT)-based diagnostic models,
which require a large sample in order to provide a stable estimate of skill parameters.
This problem must be resolved if L2 diagnostic models are to generate the most accurate
and fine-grained diagnostic information possible.
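The measurement-error point can be made concrete with a back-of-the-envelope calculation: under a Rasch model, the standard error of an ability estimate is roughly the inverse square root of the total test information, so a one-item writing test yields a far noisier estimate than a many-descriptor instrument. The sketch below assumes, purely for illustration, equally difficult items well matched to the examinee's ability:

```python
import math

def rasch_ability_se(n_items, theta=0.0, b=0.0):
    """Approximate SE of a Rasch ability estimate.

    Item information at ability theta is p * (1 - p), where
    p = 1 / (1 + exp(-(theta - b))) is the probability of success on an
    item of difficulty b; SE is roughly 1 / sqrt(total information).
    """
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return 1.0 / math.sqrt(n_items * p * (1.0 - p))

one_item = rasch_ability_se(1)      # 2.0 logits: very noisy
checklist = rasch_ability_se(35)    # about a third of a logit
```

With a single item the estimate is uncertain by about two logits, whereas 35 observations (as in a descriptor checklist) shrink the uncertainty by a factor of nearly six; the same sample-size logic is what strains IRT-based diagnostic models when only one or two writing tasks are available.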
In response to this need for research, this study built and supported arguments
for the use of diagnostic assessment in English as a second language (ESL) academic
writing. In the two-phase research, a new diagnostic assessment scheme, called the
Empirically-derived Descriptor-based Diagnostic (EDD) checklist, was developed and
validated for use in small-scale classroom assessment. The EDD checklist assesses ESL
academic writing ability based on empirically-derived evaluation criteria and estimates
skill parameters in a novel way that overcomes the problems and limitations associated
with the number of items in diagnostic models. Interpretations of and uses for the EDD
checklist were validated using multiple data sources and from diverse perspectives.
Argument-based approaches to validity provided an overarching logical framework that
guided the development of the EDD checklist and justified its score-based interpretations
and uses. I hope that the argument-based evidentiary reasoning in this study will
ultimately help to examine whether scores derived from the EDD checklist can be used to
diagnose the domain of writing skills required in an ESL academic context.
Argument-Based Approaches to Validity
Building on the work of Cronbach (1971, 1988), House (1980), and Messick
(1989), Kane (1992) argued that test-score interpretation is associated with a chain of
interpretive arguments, and that the plausibility of those arguments determines the
validity of test-score interpretations. Kane also made it clear that validity is connected to
the interpretation of a test score rather than to a test or to the test score itself. In his
seminal article, An argument-based approach to validity, he suggested that interpretive
arguments establish a network of inferences from observations to score-based
conclusions and decisions, and guide the collection of relevant evidence that supports
those inferences and assumptions. A series of different types of inferences are laid out in
interpretive arguments, each of which is articulated based on its underlying assumptions
(Crooks, Kane, & Cohen, 1996; Kane, 1992, 1994; Shepard, 1993). Influenced by the
school of practical reasoning or informal logic (Cronbach, 1982, 1988; House, 1980),
Kane further noted that interpretive arguments are practical in that they are based on
assumptions which cannot be taken as given, and that the evidence supporting those
assumptions cannot be complete. The arguments in test-score interpretations are therefore
plausible or credible, but not decisive, given all available evidence. Kane suggested three
criteria for evaluating so-called practical arguments: (a) the clarity of the argument, (b)
the coherence of the argument, and (c) the plausibility of assumptions. He took particular
care to note that the weakest and most questionable assumptions must be identified and
supported by multiple pieces of evidence (Kane, 1992, 2001, 2004), and that
counterarguments must be identified and refuted in order to reinforce practical arguments
(Cronbach, 1971; Kane, Crooks, & Cohen, 1999; Messick, 1989).
Kane (1992) and Kane et al. (1999) defined the inferences in interpretive
arguments as evaluation, generalization, extrapolation and explanation, and decision.
Each inferential link is based on an assumption that must be supported by evidence. The
first inference, evaluation, links observation of a performance to an observed score, and
is based on the assumptions that the observed performance and test-score interpretation
occur under the same conditions, and that the scoring criteria are used in an appropriate
and consistent manner. Evidence in support of the inference on evaluation is collected by
examining how a test is administered, how students' responses are scored, and what
scoring criteria are used. The second inference, generalization, links an observed score
on a particular test to a universe score (i.e., a score on a test that is similar to the one
from which the observed score is drawn), which assumes that the observed score is based
on random or representative samples from the universe of generalization. The evidence
supporting an inference on generalization involves reliability or generalizability analysis.
The third inference, extrapolation, links a universe score to a target score or score-based
interpretation, extrapolating from a narrowly-defined universe of generalization to a
score on a widely-defined target domain beyond the test. The underlying assumption is
that a score on a test reflects performance on a relevant target domain. Criterion-related
validity evidence can support the inference on extrapolation. The link, explanation, is a
theory-based inference that relates to the construct of interest, assuming that the theory is
plausible. The final inference, decision, links a score-based interpretation to a decision,
based on assumptions about the values and consequences of test use. Kane et al. (1999)
and Kane (2001) cautioned that each inference should be convincing; if any is not
convincing, the defensibility of the entire interpretive argument will be undermined.
Refining Kane (1992) and Kane et al.'s (1999) earlier model of an argument-
based approach to validation, Kane (2001, 2002) further classified interpretations as
descriptive and decision-based (or prescriptive). Descriptive interpretations involve
inference from a score to a descriptive estimation of examinees' ability without an
explicit statement about the use of test scores, whereas decision-based or prescriptive
interpretations involve making decisions about examinees based on the descriptive
interpretations. Descriptive interpretations are usually subsumed under decision-based
interpretations. There might therefore be cases in which a descriptive interpretation of a
particular test score is valid, but a decision-based interpretation of the use of the test
score is not (Kane, 2002). Kane (2002) also classified inferences and their
supporting assumptions into semantic and policy. Semantic inferences and assumptions
involve descriptive interpretations of test scores, whereas policy inferences and
assumptions are associated with decision-based interpretations. Policy inferences and
assumptions are typically evaluated according to the consequences of a particular
decision: a policy with positive consequences is considered effective, whereas a policy
with negative consequences is ineffective (Kane, 2002). Despite the importance of
decision-based interpretations and policy inferences and assumptions, Kane (2002) pointed
out that most validity research has glossed over these areas. Argument-
based approaches have recently been viewed as a two-part scheme consisting of an
interpretive argument and a validity argument. An interpretive argument states an
intended interpretation and use of a test score, whereas a validity argument critically
evaluates the plausibility of the interpretive argument based upon empirical investigation
of its inferences and assumptions (Kane, 2002, 2004).
The substantive argument-based approach has been taken up by others. Mislevy,
Steinberg, and Almond (2002, 2003) proposed an assessment argument mechanism,
evidence-centered assessment design (ECD), which organizes interrelated assessment
elements from conceptual test development to operational processes. In the ECD
framework, coherent arguments are established and assessment elements are developed
to transform the arguments into an operational process (Mislevy et al., 2002). ECD views
assessment as evidentiary reasoning, and builds and supports an assessment argument
that aligns with the assessment purpose. Evidentiary reasoning is influenced by
Toulmin's (2003) argument structure, relying upon the chain of claim, data, warrant,
backing, and alternative explanations. According to Toulmin, a chain of reasoning makes
claims based on data, warrants, and rebuttals. A claim is a conclusion that we wish to
justify based upon data (what Kane called an interpretive argument). Data are any
available information, facts, or evidence on which a claim is built, while a warrant is
the justification of the inferential link between the claim and the data. A backing is
any theory, previous research, experience, or evidence that supports the warrant; a
rebuttal is a counterclaim that undermines the inference from data to claim; and rebuttal
data are what support or weaken the alternative (Mislevy et al., 2003).
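As a purely illustrative sketch, Toulmin's chain can be represented as a simple data structure. The class name, field names, and example content below are hypothetical and not drawn from any existing assessment framework; they simply mirror the terms defined above.

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    claim: str                      # the conclusion we wish to justify
    data: list[str]                 # information or evidence the claim rests on
    warrant: str                    # justification linking the data to the claim
    backing: list[str] = field(default_factory=list)   # support for the warrant
    rebuttals: list[str] = field(default_factory=list) # counterclaims undermining the inference

# A hypothetical diagnostic-assessment argument expressed in Toulmin's terms.
arg = ToulminArgument(
    claim="The student has mastered paragraph organization.",
    data=["Organization descriptors were satisfied in three rated essays."],
    warrant="Consistent performance on organization descriptors indicates mastery.",
    backing=["Prior rating-scale research linking the descriptors to the construct."],
    rebuttals=["All three prompts may have favored familiar rhetorical patterns."],
)
print(arg.claim)
```

The point of the structure is that each claim is only as strong as its warrant and backing, and that every rebuttal left unrefuted weakens the inference from data to claim.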
Argument-based approaches such as Kane's interpretive argumentation and
Mislevy et al.'s evidentiary reasoning have recently been advanced by Bachman (2005),
who proposed an assessment use argument emphasizing the central role of test use or
consequences in a validity argument. Bachman acknowledged that there are no
systematic principles or practical procedures explicitly linking scores and score-based
interpretations to test use and the consequences of test use, in spite of substantial
awareness of test use and consequences in validity arguments (e.g., Kane, 2001, 2002,
2004; Messick, 1989). He also pointed out that issues that fundamentally concern validity,
such as test usefulness (Bachman & Palmer, 1996), fairness (Kunnan, 2004), and ethics
(Lynch, 2001), have been addressed separately, without being directly linked to validity
argumentation. This lack of a substantive and integrative approach to validity led Bachman
to develop an assessment use argument based on Kane's interpretive argumentation,
Mislevy et al.'s evidentiary reasoning, and Toulmin's argument structure.
An assessment use argument is a two-fold approach, consisting of an assessment
validity argument and an assessment utilization argument. An assessment validity
argument involves an inferential link from performance on a test to an interpretation of a
test score, whereas an assessment utilization argument links a score-based interpretation
to a decision or test use. An assessment utilization argument is associated with what Kane
(2001, 2002) called the decision-based or prescriptive interpretations and policy
inferences. Bachman (2003, 2005) argues that an argument-based approach to validity
should extend to an assessment utilization argument, underscoring the consequences of
test use.
Following Toulmin (2003), both argument structures are built upon a chain of
inferences supported by warrants and backing. In the assessment validity argument,
performance on a test comprises data, and a claim with regard to a test-score
interpretation is made based upon that data. In the assessment utilization argument, test-
score interpretations drawn from the assessment validity argument become the data, and
a decision is made based on the validity claim. The decision to be made becomes the
claim, which is supported by four different types of warrants: relevance, utility, intended
consequences, and sufficiency. Relevance and utility indicate the degree to which a test-
score interpretation is relevant to and useful for decision making, while intended
consequences determine whether using the assessment to make a decision will bring
beneficial consequences to assessment users. Finally, sufficiency refers to the degree to
which the assessment provides sufficient information to make a decision. The rebuttals
are counterclaims that undermine the inference from the data to the claim, and four
different types of rebuttals can be articulated to challenge each warrant.1
The argument-based approach to validation provides a logical, coherent, and
unified set of procedures that can guide test developers and help assessment users to
formulate and justify score-based interpretations and assessment decisions (Bachman,
2005; Kane, 2001). This approach has been well accepted across disciplines, and has
been applied across a wide range of studies.
1 Bachman (2005) also suggests two general rebuttals: “reasons for not making the intended decision, or
for making a different decision,” and “unintended consequences of using the assessment and/or making the
decision” (p. 21). However, Fulcher and Davidson (2007) point out that these two rebuttals were derived
from his misunderstanding of the nature of a rebuttal or counterclaim. Citing Toulmin (2003), who defined
a rebuttal as “circumstances in which the general authority of the warrant would have to be set aside”
(p. 94), they made it clear that rebuttals should be related to warrants in order to refute the argument.
Rebuttals in an assessment utilization argument should thus be associated with the four types of warrants.
Overarching Research Framework
This study built and supported arguments for the score-based interpretation and
use of the EDD checklist in ESL academic writing. The central research questions were
formulated based upon the logical process of the argument-based approach to validity,
guiding a set of comprehensive procedures for the development of the checklist and
justifying its score-based interpretations and uses. In order to address various aspects of
validity inferences, the following assumptions pertaining to different types of evidence
were examined:
The empirically-derived diagnostic descriptors that make up the EDD checklist
are relevant to the construct of ESL academic writing.
The scores derived from the EDD checklist are generalizable across different
teachers and essay prompts.
Performance on the EDD checklist is related to performance on other measures
of ESL academic writing.
The EDD checklist provides a useful diagnostic skill profile for ESL academic
writing.
The EDD checklist helps teachers make appropriate diagnostic decisions and has
the potential to positively impact teaching and learning ESL academic writing.
The first assumption suggests that the empirically-derived descriptors that make
up the EDD checklist reflect knowledge, processes, and strategies consistent with the
construct of ESL writing in an academic context. In order to test this assumption,
theoretical discussions on ESL academic writing assessment were reviewed and
compared, with a special focus on ESL writing rating scale research and development
procedures. The extent to which EDD descriptors can be viewed independently of each
other or divided into multiple subskills of ESL writing was also explored from diverse
perspectives, using content reviews from ESL academic writing experts and statistical
dimensionality evaluations of the descriptors. If the checklist reflects a multidimensional
view of L2 academic writing (Cumming, 2001; Cumming et al., 2000) and assesses such
diverse aspects as content, organization, and language use, a theory-based inference
would be supported.
The second assumption addresses the potential impact of various sources of
random error associated with sampling conditions of observation. Rater and test method
effects are a critical factor that threatens the valid interpretations of test scores and
increases the likelihood of construct-irrelevant error. If the scores derived
from the EDD checklist are constant under different conditions of observation involving
different teachers and essay prompts, the generalizability assumption will be supported. A
many-faceted Rasch model was used to explore the reliability issues associated with
teachers and essay prompts. If teachers exhibit random rating patterns or essay prompts
interact with the EDD checklist in biased ways, valid score interpretation will be undermined.
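The many-faceted Rasch analysis described above can be sketched in one common form (following Linacre's facets model; the notation below is the standard one, not taken from this study):

```latex
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

where \(P_{nijk}\) is the probability that student \(n\) receives category \(k\) rather than \(k-1\) on descriptor \(i\) from teacher \(j\); \(B_n\) is the student's writing ability, \(D_i\) the descriptor's difficulty, \(C_j\) the teacher's severity, and \(F_k\) the step difficulty of category \(k\). In this form, erratic teacher severity estimates or significant teacher-by-prompt bias interactions would be the model-based signs of the rating problems noted above.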
The third assumption, related to concurrent or criterion-related validity, examines
the extent to which scores awarded using the EDD checklist are related to those awarded
using other measures of ESL academic writing. This assumption does not necessarily
seek convergent evidence among different measures of ESL academic writing because a
single measure should not automatically be the norm against which others are compared.
Instead, divergent evidence could provide additional insight into the target constructs that
the two different measures intend to assess. A correlation between scores awarded using
the EDD checklist and scores awarded using the TOEFL independent writing rating scale
was calculated. If the two sets of scores are highly correlated, it can be assumed that the
checklist is an effective measure of the ESL writing ability required in an academic
context. However, a low correlation does not necessarily mean that the EDD checklist
does not meet this criterion; rather, it will highlight the different purposes for which the
two measures were developed. While the TOEFL rating scale is intended to place ESL
students into the appropriate proficiency levels, the EDD checklist is intended to provide
them with fine-grained diagnostic feedback.
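The correlation check described above can be sketched in a few lines. The scores below are invented for illustration only; they are not data from this study.

```python
import numpy as np

# Hypothetical scores for eight essays, each rated with two instruments:
# summed EDD descriptor scores and TOEFL-style holistic ratings (0-5 scale).
edd_totals = np.array([28, 22, 35, 18, 30, 25, 33, 20])
toefl_scores = np.array([4.0, 3.0, 5.0, 2.5, 4.5, 3.5, 5.0, 3.0])

# Pearson correlation between the two sets of scores
r = np.corrcoef(edd_totals, toefl_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A high coefficient would suggest the two instruments rank essays similarly; as noted above, a low one may simply reflect the instruments' different purposes rather than a flaw in the checklist.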
The fourth assumption suggests that writing skill profiles generated using the
EDD checklist will provide useful and sufficient diagnostic information about students'
strengths and weaknesses in ESL academic writing. This assumption also examines the
extent to which score interpretations made using the EDD checklist are accurate and
reliable. The Reduced Reparameterized Unified Model (Reduced RUM; Hartz et al.,
2002) was used to explore the diagnostic quality of the checklist from multiple
perspectives. If the evidence indicates strong diagnostic power, it will support the
interpretive inference.
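As a sketch of the model referred to above, the Reduced RUM item response function can be written in its standard fusion-model form (following Hartz et al., 2002; the notation is generic, not specific to the EDD checklist):

```latex
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i, \theta_i)
  = \pi_j^{*} \prod_{k=1}^{K} \left(r_{jk}^{*}\right)^{(1-\alpha_{ik})\,q_{jk}} P_{c_j}(\theta_i)
```

where \(\pi_j^{*}\) is the probability that an examinee who has mastered every skill required by item \(j\) applies them correctly, \(r_{jk}^{*} \in (0,1)\) is the penalty for lacking required skill \(k\), \(q_{jk}\) indicates whether item \(j\) requires skill \(k\), \(\alpha_{ik}\) is examinee \(i\)'s mastery status on skill \(k\), and \(P_{c_j}(\theta_i)\) absorbs the influence of skills outside the Q-matrix. High \(\pi_j^{*}\) and low \(r_{jk}^{*}\) estimates mark descriptors with strong diagnostic power.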
The fifth and final assumption concerns the extent to which the EDD checklist
helps teachers make appropriate and confident diagnostic decisions and gives them a
positive perception of the checklist's diagnostic usefulness. The evidence needed to
support or reject this assumption was gathered primarily from teacher responses to a
questionnaire and in interviews. If teachers report that the EDD checklist helped them
make appropriate and confident diagnostic decisions and has the potential to positively
impact diagnosing ESL academic writing skills and improving their instructional
practices, it will support this assumption. However, if the checklist does not function as
intended and its use is thought to bring about potentially negative consequences, its
score-based interpretation might not be valid.
Research Questions
The purposes of this research were (a) to develop a new diagnostic assessment
scheme called the Empirically-derived Descriptor-based Diagnostic (EDD) checklist to
assess ESL academic writing skills, and (b) to validate the checklist's score-based
interpretations and uses, drawing on multiple data sources and diverse perspectives.
Argument-based approaches to validity provided an overarching logical framework that
guided the development of the EDD checklist and justified its score-based interpretations
and uses. The five assumptions addressing the different aspects of interpretive arguments
were subsequently used to formulate the central research questions of this study:
1) What empirically-derived diagnostic descriptors are relevant to the construct
of ESL academic writing?
2) How generalizable are the scores derived from the EDD checklist across
different teachers and essay prompts?
3) How is performance on the EDD checklist related to performance on other
measures of ESL academic writing?
4) What are the characteristics of the diagnostic ESL academic writing skill
profiles generated by the EDD checklist?
5) To what extent does the EDD checklist help teachers make appropriate
diagnostic decisions and have the potential to positively impact teaching and
learning ESL academic writing?
Significance of the Study
This study will make significant contributions to theories of diagnostic L2
writing assessment and will have direct implications for instructional practices. Four
research areas are of particular relevance: (a) identification of the ESL writing construct,
(b) development of a diagnostic ESL writing assessment scheme, (c) application of
psychometric diagnostic models to performance assessment, and (d) integration of
feedback research in L2 writing and the diagnostic approach in educational assessment.
First and foremost, this study will enable researchers and test developers to
better understand the construct of ESL writing. Despite abundant research in L2 writing
theories, few studies to date have attempted to identify the latent structure of ESL writing
using both substantive and statistical approaches. This study has empirically identified
assessment criteria using ESL teachers' think-aloud verbal protocols, and has tested their
dimensional structure using a series of conditional covariance-based nonparametric
dimensionality techniques. The findings derived from these analyses will enrich theories
of ESL writing and will provide more specific direction for ESL writing assessment.
Second, this study reconceptualizes the current classification of L2 writing scales.
Despite the increasing need for diagnostic assessment, very few scales (e.g., Knoch's
[2007] diagnostic ESL academic writing scale) have been developed to diagnose students'
L2 writing performance. In addition, although a few researchers (e.g., Pollitt & Murray,
1996) have proposed diagnosis-oriented rating scales, these ideas have not been fully
realized within the context of L2 writing assessment. This study responds to the need for
research in this area, and contributes to the current L2 writing scale literature by
developing a diagnostic ESL writing assessment scheme and validating its use.
Third, this study demonstrates the ways in which a psychometric diagnostic
model can be applied to performance assessment. Despite an increasing interest in
assessing productive language skills, the current applications of diagnostic models have
been limited to multiple-choice tests that measure only receptive language skills. This
limited approach has prevented a thorough investigation of students' speaking and
writing performance, confining diagnostic studies largely to reading and listening.
The ways in which this study has overcome this constraint are unique, and can
be extended to other diagnostic performance assessments.
Finally, this study fills a gap that exists between feedback research in L2 writing
and the diagnostic approach in educational assessment. Despite the same overarching
goal, the research focus in these two areas has been directed in different ways. Most
feedback research in L2 writing examines the effect of different types of feedback on L2
writing using a qualitative method or case studies, while diagnostic educational
assessment is focused primarily on developing and implementing a psychometric
diagnostic model using large-scale test data. This study expands the scope of feedback
research in L2 writing by introducing a new measurement technique and opening an
avenue for much-needed additional research.
Chapter Overview
There are seven chapters in this thesis. Chapter 1 provides an overview of the
research problem, focusing on the five validity assumptions that guided the checklist's
development and validation. Chapter 2 reviews relevant literature, giving special
attention to the theoretical frameworks of L2 writing assessment and diagnostic
assessment. Chapter 3 describes the methodology used in this study, and provides
information about participants, instruments, and data collection and analysis procedures.
Chapter 4 discusses the ways in which the EDD checklist was developed, and presents
the final checklist. Chapters 5 and 6 report the checklist's evaluation outcomes. Finally,
Chapter 7 synthesizes the research findings and discusses areas of future research.
Definitions of the key terms used in this study are provided in Appendix A.
CHAPTER 2
REVIEW OF LITERATURE
Approaches to L2 Writing Assessment
Demystifying the Construct of L2 Writing
Writing in a second language (L2) is a multi-faceted and complicated language
skill. A variety of linguistic and non-linguistic components constitute the construct of L2
writing, and text- and writer-related variables directly or indirectly interact with writing
processes and products. Numerous attempts have been made to define the construct of L2
writing and to assess L2 writing ability, but no all-encompassing framework has yet been
described (Cumming, 1998, 2001, 2002; Cumming, Kantor, Powers, Santos, & Taylor,
2000; Grabe, 2001). As Cumming (2001) noted:
Unfortunately, as we all know, there is no generally agreed-on definition of this
construct, let alone any substantiated model that is vying for this status. I know
all too well myself, from having tried over several years to start to construct,
with little empirical success, such a model in one setting (see Cumming & Riazi,
2000). Moreover, in recently reviewing the past 5 years' published research, … I
was only able to affirm that research has recently highlighted the
multidimensionality of L2 writing. (p. 214)
This view on the multidimensional nature of L2 writing was highlighted in the
development of a framework for the writing subtest of the 2000 Test of English as a
Foreign Language (TOEFL). Cumming et al. (2000) framed the test's guiding principle
by exploring multiple facets of a workable writing conception rather than a rigorous
writing construct, thereby realistically approaching what L2 writing ability really is.
Grabe's (2001) perspective differed slightly, relying on theoretical models that
have explanatory and predictive power to describe writing performance in a particular
setting. Although he concluded that these theories fall short of serving as an
overarching framework for an L2 writing construct, they do seem to provide useful
insight into how L2 writing ability is organized and conceptualized. Two positions on
writing-as-process are worth particular mention: the cognitive view (e.g., Bereiter &
Scardamalia, 1987; Flower & Hayes, 1981; Kellogg, 1996) and the socio-contextual view
(e.g., Grabe & Kaplan, 1996; Hamp-Lyons & Kroll, 1997; Hayes, 1996; Sperling, 1996).
Flower and Hayes (1981) characterized writing as a cognitively complex mental act in
which three sub-processes interact: planning, translating, and reviewing. This view
assumes that writing occurs in a nonlinear and recursive manner, with overlapping
process components. The socio-contextual view of writing, on the other hand, expands on
the cognitive model by taking additional variables that could affect writing performance
into account. Hayes (1996) reframed writing as an individual-environmental interaction
by focusing on such individual components as motivation and affect, cognitive processes,
working memory and long-term memory, and on such environmental components as
audience and writing task and the medium of writing. His view on writing is clearly
illustrated as follows:
Indeed, writing depends on an appropriate combination of cognitive, affective,
social, and physical conditions if it is to happen at all. Writing is a
communicative act that requires a social context and a medium. It is a generative
activity requiring motivation, and it is an intellectual activity requiring cognitive
processes and memory. No theory can be complete that does not include all of
these components. (p. 5)
In an attempt to organize the parameters involved in writing into a set, Grabe and
Kaplan (1996) proposed a detailed taxonomy of writing skills, knowledge bases, and
processes built on two theories: communicative competence (Bachman, 1990; Canale &
Swain, 1980) and ethnography of writing. The taxonomy was developed by identifying
situation variables such as settings, tasks, tests, and topics, and integrating them with
writer variables such as linguistic, discourse, and sociolinguistic skills and strategies. Grabe
and Kaplan suggested that this taxonomic approach could provide valuable insights to
researchers, since most writing research is conducted without full consideration of factors
that could affect writing processes and outcomes.
Although these theoretical models contributed greatly to a general understanding
of how writing is organized and conceptualized, they originated in L1 writing
development, a context with limited applications in L2 writing (Grabe, 2001).
Acknowledging the absence of L2-specific models of writing, Silva (1990) suggested
that (a) L2 writing theory, (b) research on the nature of L2 writing, (c) research on L2
writing instruction, (d) L2 writing instruction theory, and (e) L2 writing instruction
practice should be integrated in such model building.
Cumming (1997) and Leki, Cumming, and Silva (2008) looked at the problem
from a somewhat different perspective. Instead of relying on unsubstantiated theories,
they presented several empirical approaches for defining and validating L2 writing ability.
One approach is to analyze the characteristics of written compositions by utilizing such
discourse analytic measures as morphological and syntactic features, and lexical and
grammatical errors. Another approach focuses on the rater perceptions and behaviors in
order to verify existing rating scales or empirically explore evaluation criteria (Connor &
Carrell, 1993; Cumming, 1990; Cumming, Kantor, & Powers, 2001, 2002; Lumley, 2002,
2005; Milanovic, Saville, & Shuhong, 1996; Sakyi, 2000; Smith, 2000; Vaughan, 1991).
These two approaches originated in two different areas of research: second language
acquisition (SLA) and language testing (LT), respectively.
The review of current L2 writing research suggests that, despite a concerted
effort to define the construct of L2 writing, no single theory explains what L2 writing
ability is and how it interacts with other cognitive and contextual variables; however, the
two methodological approaches related to discourse analysis and rater perceptions
are frequently used to determine the qualities and dimensions of L2 writing. If
the construct of L2 writing can be reliably and validly operationalized using these
methods, valid inferences can be made about students‟ L2 writing ability.
Discourse Analytic Approach
Discourse analytic measures, or objective measures (such as the number of T-units, error-free clauses per T-unit, etc.), are increasingly used as a means of quantifying the quality of L2 writing, and are believed to be reliable indicators of L2 writing
proficiency.2 These measures enable researchers to quantitatively describe observable
characteristics or qualities of writing performance by tallying the frequencies or
calculating the ratios of certain linguistic features that occur in a written corpus. Many objective measures are organized into theoretical taxonomies that help to gauge the subcomponents of L2 writing ability. For example, Wolfe-Quintero, Inagaki,
and Kim (1998) conducted a comprehensive analysis investigating the relationship
between L2 writing development and the frequencies, ratios, and indexes of accuracy,
fluency, and complexity measures. Acknowledging that most such measures tend to be
2 Hunt (1970) described T-units as "the shortest units into which a piece of discourse can be cut without
leaving any sentence fragments as residue” (p. 188).
used in a more impressionistic than theoretical manner, they reviewed those that were
used in 39 studies of second or foreign language writing and attempted to identify the
most reliable and valid indicators of development in L2 writing. Wolfe-Quintero et al.
hypothesized that a linear progression of these measures would indicate increasing L2
writing proficiency and operationalized this proficiency to include such variables as
program levels, school levels, classroom grades, standardized tests, rating scales,
comparison with native speakers, and short-term changes in classes.
Wolfe-Quintero et al.'s (1998) review of the discourse analytic approach suggests that accuracy is the most researched of all the measures. Even though there is a lack of consensus among SLA researchers with regard to how to define and operationalize this concept (Arnaud, 1992; Casanave, 1994; Homburg, 1984; Larsen-Freeman, 1978, 1983; Larsen-Freeman & Strom, 1977; Perkins, 1980; Vann, 1979; see also Polio's [1997] extensive review of linguistic accuracy in L2 writing research), the notion of freedom from error (Foster & Skehan, 1996) seems to be the most widely
accepted definition. Errors have been identified in various ways by researchers. Bardovi-
Harlig and Bofman (1989) classified them as syntactic, morphological, or lexical-
idiomatic, while Nas (1975; as cited in Homburg, 1984) categorized errors as first-,
second-, and third-degree, based on their gravity. Methods of counting written errors are
also many and varied. According to Wolfe-Quintero et al. (1998), two approaches are prevalent: a focus on the number of error-free production units (e.g., error-free T-units, error-free clauses, etc.) and a focus on the number of errors that occur within certain production units (e.g., errors per clause, grammatical errors per word, etc.). After reviewing these measures extensively, they suggest that errors per T-unit and error-free T-units per T-unit are the most useful for determining accuracy in L2 writing.
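The two indices singled out here are simple ratios over hand-coded units. As an illustration only (the function and the sample counts below are hypothetical, and the sketch assumes the composition has already been segmented into T-units and error-annotated by a human coder), they could be computed as follows:

```python
# Hypothetical sketch: computing the two accuracy indices recommended by
# Wolfe-Quintero et al. (1998) from hand-annotated T-units. The input is
# one error count per T-unit, produced by a human coder.

def accuracy_indices(errors_per_unit):
    """errors_per_unit: list with one error count per T-unit."""
    n = len(errors_per_unit)
    return {
        # total errors divided by the number of T-units
        "errors_per_t_unit": sum(errors_per_unit) / n,
        # proportion of T-units containing no errors at all
        "error_free_t_units_per_t_unit": sum(1 for e in errors_per_unit if e == 0) / n,
    }

# A composition segmented into five T-units with 0, 2, 0, 1, and 0 errors:
print(accuracy_indices([0, 2, 0, 1, 0]))
# {'errors_per_t_unit': 0.6, 'error_free_t_units_per_t_unit': 0.6}
```

The computational step is trivial; as the studies reviewed here make clear, the real work lies in the human segmentation and error coding that precedes it.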
Less research has been done on fluency measures, possibly because the unique nature of fluency (i.e., automaticity) is difficult to gauge in written communication. Indeed, Polio (2001) questions whether fluency has any relation at all to quality of writing. Still, the extent to which a writer can fluently produce written language has been quantified using several measures. Of these, two analytic methods have drawn the interest of researchers: frequency techniques that count the number of words, verbs, clauses, sentences, etc., and ratio techniques that calculate the number of words per clause, words per sentence, words
per T-unit, etc. (Wolfe-Quintero et al., 1998). Despite the popularity of both of these
methods in SLA and L2 writing studies, Wolfe-Quintero et al. (1998) suggest that ratio
measures are more effective than frequency measures in assessing L2 writing
performance, and that T-unit length (i.e., words per T-unit), error-free T-unit length (i.e.,
words per error-free T-unit), and clause length (i.e., words per clause) are the three most
useful indicators of L2 writing, regardless of writing task, target language, or the ways in
which L2 writing proficiency level is determined.
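These three length ratios are likewise straightforward averages over hand-segmented units. A hypothetical sketch (the per-unit word counts are invented for illustration; segmentation is assumed done by hand):

```python
# Hypothetical sketch: the three length-based fluency indicators that
# Wolfe-Quintero et al. (1998) found most useful. Each is the mean number
# of words per unit.

def mean_length(words_per_unit):
    """Average words per unit (T-unit, error-free T-unit, or clause)."""
    return sum(words_per_unit) / len(words_per_unit)

# Invented counts for one composition:
t_unit_length = mean_length([12, 9, 15, 8])        # words per T-unit
ef_t_unit_length = mean_length([12, 9, 8])         # words per error-free T-unit
clause_length = mean_length([6, 5, 9, 7, 8, 9])    # words per clause
print(t_unit_length)  # 11.0
```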
Complexity, which has been theorized to encompass multiple dimensions of
variation, density, and sophistication, has been examined from both grammatical
(Bardovi-Harlig, 1992; Bardovi-Harlig & Bofman, 1989; Casanave, 1994; Cooper, 1976,
1981; Hinkel, 2003; Homburg, 1984; Ishikawa, 1995; Kameen, 1979; Monroe, 1975;
Shaw & Liu, 1998; Vann, 1979) and lexical perspectives (Engber, 1995; Harley & King,
1989; Hinkel, 2003; Laufer, 1991; Laufer & Nation, 1995; Linnarud, 1986; McClure,
1991; Shaw & Liu, 1998). While the grammatical complexity of writing has been judged primarily by the presence of specific grammatical features (e.g., passives, adverbial clauses, nominal clauses, etc.) or the ratios of those specific grammatical features within certain production units (e.g., adverbial clauses per T-unit, coordinate clauses per T-unit, passives per sentence, etc.), lexical richness has tended to be assessed by ratio measures
(Wolfe-Quintero et al., 1998). According to Wolfe-Quintero et al. (1998), three types of ratio measures are of specific interest: type/token ratios (e.g., word types per words, verb types per verbs, etc.), type/type ratios (e.g., sophisticated word types per word types, basic word types per word types, etc.), and token/token ratios (e.g., lexical words per words, sophisticated lexical words per lexical words, etc.). They reported that clauses per T-unit and dependent clauses per clause were significantly related to the grammatical complexity of L2 writing, and that a word variation measure (i.e., total number of different word types divided by the square root of two times the total number of words) and a lexical sophistication measure (i.e., total number of sophisticated word types divided by total number of word types) were significantly related to lexical complexity.
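The two lexical measures are stated as explicit formulas, so they can be sketched directly. In the hypothetical example below, the "sophisticated" word set is invented for illustration; the studies reviewed here derive such lists from frequency-based word lists:

```python
import math

# Hypothetical sketch of the two lexical measures described above
# (Wolfe-Quintero et al., 1998).

def word_variation(tokens):
    """Word types divided by the square root of twice the number of words."""
    return len(set(tokens)) / math.sqrt(2 * len(tokens))

def lexical_sophistication(tokens, sophisticated_words):
    """Sophisticated word types divided by total word types."""
    types = set(tokens)
    return len(types & sophisticated_words) / len(types)

tokens = "the model is a simple model but the analysis is rigorous".split()
# 11 tokens, 8 types; 2 of the 8 types are on the invented "sophisticated" list
print(lexical_sophistication(tokens, {"analysis", "rigorous"}))  # 0.25
print(word_variation(tokens))  # 8 / sqrt(22)
```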
Research has also focused on ways in which textual features that extend across
sentence boundaries can be quantified, particularly the extent to which textual structure is
tied together in extended discourse. This concept of cohesion and coherence has given
rise to a large body of research into the pertinent measures. Cohesion refers to explicit
linguistic cues that indicate interrelations between different parts of discourse (Reid,
1992), whereas coherence involves the much broader and more complicated interactions in a writer's cognitive processes (Beaman, 1984; as cited in Reid, 1992); cohesion is thus
regarded as a subcomponent of coherence (Halliday & Hasan, 1976; McCulley, 1985;
Yule, 1985). In a seminal publication entitled Cohesion in English, Halliday and Hasan
(1976) discuss the taxonomy of cohesive devices, which specifies five different types of
cohesive ties: substitution, ellipsis, reference, conjunction, and lexical cohesion (see their
work for a more in-depth discussion of these cohesive ties). The influence of this
pioneering work has resulted in a great deal of research to quantify the extent to which a
text holds together and to identify cohesive characteristics that differentiate good and
poor writing (Crowhurst, 1987; Evola, Mamer, & Lentz, 1980; Fitzgerald & Spiegel,
1986; Jafarpur, 1991; McCulley, 1985; Neuner, 1987; Reid, 1992; Tierney & Mosenthal,
1983; Witte & Faigley, 1981). The results of these studies have proven to be mixed; for
example, Witte and Faigley (1981) suggested that good essays tend to show a higher
density level in cohesion than poor essays, but Neuner (1987) found that good writers
used none of the cohesive devices more frequently than did poor writers.
The more complicated intersentential relationship, known as coherence, has been
examined using three major approaches. The first approach involves Vande Kopple's (1985) different types of metadiscourse: text connectives, code glosses, illocution
markers, narrators, validity markers, attitude markers, and commentaries (for a more in-
depth discussion of metadiscourse types, see Vande Kopple, 1985). In a study comparing
good and poor ESL essays, Intaraprawat and Steffensen (1995) found that good essays contained all types of metadiscourse features more often than poor essays did, and that good writers also utilized a wider range of metadiscourse markers in their writing. Cross-cultural differences in metadiscourse use have been examined by Crismore, Markkanen, and Steffensen (1993), who suggested that Finnish students have a higher density of metadiscourse and use hedging devices more frequently than U.S. students.
The second approach focuses on how discourse topics develop through
sequences of sentences. Lautamatti (1978, 1987) developed a procedure called topical
structure analysis (TSA) to characterize the nature of coherence within texts and
identified three topical progressions: parallel, sequential, and extended parallel (for a
more in-depth discussion of TSA, see Lautamatti, 1978, 1987). In a study that applied
TSA to L1 writing, Witte (1983a) found that more proficient writers tended to use
parallel and extended parallel progressions more often than less proficient writers;
conversely, less proficient writers tended to use more sequential progression. On the
other hand, Schneider and Connor's (1990) L2 writing study reported that high-rated essays contained more sequential progression, while intermediate- and low-rated essays contained more parallel progression. They suggested that the inconsistent findings might be attributable to differences in coding schemes and to the limited information reported about coding reliability.
The final approach associated with coherence measures is topic-based analysis
(Watson Todd, 1998; Watson Todd, Thienpermpool, & Keyuravong, 2004). In searching
for an objective measure of coherence, Watson Todd (1998) and Watson Todd et al. (2004)
proposed topic-based analysis consisting of multiple procedures: (a) identifying key
concepts, (b) identifying relationships between key concepts, (c) linking relationships
into a hierarchy, (d) mapping discourse onto the hierarchy, and (e) identifying topics and
measuring coherence. Based upon an analysis of 28 written compositions, Watson Todd
et al. suggested that coherence evaluated using this methodology correlated closely with
coherence marks assigned by teachers.
The review of the discourse analytic approach suggests that most SLA studies focus on (a) accuracy, (b) fluency, (c) complexity, (d) cohesion, and (e) coherence when conceptualizing L2 writing ability. However, a careful examination of this approach indicates that discourse analysis does not address such non-linguistic aspects of L2 writing as content relevance, effectiveness, originality, or creativity. This method is therefore rather limited in its capacity to explain all of the factors that could affect L2 writing competence. As Péry-Woodley (1991) noted, "researchers became much more cautious not to establish over-simplistic links between surface features of texts and language development outside of discourse considerations, and adopted a more skeptical and critical stance toward such [notions] as maturity and complexity" (pp. 73-74).
Caution should also be used when objective measures are utilized in assessment.
As Ishikawa (1995) and Perkins (1983) argued, discourse analysis can be both time-
consuming and inefficient, particularly for classroom assessments. Even when teachers
take the time necessary to objectively measure their students' writing, it is not always clear how the results can help students understand their own L2 writing ability. What
does a high score on “words in T-units” mean practically? What is the importance of a
high score on “third-degree errors” versus a low score on “lexical accuracy index”?
Should specific scores on such measures be an instructional goal? These measures are too fine-grained to be useful in a realistic instructional setting, and profiling L2 writing ability in this way would therefore not be beneficial for teachers and students. As
Alderson (2005) pointed out, a discourse analysis approach focusing on a narrowly-
defined aspect of grammatical or morphological rules might not be the best way of
diagnosing L2 writing ability.
Rater Perceptions and Rating Scales
Rater perceptions of L2 writing.
Another way of examining L2 writing ability is by looking at rater perceptions
and rating scales. Most studies in this line of research utilized think-aloud verbal
protocols to determine rater scoring behaviors or processes, to empirically explore the
assessment criteria that they use, or to verify the accuracy of existing rating scales
(Connor & Carrell, 1993; Cumming, 1990; Cumming, Kantor, & Powers, 2001, 2002;
Lumley, 2002, 2005; Milanovic, Saville, & Shuhong, 1996; Sakyi, 2000; Smith, 2000;
Vaughan, 1991). Cumming (1990) identified 28 decision-making and assessment criteria
used by experienced assessors to evaluate L2 written compositions. These were
categorized into four foci (self-control, content, language, and organization) and two
strategies (interpretation and judgment). Each focus contained subcriteria further
specifying rater evaluation behaviors or criteria. For example, a focus on language was
broken down into (a) classifying errors, (b) editing phrases, (c) establishing level of
comprehensibility, (d) establishing error frequency, (e) establishing command of
syntactic complexity, (f) establishing appropriateness of lexis, and (g) rating overall
language use. Similar criteria were found in research on the new TOEFL. Cumming et al. (2001,
2002) documented 27 decision-making processes exhibited by experienced writing
assessors on ESL/EFL compositions; these were further characterized by three foci (self-
monitoring, rhetorical and ideational, and language) and two strategies (interpretation
and judgment).
Research interest has also been directed toward the ways in which evaluation
criteria for a rating scale could interact with rater perceptions and judgments. In a pivotal
study by Vaughan (1991), nine raters verbalized their thinking processes during the rating
process using a six-point holistic scale. Raters' comments were categorized into 14 general evaluation criteria, and the six most frequently mentioned assessment elements
were identified as (a) quality of content, (b) legibility of handwriting, (c) tense/verb
problem, (d) punctuation/capitalization error, (e) quality of introduction, and (f)
morphology/word form error. Of these, Vaughan found that raters most frequently
focused on content problems.
In a large-scale EFL testing context involving two Cambridge examinations
(First Certificate in English [FCE] and Certificate of Proficiency in English [CPE]),
Milanovic, Saville, and Shuhong (1996) asked 16 raters from diverse backgrounds to
report the evaluation components that they focused on when assessing EFL writing. A
wide range of elements were identified, including (a) length, (b) legibility, (c) grammar,
(d) structure, (e) communicative effectiveness, (f) tone, (g) vocabulary, (h) spelling, (i)
content, (j) task realization, and (k) punctuation. They also found that raters focused
more on vocabulary and content in high-level essays, and on communicative
effectiveness and task realization in intermediate-level essays.
Similar findings were reported by Smith (2000), who examined the ways in
which raters interpret and apply evaluation criteria in the Certificates in Spoken and
Written English (CSWE). Based upon six raters' verbal accounts, nine textual features were identified that described the examinees' writing performance: (a) grammar, (b) organization, (c) cohesion, (d) sentence structure, (e) punctuation/capitalization, (f) spelling, (g) handwriting, (h) length of text, and (i) lexical choice. In contrast, the study by Sakyi (2000) sought more global assessment criteria. Six raters were asked to describe their rating processes using a five-point scale, with their comments categorized as
focusing on (a) content and organization, (b) grammatical and mechanical errors, and (c)
sentence structure and vocabulary.
In a more recent study, Lumley (2002) examined the ways in which four
experienced raters applied a rating scale on L2 written compositions. The scale provided
to the raters was developed for the writing subtest of the Special Test of English
Proficiency (STEP), and had four evaluation criteria: (a) task fulfillment and
appropriateness, (b) conventions of presentation, (c) cohesion and organization, and (d)
grammatical control. The findings indicated that even though the scale content seemed to
accurately reflect what raters pay attention to, there were conflicts among the descriptors
within the same criteria at the same level. The raters also focused on two additional
evaluation criteria (quantity of ideas and explicit cohesive devices) that were not
included in the STEP rating scale.
The analysis of rater perceptions indicates that there is some consensus on which
aspects of L2 writing ability should be assessed. Typically, three elements, (a) content, (b) language use, and (c) organization, were consistently used. While the substance of these elements is essentially the same across studies, the labels used to refer to them differ. For example, when written content was the focus of raters' assessments, it was called quality of content (Vaughan, 1991), quantity of ideas (Lumley, 2002), or task realization (Milanovic et al., 1996). Organization was also variably referred to as structure (Milanovic et al., 1996) and use of explicit cohesive devices (Lumley, 2002). Language use showed the widest range of granularity: it generally included grammatical, lexical, and mechanical features, but the grain size of the features differed drastically. For example, Vaughan (1991) was more specific than Smith (2000), breaking grammatical errors into smaller units such as tense and verb problems. It is interesting to note that raters also paid attention to legibility of handwriting, which would seem to be a construct-irrelevant factor in L2 writing.
Rating scales in L2 academic writing.
The construct of L2 writing can also be approached by examining existing rating
scales. Rating scales represent the underlying construct of a test and help raters to focus
on the skills or abilities intended to be assessed (Luoma, 2004; McNamara, 1996; Weigle,
2002). A content analysis of existing rating scales should therefore provide a good basis
for understanding the multi-faceted and complicated construct of L2 writing.
In a large-scale testing setting, the TOEFL is perhaps the best-known of all ESL
academic tests. It assesses the writing ability required in an academic setting, while its
rating scale scores the overall quality of the writing based on (a) development, (b)
organization, and (c) appropriate and precise use of grammar and vocabulary
(Educational Testing Service, 2007).3 Another well-known ESL test is the International
English Language Testing System (IELTS), in which academic writing tasks are scored
based on (a) task achievement, (b) coherence and cohesion, (c) lexical resource, and (d)
grammatical range and accuracy (University of Cambridge, British Council, & IELTS
Australia, 2007). The Michigan English Language Assessment Battery (MELAB) has
similar evaluation criteria: (a) clarity and overall effectiveness, (b) topic development, (c)
organization, and (d) the range, accuracy, and appropriateness of grammar and
vocabulary (University of Michigan, 2003).
In a classroom assessment context, the rating scale created by Jacobs, Zinkgraf, Wormuth, Hartfiel, and Hughey (1981) might be the best known and most comprehensive. It
evaluates ESL written compositions based on (a) content, (b) organization, (c) vocabulary,
(d) language use, and (e) mechanics. The unique characteristic of this rating scale is that
each major criterion has fine-grained subcriteria; for example, effectiveness of language
use is assessed by the elements associated with syntactic structure, errors of agreement,
tense, number, word order/function, articles, pronouns, and prepositions.4
Most rating scales thus appear to have similar evaluation criteria (i.e., content,
language use, organization), but slightly different grain sizes. Different rating scales
might tap fundamentally the same underlying construct of L2 academic writing, but with
different levels of specificity. For example, the TOEFL rating scale assesses language-
specific factors using one general criterion, while the IELTS and Jacobs et al. use finer-
grained criteria such as (a) vocabulary, (b) grammatical range and accuracy, and (c)
mechanics. Different wordings or terminologies may also be used to describe evaluation
features that essentially refer to the same component. While the IELTS defines written
structure as coherence and cohesion, the TOEFL and MELAB define it as development
and organization.
3 Although the new TOEFL contains two types of writing tasks (integrated and independent), it is the rating scale for independent writing tasks that is discussed in this section; a discussion of the integrated writing tasks is beyond the scope of this thesis.
4 In Jacobs et al.'s (1981) scale, language use focuses primarily on the use of grammatical knowledge in written text.
Summary
Efforts to explain the construct of L2 writing have been made based upon
theoretical accounts, discourse analysis, and rater perceptions and rating scales. These
three approaches are synthesized in Table 1. Although the construct elements could be broken into even smaller units, this has not been done, in order to enhance comparability across the different approaches. The construct elements are therefore presented in a broad scheme, even though their granularity differs widely across approaches.
Grabe and Kaplan's (1996) theoretical taxonomy is unique in that it focuses not only on linguistic discourse skills and strategies, but also on sociolinguistic aspects of
writing ability. Detailed accounts are given for each skill and strategy category in their
taxonomy. In discourse analysis approaches, however, L2 writing ability is determined
based upon (a) accuracy, (b) fluency, (c) complexity, (d) coherence, and (e) cohesion.
Such objective measures include errors per T-unit, words per T-unit, clauses per T-unit, number of discourse markers, and so on. The primary limitation of the discourse analytic method is that it cannot conceptualize such non-linguistic aspects of L2 writing as content relevance, originality, or creativity. These measures are also too fine-grained to be useful in a realistic instructional setting.
From an assessment perspective, the approaches based on rater perceptions and rating scales seem more useful than the others for organizing the underlying construct of L2 writing. Although some variation in the specificity of evaluation criteria exists, they tap the same fundamental substance of L2 writing ability: (a) content, (b) language use, and (c) organization. It is noteworthy that most large-scale, institutional rating scales (e.g., TOEFL and IELTS) do not consider text length and handwriting critical evaluation criteria, while raters often do (e.g., Milanovic et al., 1996; Smith, 2000; Vaughan, 1991). This raises the interesting question of whether text length and handwriting should be considered construct-relevant factors.
Despite different theoretical orientations, the three approaches provide
convergent evidence as to how the construct of L2 writing is defined and operationalized,
assessing content, organization, and language use (vocabulary, grammar, and mechanics)
as a common denominator. However, it should be noted that the nature of L2 writing is
malleable rather than fixed, and embodied within a specific context. Defining a construct
without considering its contextual variables would be useless. As Cumming et al. (2000)
insightfully noted:
Although educators around the world regularly work with implicit
understandings of what constitutes effective English writing, no existing research
or testing programs have proposed or verified a specific model of this, such as
would be universally accepted. Indeed, current ESL/EFL writing tests operate
with generic rating scales that can reliably guide the scoring of compositions but
which fail to define the exact attributes of examinees' texts or the precise basis
on which they vary from one another [italics added]. (p. 27)
The next section will continue the discussion of theoretical and empirical issues
associated with rating scales and their development.
Table 1
Synthesis of Writing Construct Elements

Construct elements (rows): sociolinguistic knowledge; content; language use (vocabulary, grammar, mechanics); organization; text length; handwriting.
Sources compared (columns): Grabe & Kaplan (1996); discourse analytic method; Cumming et al. (2002); Vaughan (1991); Milanovic et al. (1996); Smith (2000); Sakyi (2000); Lumley (2002); TOEFL (2007); IELTS (2007); MELAB (2003); Jacobs et al. (1981).
[The cell markings indicating which sources address which construct elements did not survive transcription.]
Demystifying Rating Scales
Rating Scale Types
Rating scales have long been used in performance assessment, and are widely
considered to be useful tools for judging language performance. Rating scales provide
common metric systems or standards that enable comparisons across different languages
and contexts (Bachman & Savignon, 1986). Additionally, as many researchers have
suggested (e.g., Luoma, 2004; McNamara, 1996; Weigle, 2002), they function as a
blueprint that specifies what skills or abilities should be assessed, and further represent
the underlying construct that the test aims to assess. According to Davies, Brown, Elder,
Hill, Lumley, and McNamara (1999), a rating scale can be defined as follows:
A scale for the description of language proficiency consisting of a series of
constructed levels against which a language learner's performance is judged.
Like a test, a proficiency (rating) scale provides an operational definition of a
linguistic construct such as proficiency. Typically such scales range from zero
mastery through to an end-point representing the well-educated native speaker.
The levels or bands are commonly characterized in terms of what subjects can do
with the language (tasks and functions which can be performed) and their
mastery of linguistic features (such as vocabulary, syntax, fluency and cohesion).
(pp. 153-154)
As this indicates, rating scales are typically expressed in numerical values or descriptive
statements to assess performance on a particular task. In order for such scores to be
meaningful, rating scales should be associated not only with the language constructs to be assessed, but also with the purposes and the audiences for the assessment within a specific context (Alderson, 1991; Luoma, 2004).
Rating scales can be classified in a variety of ways. Alderson (1991) divided them into three types according to purpose: user-oriented, assessor-oriented, and constructor-oriented. A user-oriented scale allows those who are interested in using the
ratings (school or job applicants, admission officers, and so on) to interpret the meanings
of the reported ratings, while an assessor-oriented scale is developed to guide assessors'
rating processes by specifying the ways in which performance features should be rated. A
constructor-oriented scale, on the other hand, provides test constructors with the guiding
specifications a test should contain. Luoma (2004) proposes similar rating scale classifications:
rater-oriented, examinee-oriented, and administrator-oriented. A rater-oriented scale helps
raters to make decisions, while an examinee-oriented scale provides performance
information about examinees' strengths and weaknesses. Finally, an administrator-
oriented scale provides concise overall performance information.
Brindley (1998) takes a slightly different view, distinguishing between behavior-
based and theory-derived rating scales. A behavior-based scale describes features of
language use within a specific context, whereas a theory-derived scale describes
language ability without dependence on specific content and context. This classification
is conceptually consistent with what Bachman (1990) calls “real-life” and “interactive-
ability" approaches (p. 325). Within Bachman's framework, a real-life scale would view
language ability as a unitary concept, not distinguishing the ability to be assessed from
the characteristics of the context in which language performance is elicited. An
interactive-ability scale views language ability as a multi-componential construct,
measuring language ability with no consideration for particular contextual features
(Bachman, 1990).
Rating scales also vary in terms of scoring approach. Cooper (1977)
differentiated between holistic and analytic evaluation, stating that holistic evaluation
refers to “any procedure which stops short of enumerating linguistic, rhetorical, or
informational features of a piece of writing” (p. 4). Analytic evaluation, on the other hand,
involves counting and tallying occurrences of particular linguistic features. He also
characterized holistic evaluation as a quick and impressionistic procedure for placing,
scoring, or grading written texts, and proposed several types of holistic evaluation,
including dichotomous scale, primary trait scoring, and general impression marking.
Weigle (2002) classifies different types of rating scales on the basis of generalizability
and the use of single or multiple scores; holistic and analytic rating scales are intended to
be generalized across writing tasks, but differ in whether they provide a single score or
multiple scores. A primary trait rating scale, on the other hand, yields a single score by
focusing on one very specific writing feature.
Holistic rating scales, also known as global or impressionistic rating scales,
assume that language ability is a single unitary ability (Bachman & Palmer, 1996) and
that a score for the whole is not equal to the sum of separate scores for the parts (Goulden,
1992). When using a holistic rating scale to assess writing performance, raters usually
take note of various aspects of a written text simultaneously, assigning a single score that
best reflects their general impression of that text. Holistic rating scales in L2 writing
include the American Council on the Teaching of Foreign Languages (ACTFL)
Proficiency Guidelines (American Council on the Teaching of Foreign Languages
[ACTFL], 2001), the TOEFL rating scale (ETS, 2007), the IELTS rating scale
(University of Cambridge, British Council, & IELTS Australia, 2007), and the MELAB
rating scale (University of Michigan, 2003). First published in 1986, the ACTFL Writing
Proficiency Guidelines define and measure a learner's functional writing competence
using a nine-point rating scale that ranges from Novice to Superior. The Guidelines
describe positive rather than negative aspects of writing for each level, focusing on the
kinds of tasks writers can do with their respective writing proficiency. The nature of the
TOEFL writing scale is slightly different from that of the ACTFL Guidelines. The
TOEFL writing subtest is intended to assess L2 academic writing ability in a large-scale
assessment setting, and contains six holistic score levels. Overall writing quality is
assessed by taking into account a variety of aspects, such as content, language use,
and organization. Similarly, the IELTS and the MELAB writing scales
produce a single score based on the overall quality of written compositions, and contain 9
and 10 score bands, respectively.
White (1985) may be one of the best-known proponents of holistic rating scales
(Hamp-Lyons, 1991; Weigle, 2002). As he argued, holistic rating scales have several
advantages that other types of rating scales do not. Specifically, they are an economical
and practical means of scoring in that raters usually read a text just once (rather than
several times) and provide a single score (rather than multiple scores) in one minute or
less (Hamp-Lyons, 1991). This speedy rating process is certainly a benefit for raters and
testing agencies interested in saving time and money. Holistic rating scales also allow
raters to focus on the strengths of writing samples rather than their weaknesses, enabling
writers to be evaluated according to what they have done well (Weigle, 2002; White,
1985). Finally, holistic rating scales represent a humanistic approach to understanding the
authentic nature of writing versus “analytic reductionism” (White, 1985, p. 33).
Therefore, a holistic approach makes it possible to appreciate writing as a unified and
central human activity, not as segments split into detached activities (White, 1985).
Despite their apparent advantages, holistic rating scales have been severely
criticized for several reasons. One major weakness is their inability to supply diagnostic
information about writers' strengths and weaknesses beyond a relative rank-ordering
(Charney, 1984; Davies et al., 1999; Hamp-Lyons, 1991; Luoma, 2004; Weigle, 2002;
White, 1985). A given writer might be good at producing grammatically correct
sentences, but not at developing content or organizing thesis sentences. In a case like this,
a single score placing a writer's performance at a typical level of ability cannot
accurately identify what he or she has done well or poorly. The inability to accurately
illustrate the multi-faceted nature of writing is more problematic in an L2 context, since
fine-grained diagnostic feedback is necessary for L2 writers' interlingual development
(Hamp-Lyons, 1991; Weigle, 2002). As Hamp-Lyons (1995) rightly argued, “a holistic
scoring system is a closed system, offering no windows through which teachers can look
in and no access points through which researchers can enter" (pp. 760-761).
Another criticism of holistic ratings lies in the difficulty of matching writing
texts to appropriate levels on a scale (Bachman & Palmer, 1996). It is often the case that
not all the evaluation criteria of a holistic scale are met concurrently, so that a rater must
(whether consciously or unconsciously) prioritize some criteria over others. For example,
when a writing sample matches up with the Level 2 descriptors of a rating scale in terms
of content development but not language use, raters must make a decision about the level
at which the text should be matched. The possibility that raters explicitly or implicitly
weigh particular features of writing is unavoidable, making interpretation of scores even
more difficult (Bachman & Palmer, 1996; Goulden, 1994).
By contrast, analytic rating scales assume that the sum of the separate scores
awarded to subcomponents of writing is equal to a single score awarded for the written
piece as a whole (Goulden, 1992). In an analytic scoring scheme, raters take note of
several aspects of writing and produce multiple ratings or subscores, which are then
weighted according to theoretical considerations or the test developer's specifications.
They can also be aggregated as a composite score according to test purposes. For
example, Jacobs et al.'s (1981) ESL composition profile describes five aspects of writing
ability (content, organization, vocabulary, language use, and mechanics) and gives them
different weights (30, 20, 20, 25 and 5 points, respectively). Subscores might be more
useful for students and teachers who want to identify the strengths and weaknesses in the
writing, whereas a composite score would be more useful for admission officers or
employment committees who simply wish to select a candidate with competent writing
skills.
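As a rough illustration of how such an analytic profile aggregates subscores into a composite, the weighting in Jacobs et al.'s (1981) profile can be sketched in Python. The component names and point maxima come from the profile as cited above; the function name and the sample ratings are hypothetical.

```python
# A minimal sketch of composite scoring under the Jacobs et al. (1981)
# ESL composition profile, assuming each component is rated directly
# within its allotted point range. The sample ratings are hypothetical.

PROFILE_MAX = {
    "content": 30,
    "organization": 20,
    "vocabulary": 20,
    "language_use": 25,
    "mechanics": 5,
}

def composite(subscores: dict) -> int:
    """Aggregate analytic subscores into a single composite score (out of 100)."""
    for component, score in subscores.items():
        if not 0 <= score <= PROFILE_MAX[component]:
            raise ValueError(f"{component} score {score} out of range")
    return sum(subscores.values())

# Hypothetical profile: subscores inform students and teachers,
# while the composite serves selection decisions.
ratings = {"content": 24, "organization": 15, "vocabulary": 16,
           "language_use": 18, "mechanics": 4}
print(composite(ratings))  # → 77
```

The subscores themselves would be reported as a profile for diagnostic use, while the single summed value serves the admission or employment purposes described above.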
The use of analytic rating scales in written language assessment has several
practical advantages. First, the ratings assigned to each component can be used to
diagnose the relative strengths and weaknesses of written texts (Bachman & Palmer,
1996; Hamp-Lyons, 1991; Hamp-Lyons & Henning, 1991; Weigle, 2002). Whereas a
single, holistic score can obscure variations in writing ability, analytic ratings make them
visible, enabling the construction of writing profiles (Hamp-Lyons & Henning, 1991).
Profile scores are particularly helpful for L2 writers, who are more likely than their L1
counterparts to show an uneven or marked profile across different areas of writing ability
(Hamp-Lyons, 1991; Weigle, 2002). Another advantage of analytic rating scales is their
reliability (Hamp-Lyons, 1991; Hamp-Lyons & Henning, 1991; Huot, 1996; Weigle,
2002); unlike holistic ratings, analytic rating schemes award multiple scores to a single
written text, which increases reliability. Finally, analytic rating scales can better
represent raters' cognitive processes (Bachman & Palmer, 1996). According to Bachman
and Palmer (1996), raters tend to consider such individual components as grammar,
content and vocabulary even when they are asked to sort written texts according to
overall quality. This behavior supports the use of analytic rating scales because they
reflect the underlying construct of L2 writing.
Although analytic rating scales are favored by many L2 writing experts, it should
be noted that they are time-consuming and expensive (Perkins, 1983; Weigle, 2002;
White, 1985). Because raters are required to make several judgments based on the
criteria specified on the scale, assessments can take longer than ratings that use a holistic
scale. Davies et al. (1999) pointed out that focusing on each separate component of
writing ability can also distract raters from the overall quality of writing samples. From a
theoretical point of view, White (1985) also questioned whether “writing quality is the
result of the accumulation of a series of subskills,” arguing that “writing remains more
than the sum of its parts and that the analytic theory that seeks to define and add up the
subskills is fundamentally flawed" (p. 123).
Unlike holistic and analytic rating scales, a primary trait rating scale is used to
assess an important writing trait that is required to accomplish a certain writing task
(Lloyd-Jones, 1977). This type of rating scale assumes that writing should be assessed
within a specific context, and that different rating scales should be developed for every
writing task and prompt. A typical primary trait rating scale consists of (a) the
exercise/task, (b) a statement of the primary rhetorical trait to be elicited by the exercise,
(c) a hypothesis about performance on the exercise, (d) a statement of the relationship
between the exercise and the targeted primary trait, (e) a scoring guide, (f) writing
samples, and (g) a justification of scores (Lloyd-Jones, 1977). Lloyd-Jones (1977) argued
that while primary trait rating scales were originally developed to score essays from the
National Assessment of Educational Progress (NAEP), their basic principles can be
applied in other contexts. For example, they can be used to develop summative
assessment of students' writing ability, to make curriculum evaluations, or to provide
specific feedback on a particular writing task. As this methodology implies, the advantage
of a primary trait rating scale lies in classroom instruction and assessment (White, 1985).
Teachers can focus on one narrowly-specified feature at a time rather than on all
characteristics of a writing sample, while students can receive a detailed and precise
description of that specific feature. A greater advantage of this scale is that it can
contribute to curriculum development and evaluations (Cooper, 1977; Lloyd-Jones, 1977;
Perkins, 1983; White, 1985). Teachers can adjust their curricula based upon the
information gathered from the primary trait assessment, rendering both teaching and
learning more effective. The direct connection with classroom instruction will also make
a diagnostic approach to writing more likely and valuable.
Nonetheless, even these advantages are diminished when the development
process for primary trait rating scales is taken into account. Lloyd-Jones (1977) reports
that developing such a scale requires not only a substantial theoretical background in
rhetoric but also a great deal of time: preparation of just one exercise can take from 60 to
80 hours. Rating scales also need to be developed anew for each new writing task or
prompt. For these reasons, primary trait rating scales are mostly used for research
purposes or in a large-scale test context such as NAEP (Hamp-Lyons, 1991). Little
information is therefore available on how to apply such scales to L2 writing assessment
(Weigle, 2002).
The review of rating scales suggests that rating scale types are associated not
only with the language constructs to be assessed but also with the purposes and
audiences for the assessment within a specific context. Scoring methods also determine the type of
rating scale; holistic rating scales are efficient at placing writers into different proficiency
levels, whereas analytic and primary trait rating scales are better at identifying writers‟
strengths and weaknesses. Analytic rating scales are particularly useful for assessing
multiple facets of L2 writing, but cannot provide more fine-grained diagnostic
description beyond subscores across evaluation criteria. In this regard, primary trait
rating scales might be best suited for diagnostic purposes because they provide a detailed
and precise description of a narrowly-specified writing feature within a specific context.
However, their laborious developmental process is a critical problem. This suggests that a
new assessment approach is needed that can maximize the diagnostic gains from which
student-writers benefit. If such an approach requires assessment protocols different from
those of existing rating scales, these protocols should be developed and investigated on
both theoretical and empirical grounds.
Problems with Rating Scales
In spite of the extensive use of rating scales in the area of L2 assessment and
testing, surprisingly little is known about their theoretical and empirical underpinnings.
Accountability is also questionable, as most rating scales originate in committee-
produced systems, and little information is publicly known about their development
procedures. Serious problems are inherent in these rating scales, and a significant amount
of criticism and concern has focused particularly on the ACTFL Guidelines. Although the
ACTFL Guidelines and their predecessors and successors (e.g., the Foreign Service
Institute [FSI], Interagency Language Roundtable [ILR], and Australian Second
Language Proficiency Ratings [ASLPR] scales) have exerted a great deal of influence on
language instruction and assessment for several decades, criticisms are unavoidable.
As many researchers have pointed out, the most serious problem is that it is not
always clear how rating scale descriptors were created (or assembled) and calibrated (e.g.,
Brindley, 1998; Chalhoub-Deville, 1997; de Jong, 1988; Lantolf & Frawley, 1985; North,
1993; Pienemann, Johnson, & Brindley, 1988; Upshur & Turner, 1995). Although
proponents of the ACTFL Guidelines explicitly argued that they were developed
experientially “by observing how second language learners progress in the functions they
can express and comprehend" (Liskin-Gasparro, 1984, p. 37, which also offers a
comprehensive historical review of the ACTFL Guidelines), and "based largely on many years of
observation and testing in both the government context and in the academic community”
(Omaggio Hadley, 1993, p. 21), empirical evidence is scant. No information is available
on the ways in which the observational data were collected, analyzed, and
incorporated into the scale descriptors. The theoretical foundations are also shaky;
ironically, most rating scales have not been built on theoretical models of language
development, language ability, or communicative competence, contrary to what their proponents claimed. In contrast
to Ingram's claim (1984, p. 7; as cited in Brindley, 1998, p. 117) that the development of
the ASLPR drew on "psycholinguistic studies of second language development and
the intuitions of many years of experience teaching,” Brindley (1998) argues that neither
specific psycholinguistic studies nor SLA theories were taken into account in the
development process of the ASLPR. This lack of theoretical and empirical grounding is the
most serious weakness of these rating scales, and it is from this weakness that the terms
intuition-based rating scales and a priori rating scales derive.5
The way in which the difficulty hierarchy of linguistic criteria was determined in
these rating scales is also questionable (Bernhardt, 1984; Brindley, 1998; Chalhoub-
Deville, 1997; de Jong, 1988; Lantolf & Frawley, 1985; Lee & Musumeci, 1988; North,
1993; Turner & Upshur, 2002). North (1993) pointed out the problem of "allocating
'key features' to levels without a principled basis, tapping into convention and clichés
among teachers and textbook and scale writers” (p. 5). A similar criticism was echoed by
Lantolf and Frawley (1985) and de Jong (1988), who raised questions about the grounds
on which persuading or detecting emotional overtones should be considered more
advanced than problem-solving or getting some main ideas in the ACTFL Guidelines.
Empirical findings about the validity of the difficulty levels have been mixed; although
5 According to Fulcher (1996b, 2003), an intuition-based or a priori method means developing rating
scales based on experts' (e.g., experienced teachers, language testers, or language testing specialists on
examination boards) intuitive judgments about the development of language proficiency. Existing rating scales,
a teaching syllabus, or a needs analysis are often consulted in the development process.
Dandonoli and Henning (1990) found an adequate linear relationship between examinees'
ability levels and the task difficulty level as put forth in the ACTFL Guidelines, Lee and
Musumeci's (1988) study failed to identify such a difficulty hierarchy. In reviewing
Dandonoli and Henning's study, Fulcher (1996a) claimed that their arguments for the
ACTFL Guidelines could not be validated due to problematic research designs and
analytical methods.
Another criticism is that the descriptors embedded in these rating scales are
implicitly norm-referenced rather than criterion-referenced for two reasons. First, these
rating scales evaluate L2 mastery against that of well-educated native speakers
(Bachman & Savignon, 1986; Chalhoub-Deville, 1997; Fulcher, 1987, 1997; Lantolf &
Frawley, 1985). Indeed, Bachman and Savignon (1986) and Lantolf and Frawley (1985)
expressed doubts as to whether "the" native speaker can actually exist: one language can
contain a myriad of dialects, registers, and vocabularies, so identifying features that
define a single, homogeneous group of native speakers is extremely difficult and
problematic. Descriptors on most scales are also written in terms relevant to their
adjacent descriptors. The level of performance is thus gauged by quantifiers (e.g., some,
many, a few, few) and quality indicators (e.g., satisfactorily, effectively, well) so that one
level of performance cannot be interpreted without dependence on the adjacent levels
(Alderson, 1991; Luoma, 2004; Matthews, 1990; North, 1993, 1996; North & Schneider,
1998; Turner & Upshur, 2002; Underhill, 1987). This interdependence makes it even
more difficult for descriptors to function as stand-alone criteria.
The assumption of monotonicity is another weakness of these rating scales (Fulcher,
1996b; Turner & Upshur, 1996, 2002). According to Turner and Upshur (1996, 2002),
typical rating scales assume monotonicity, but, in most cases, empirical ratings are based
on multiple descriptors that are present across different levels. Mapping qualitatively
different multidimensional descriptors onto unidimensional metric scales can influence
raters' decisions and affect consistency (Turner & Upshur, 1996). Along similar
lines, Alderson (1991) and others (e.g., Matthews, 1990; Upshur & Turner, 1995) pointed
out that some rating scales include descriptors associated with abilities that are not
tapped by a test. According to Alderson, the mismatch between descriptors and content
arose during the English Language Testing Service (ELTS) Revision Project. If a test
contains only one type of text, then assessing a learner's ability to understand a wide
range of texts is meaningless.
From the perspective of assessors, Matthews (1990) discussed some problems of
the international EFL rating scales: the Royal Society of Arts Examination in the
Communicative Use of English as a Foreign Language (CUEFL); the Cambridge First
Certificate in English (FCE); the Certificate of Proficiency in English (CPE); the English
Language Testing Service (ELTS); and the International General Certificate of Secondary
Education (IGCSE). She argued that the evaluation criteria are sometimes arbitrary (in
the case of the ELTS Non-Academic part), that allocating equal weight across these evaluation
criteria may be unreasonable, and that such criteria were often not clearly defined,
leading to ambiguity.
In summary, these well-founded criticisms associated with many existing rating
scales can be attributed primarily to intuitive or a priori methods of scale development. A
lack of empirical grounding keeps scale developers and assessors from knowing which
elements should be assessed, resulting in low reliability and validity. This
problem becomes far more serious when a scale is used for diagnostic purposes. The
identification of specific assessment elements is regarded as the most important
procedure in implementing diagnostic assessment because these elements form the basis
of detailed skill profiles. These criticisms will remain until different paradigms and
approaches can be applied to the developmental process. As Brindley (1998) stated:
Rather than continuing to proliferate scales which use generalized and
empirically unsubstantiated descriptors, therefore, it would perhaps be more
profitable to draw on SLA and LT research to develop more specific empirically
derived and diagnostically oriented scales [italics added] of task performance
which are relevant to particular purposes of language use in particular contexts
and to investigate the extent to which performance on these tasks taps common
components of competence. (p. 134)
The next section will take up Brindley's call for empirically-derived scales, and a few
exemplary works will be reviewed in the context of L2 assessment.
Empirically-Based Rating Scales
Empirically-based rating scales have been proposed for language assessment in
response to criticisms of existing scales; the three best-known are the data-driven fluency
rating scale (Fulcher, 1987, 1993, 1996b, 1997), the empirically-derived, binary-choice,
boundary-definition (EBB) rating scale (Turner, 2000; Turner & Upshur, 1996, 2002;
Upshur & Turner, 1995, 1999), and the Common European Framework of Reference for
languages (CEFR) scale (North, 1996, 2000; North & Schneider, 1998). Each rating scale
provides valuable insights about that scale‟s development, although they were not
developed for L2 writing assessment per se. While Fulcher's data-driven fluency scale
demonstrates that discourse analysis can help to create scale descriptors for L2 oral
performance, Turner and Upshur's EBB scale illustrates the effectiveness of a series of
empirical yes/no criteria questions gleaned from actual performances. North's descriptor
scaling method in the CEFR also demonstrates that a combination of theoretical and
empirical approaches is useful in developing the framework of reference in which L2
performance levels are determined.
Data-driven fluency rating scale.
In order to define and measure the fluency of L2 learners, Fulcher (1987, 1993,
1996b, 1997) proposed a data-based or data-driven fluency rating scale based on
observations of oral performance. His data-based approach is built on the claim that
observed learners' performance should be quantifiable, and that the development
procedures of rating scales should reflect real linguistic performance. In contrast to a
priori methods, this data-based procedure utilized a large database of speech samples,
which were then used to create fluency rating descriptors.
Fulcher (1993, 1996b) collected 21 ELTS oral interviews with scores ranging
from 4 to 9 (average of 6). He identified eight categories using grounded theory (Glaser
& Strauss, 1967) to account for breakdowns in fluency, and counted observations of
these categories in each oral interview. These categories were (a) end-of-turn pauses, (b)
content planning hesitation, (c) grammatical planning hesitation, (d) addition of examples,
counterexamples or reasons to support a point of view, (e) expressing lexical uncertainty,
(f) grammatical and/or lexical repair, (g) expressing propositional uncertainty, and (h)
misunderstanding or breakdown in communication. A discriminant analysis was then
used to examine the extent to which the eight explanatory categories discriminated
among L2 learners, and the extent to which the awarded scores (i.e., frequencies) on the
eight explanatory categories predicted the scores awarded on the ELTS oral interview
scale. The results suggested that the eight explanatory categories could effectively discriminate
among L2 learners, and that they were consistent with the ELTS oral interview scale in
assigning L2 learners to appropriate scale scores. Fulcher (1996b) considered these
results positive because they showed that this approach attained concurrent validity
before a rating scale was constructed, thus avoiding the weaknesses of post hoc
validation methods. The fluency descriptor scale was finally constructed by scaling the
means of each category score across all ELTS oral interview levels, and by eliciting the
salient characteristics of speech samples that defined the explanatory categories.
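The counting step in Fulcher's procedure can be sketched as follows. The category labels paraphrase the eight categories listed above; the data format, function name, and sample annotations are hypothetical illustrations, not his actual materials.

```python
# A minimal sketch of tallying fluency-breakdown categories per interview,
# assuming each interview has already been hand-annotated with category
# labels. Labels and sample data are hypothetical paraphrases of
# Fulcher's (1993, 1996b) eight explanatory categories.
from collections import Counter

CATEGORIES = [
    "end_of_turn_pause", "content_planning_hesitation",
    "grammatical_planning_hesitation", "supporting_addition",
    "lexical_uncertainty", "repair", "propositional_uncertainty",
    "communication_breakdown",
]

def category_frequencies(annotations: list) -> list:
    """Return the frequency of each explanatory category for one interview."""
    counts = Counter(annotations)
    return [counts.get(c, 0) for c in CATEGORIES]

# One hypothetical annotated interview; per-interview frequency vectors
# like this would feed the discriminant analysis against ELTS scores.
events = ["repair", "end_of_turn_pause", "repair", "lexical_uncertainty"]
print(category_frequencies(events))  # → [1, 0, 0, 0, 1, 2, 0, 0]
```

Each interview's frequency vector then serves as the predictor set in the discriminant analysis described above, with the awarded ELTS band as the grouping variable.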
During the post hoc validation stage, Fulcher used a variety of statistical tests to
evaluate the five-point fluency rating scale: a G-study, ANOVA, and a Rasch partial
credit model. Applied to a new sample of students using five raters and three different
tasks (a picture description task, an interview based on reading text, and a group
discussion), the fluency scale achieved high reliability (reliability coefficient = 0.9, inter-
rater generalizability coefficient = 0.93, inter-task generalizability coefficient = 0.98).
The results from ANOVA also showed that the fluency rating scale was able to
discriminate among different learner group levels. Finally, while Fulcher failed to show
divergent validity evidence between scales, the scale calibration yielded by the Rasch
partial credit model supported the fluency rating scale's functioning as a stable
measurement instrument. These findings led Fulcher (1996b) to argue for two distinct
advantages to the data-based approach: target ability is defined in great detail so that
more accurate validation study is made possible, and the descriptors are explicit enough
to be linked to real language performance.
Empirically-derived, Binary-choice, Boundary-definition (EBB) scale.
Proposed by Turner (2000), Turner and Upshur (1996, 2002) and Upshur and
Turner (1995, 1999), empirically-derived, binary-choice, boundary-definition (EBB)
rating scales are characterized as being free of theory. EBB rating scales are not constructed
on theoretical models of language ability or learning, but on samples of real oral or
written performance (Turner & Upshur, 1996; Upshur & Turner, 1995). Instead of
generalizing to other contexts, they are usually developed within a particular context and
with a particular task and learner group in mind.
EBB scale development involves six steps (see Figure 1): (1) Individual raters
select a minimum of eight to ten samples that represent the full range of test performance,
dividing them into two categories (an upper half and a lower half); (2) As a team, the
raters discuss their decisions and reconcile disagreements, then identify the most
prominent feature distinguishing the upper half from the lower half, which is used to
develop a yes/no question (i.e., Question 1 in Figure 1); (3) Individual raters then rank
the samples in the upper half, and as a team, compare these rankings to determine the
number of scale levels that will effectively distinguish between the upper half samples;
(4) The team next creates a series of yes/no questions (i.e., Questions 2 & 3 in Figure 1)
that define subscale levels within the upper half samples; (5) The raters repeat steps 3
and 4 for the lower half samples; and (6) The raters, as a team, create descriptors of each
level so that the scores can be better understood when they are awarded. As can be seen,
EBB scales differ from traditional scales in that they describe salient differences (rather
than similarities) in the boundaries between score levels, and do not focus on a midpoint
of “normative descriptions of ideal performances” (Turner & Upshur, 1996, p. 55).
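One plausible reading of the decision tree in Figure 1 can be sketched as a short function. The yes/no arguments stand in for rater judgments on the boundary questions, and the exact branch-to-level mapping is an assumption made for illustration, not the authors' specification:

```python
def ebb_score(q1, q2=False, q3=False, q4=False, q5=False):
    """Walk a Figure-1-style EBB decision tree (branch mapping assumed).

    Each argument is a rater's yes/no answer to one boundary question;
    only the questions on the chosen branch affect the score.
    """
    if q1:                        # Question 1: sample belongs to the upper half?
        if q2:                    # Question 2: upper subdivision
            return 6 if q3 else 5  # Question 3 demarcates level 6 from level 5
        return 4
    if q4:                        # Question 4: lower subdivision
        return 3
    return 2 if q5 else 1         # Question 5 demarcates level 2 from level 1

# A sample judged "yes" on Questions 1 and 2 but "no" on Question 3:
print(ebb_score(True, True, False))  # -> 5
```

Because a rater answers at most three questions per sample, the scheme sidesteps the problem of weighing many co-occurring features at once.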
Figure 1. A general procedure for EBB scale development
Upshur and Turner (1995) argue that the simplicity and clarity with which EBB
scales distinguish boundaries eliminate the problems inherent in scales with co-occurring
characteristics, minimizing divergent interpretations of scale descriptors and
enhancing reliability. The floor or ceiling effect is also reduced because raters do not
make assumptions about ability or other features that might not be present in the
performance but use empirical data as the starting point for scale development (Upshur &
Turner, 1995). Score interpretation is particularly meaningful within an educational
context, as scale criteria incorporate what teachers and students actually do in their
classes, embodying instructional and curricular goals (Turner, 2000; Turner & Upshur,
1996; Upshur & Turner, 1995).
Despite the many advantages of EBB scales, their inability to generalize across
different contexts has been criticized (Fulcher, 2003), as has their lack of theoretical
orientation. Indeed, Brindley (1998) suggests that scale development should be built on
empirical substantiation, “supplemented by further theoretically motivated research into
generalizable dimensions of task and text complexity” (p. 135). In addition, although
EBB scales take the defining characteristics of performance samples with different
proficiency levels into consideration when mapping language performance, it is unclear
whether teacher perceptions of skill hierarchy are psychometrically accurate. Research
has shown that teachers often fail to identify a hierarchy of task difficulty on a given test
(Alderson, 1990a, 1990b; Alderson & Lukmani, 1989). If teacher perceptions of task
difficulty hierarchy are correct in Figure 1, then Question 3 must assess higher-order
language skills than Questions 1, 2, 4, and 5. On the EBB rating scale for Audio Pal
(Turner & Upshur, 1996; Upshur & Turner, 1999), for example, teachers assumed that
the ability to speak fluently and use authentic idioms (the question demarcating level 6
from level 5) required a higher-order language skill than the ability to produce a variety of
sentence structures without making many linguistic errors. This assumption has not been
rigorously investigated, however, and warrants future research. If teachers’ perceptions
do not converge with statistical results, valid inferences about students’ language ability
cannot be made.
Common European Framework of Reference for Languages (CEFR).
In Europe, the most notable recent development in language education and
assessment is surely the advent of the Common European Framework of Reference for
Languages ([CEFR], Council of Europe, 2001) (Figueras, North, Takala, Verhelst, &
Avermaet, 2005). The CEFR grew out of a concerted effort by the Council of Europe and
the Swiss National Science Research Council to develop a framework of reference
wherein communicative language performances can be scaled using a common meta-
language (North, 2000; North & Schneider, 1998). The primary stated goal of the CEFR
was “to help partners to describe the levels of proficiency required by existing standards,
tests and examinations in order to facilitate comparisons between different systems of
qualifications” (Council of Europe, 2001, p. 21).
The pilot CEFR development projects took place in 1994-1995 in two phases.6
The first phase focused primarily on speaking and interaction competence in English, and
the second extended to cover noninteractive listening and reading competence in German
and French. Three steps were taken in each phase: (a) a comprehensive pool of
descriptors was created, (b) descriptors were qualitatively validated by consulting teacher
workshops, and (c) descriptors were scaled using teacher assessment and the many-
faceted Rasch model.
CEFR development began with a comprehensive survey of existing scales
describing language proficiency (see North, 1994). In the 1994 study, twenty-seven
scales describing speaking interaction and/or global proficiency were identified,
reviewed, edited into a single sentence form, and used to create a descriptor pool. After
eliminating descriptors that were negatively worded, repetitive, or norm-referenced,
approximately 1,000 descriptors remained usable. A workshop was held in which
100 teachers were divided into small groups to judge the quality of descriptors and assess
students’ oral performance on video clips. Assessment involved two techniques:
Thurstone’s (1959, as cited in Pollitt & Murray, 1996) law of comparative judgment and
Smith and Kendall’s (1963) method. Teachers were shown video clips in which a pair of
learners spoke to each other, and were asked to select the better performance and justify
their decision. This was done to elicit teachers’ meta-language and to ensure the
descriptor pool was comprehensive enough to capture all instances of language
performance. In Smith and Kendall’s (1963) method, pairs of teachers were asked to sort
a set of 60-90 descriptors into three or four categories of language ability, and then to
indicate descriptors that were particularly useful or clear.
6 Detailed methodological accounts are well documented in North (1996, 2000) and North and Schneider
(1998).
Approximately 400 descriptors were identified using this qualitative validation procedure.
These 400 descriptors were divided into seven primary questionnaires with
proficiency levels ranging from beginner to advanced. Each primary questionnaire
consisted of 50 descriptors, and was connected to the other questionnaires using 10-15
common descriptors. Mini-questionnaires were also created that linked teachers to each
other and to the primary questionnaires that they would not use to rate their own students.
Each mini-questionnaire consisted of a small number of descriptors selected from the
same level of primary questionnaire. A five-point Likert scale was attached to all
descriptors on both the primary and mini-questionnaires. One hundred participating
teachers rated ten students in their own classes (five each from two different classes)
using some of the primary questionnaires. A one-day rating conference was held three
weeks later in which all teachers used the mini-questionnaires to rate pairs of students on
11 video clips. The rating severity of each of the teachers could thus be estimated, and a
common scale constructed.
Collected ratings were entered into the FACET analysis, in which the stability of
linking descriptors was examined and misfitting descriptors were detected.7 Descriptors
that did not fit the model were generally related to sociocultural competence, work (e.g.,
telephoning, meeting, and formal presentations), negation, and pronunciation.8 After
these were eliminated, the remaining descriptors were calibrated on a common rating
scale. Ten cut-offs were set between scale levels and then merged into six to match
the levels that had been set for the CEFR.
The 1995 study examined whether the oral interaction scale for English
could be replicated for other language skills and languages. A similar procedure
was undertaken to construct listening and reading scales for French, German, and English.
After reviewing and editing a pool of descriptors, approximately 1,000 were found to be
usable. Workshops were held in which 192 teachers (81 French, 65 German and 46
English) evaluated the quality of the descriptors and rated students’ performance. Four
7 Numerous technical problems occurred during these analyses; these are well documented in North (1995).
8 According to North and Schneider (1998), the inability to calibrate socio-cultural competence suggests
that the scale is limited to measuring language ability rather than communicative competence. They note
that this result is consistent with the findings of Bachman and Palmer (1982), in which socio-linguistic
competence was distinguished from pragmatic and grammatical competence.
questionnaires were constructed from which 61 descriptors were linked to the 1994
English scale. When the FACET analysis was run on the rating datasets from both years,
the reading descriptors did not fit the model characterized by speaking and interaction,
and were thus calibrated separately. The listening descriptors were used to link this
separate reading scale to the other scales (i.e., speaking and interaction), so that reading
and listening descriptors could be analyzed together as a single set. The difficulty levels
between the two scales were adjusted, and the logit values of the listening and speaking
descriptors were highly correlated (r = 0.99) and linearly equated.
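The final linking step can be illustrated with a toy computation; the logit values below are invented for illustration, not the actual 1994-1995 calibrations. Once a high correlation confirms a near-linear relationship between anchor calibrations on two scales, a mean-sigma line maps one scale's logits onto the other's metric:

```python
import statistics

# Hypothetical logit values for five anchor descriptors calibrated on
# two separate scales (values invented for illustration).
scale_a = [-2.1, -0.8, 0.0, 1.1, 2.3]
scale_b = [-1.8, -0.5, 0.2, 1.4, 2.6]

mean_a, mean_b = statistics.mean(scale_a), statistics.mean(scale_b)
sd_a, sd_b = statistics.stdev(scale_a), statistics.stdev(scale_b)

# Pearson correlation between the two sets of anchor calibrations.
n = len(scale_a)
r = sum((a - mean_a) * (b - mean_b)
        for a, b in zip(scale_a, scale_b)) / ((n - 1) * sd_a * sd_b)

# Mean-sigma equating: match the means and standard deviations.
slope = sd_b / sd_a
intercept = mean_b - slope * mean_a

def equate(logit_a):
    """Map a scale-A logit onto the scale-B metric."""
    return slope * logit_a + intercept

# r is close to 1 for these nearly collinear invented values.
print(equate(0.0))
```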
Based upon the results of the two pilot studies, the consistency between the two
scales was found to be satisfactory even though the linguistic backgrounds of the two
groups of teachers were different and the content and difficulty range of the two types of
questionnaires were different (North & Schneider, 1998). North and Schneider (1998)
also determined that scale difficulty was stable because the descriptors on similar issues
clustered adjacently onto the scale even though they were drawn from different
questionnaires. This suggests that consistent scales can be constructed in a principled
way using comprehensive surveys of existing scales, theoretical reviews and a priori
validation of descriptors, descriptor scaling based on a measurement model, and
replications of the scale (North & Schneider, 1998).
Theoretically-based and empirically-developed diagnostic rating scale.
In a recent study on L2 writing assessment, Knoch (2007) developed “a
theoretically-based and empirically-developed rating scale” for an L2 diagnostic writing
test and evaluated its diagnostic function. In the first part of her two-phase study, she
examined the existing literature to identify objective discourse measures that were
believed to best discriminate between writing samples at different proficiency levels.
These measures were then pilot-tested on 15 writing samples, and their discriminant
functions were determined based upon descriptive statistics (e.g., means and standard
deviations). In order to confirm that the measures that survived the pilot test had
sufficient discriminant function, 601 writing samples were evaluated and screened based
on their descriptive statistics (i.e., histograms, box-plots, and means) and the ANOVA
results. The resulting refined objective measures were finally used to construct a
diagnostic L2 writing rating scale assessing accuracy, fluency, complexity, mechanics,
coherence, cohesion, reader/writer interaction, and content. In the validation stage of the
study, 10 raters assessed 100 writing samples using the rating scale, and the quality of the
rating scale was evaluated using several statistics of the Rasch model: rater separation,
reliability and fit statistics, and scale step calibration. Raters‟ reactions to the scale were
also collected via questionnaires and interviews. After receiving satisfactory statistical
results and positive comments from raters, Knoch concluded that the theoretically-based
and empirically-developed rating scale was useful for an L2 diagnostic writing test.
It is noteworthy that Knoch (2007) attempted to develop a diagnostic L2 writing
scale based upon a theoretical model of L2 communicative competence and an empirical
evaluation of such theory-based models; however, the study had several limitations
related to the development of the rating scale. For example, objective measures were
selected based on the results of descriptive statistics from 15 writing samples. Knoch
explained that the small number of samples prevented the use of inferential statistics,
thus the results might not be decisive; however, she could have used a larger sample and
inferential statistics in the pilot study in order to ensure that appropriate measures would
be applied to scale construction from the beginning. There should also have been a better
explanation of the standard setting procedure. The way in which Knoch selected levels
for the rating scale seems arbitrary and impressionistic, with little evidentiary support.
For example, determining the level of fluency in an essay by counting the number of
self-corrections does not take the effect of essay length into account. Finally, it is also doubtful
whether it is even reasonable for human raters to assess writing samples using objective
measures. As the ever-growing body of literature on automated essay scoring shows,
machine raters might do so more efficiently.
Summary
Rating scales vary according to a test’s purpose, audience, scoring methods, and
theoretical and empirical underpinnings. Acknowledging the problems associated with
intuitive or a priori methods in most scales, researchers turned their attention to empirical
methods. Of particular interest were Fulcher’s data-driven fluency rating scale, Upshur
and Turner’s EBB rating scale, and the CEFR scale. Unlike committee-based or
authority-based scales, these three are noted for their attempt to incorporate real language
performance into rating scale development.
This literature review suggests that assessment techniques built on empirical
sources are promising in that they substantiate the construct to be measured and draw on
concrete rationales and evidence. Empirical assessments create a dialogue among
stakeholders who might attach different philosophies, values, meanings, or purposes to
assessment. In that dialogue, assessment users play an active role as generators of
assessment criteria and interpreters of assessment outcomes, and are not passive listeners.
The nature of context-embeddedness also significantly enhances communication,
highlighting that no assessment can take place in isolation from its context and users.
These features are particularly relevant to the underlying concepts of diagnostic
assessment; in a diagnostic assessment framework, an ongoing dialogue with assessment
users can help to create a consensus about the elements to be evaluated, and can help to
keep them better informed about their particular strengths and weaknesses.
A unified assessment framework could therefore integrate the empirical
approach and diagnostic assessment; evaluation criteria would be identified from real
language performance and confirmed by theoretical accounts, and would then be used to
build a diagnostic assessment model. In that assessment model, each criterion would
represent a single evaluation element. Raters could then concentrate on one element at a
time, without the distraction of having to consider many evaluation criteria
simultaneously. Such a model could be created using an assessment scheme called an
empirically-derived descriptor-based diagnostic (EDD) checklist. The EDD checklist has
the potential to maximize the diagnostic benefit of assessment for various users. In order
to operationalize the model, however, a full understanding of diagnostic assessment is
necessary. The next section will discuss ways in which diagnostic assessment is
approached and implemented.
Approaches to Diagnostic Assessment
Diagnostic assessment is a subject of increasing interest in the language
assessment community, as researchers, recognizing the limitations of proficiency tests,
have turned their attention to assessments that contribute to instruction and curriculum
improvement (Alderson, 2005, 2007; Jang, 2005; Shohamy, 1992; Spolsky, 1992).
Kunnan and Jang (2009) note the characteristics of diagnostic assessment as follows:
The main vision in using diagnostic assessment in large-scale and classroom
assessment contexts is to help assess students’ abilities and understanding with
feedback not only about what students know, but about how they think and learn
in content domains, to help teachers have resources of a variety of research-
based classroom assessment tools, to help recognize and support students’
strengths and create more optimal learning environments, and to help students
become critical evaluators of their own learning (Pellegrino, Chudowsky, &
Glaser, 2001).
Shohamy (1992) also proposed an integrative diagnostic feedback testing model
describing how diagnosis components should be processed, operationalized, and applied.
In this model, she emphasized that the goal of tests should be improved teaching and
learning. The idea of diagnostic language assessment has also manifested in the
European-funded DIALANG project. This large-scale project operationalized and
validated the idea of diagnostic assessment in 14 European languages and five language
skill domains (reading, listening, writing, grammar, and vocabulary) through a computer-
based test tool.
Although it is a relatively new concept, cognitive diagnostic assessment (CDA)
has also been the cause of significant advancements in diagnostic language assessment.
CDA formatively assesses fine-grained knowledge processes and structures in a test
domain in order to provide detailed information about students’ understanding of the test
materials (Nichols, 1994; Nichols, Chipman, & Brennan, 1995). This is fundamentally
different from summative assessment, which focuses on placing students onto a
unidimensional continuous scale (DiBello & Stout, 2007; Nichols, 1994; Snow &
Lohman, 1989). CDA assumes that the latent ability space is composed of a set of
knowledge states, skills, or attributes, and places students into a multidimensional space
representing multiple skill parameters. Each student’s probability of having mastered each
skill is then calculated, and student skill profiles are constructed.
Although only a few studies have explored the potential applications of
psychometric CDA models in language assessment, these provide valuable insight into
how a CDA framework could be incorporated. Buck and Tatsuoka (1998) and Kasai
(1997) applied the Rule-Space Model to L2 listening and reading tests, respectively, and
Jang (2005, 2009a, 2009b) applied the Fusion Model to examine the effectiveness of the
skills diagnostic approach to L2 reading on teaching and learning.9 From a slightly
different perspective, Sawaki, Kim, and Gentile (2009) used the Fusion Model to
accurately identify skill coding categories in L2 listening and reading tests.
Successful implementation of CDA requires a series of carefully-designed
substantive and statistical assessment processes. The selection of an appropriate
psychometric CDA model suited for that particular assessment purpose is also a
prerequisite. The next sections will discuss a series of steps involving CDA
implementation and a variety of psychometric CDA models. Of the many CDA models,
the Reduced Reparameterized Unified Model ([Reduced RUM], Hartz, Roussos, & Stout,
2002) will be discussed in-depth because it is the guiding psychometric diagnostic
assessment model used in this study. The Reduced RUM was chosen because it has been
the most extensively investigated model to date (Roussos et al., 2007a). Although it
might be possible to model students‟ ESL academic writing performance using other
conjunctive or compensatory CDA models, their stability has yet to be rigorously
examined. Ways in which the CDA framework has been empirically used in language
assessment in order to estimate student language proficiency will also be discussed.
Implementation of Diagnostic Assessment
DiBello and Stout (2007) consider CDA modeling an engineering science
because it requires cross-disciplinary collaboration, blending insights gained from
psychometrics, cognitive science, and curricular and instructional theories and practices.
It is an iterative and cyclic procedure, consisting of multiple steps (DiBello, Roussos, &
Stout, 2007).10
The CDA modeling process begins with a clear statement of the
assessment purpose, which will determine whether the targeted skill space will be
modeled unidimensionally or multidimensionally and whether student ability parameters
will be classified discretely (mastery/non-mastery) or scaled continuously. Once the
assessment purpose has been defined, the skills to be measured are specified in one of
two ways: if they are to be retrofitted to existing data in order to provide students with
fine-grained diagnostic feedback, they will be identified through substantive content
9 The Reparameterized Unified Model was formerly known as the Fusion Model.
10 For a more detailed description of diagnostic assessment implementation, see DiBello et al. (2007).
analysis. If, on the other hand, a new diagnostic test is to be developed, the targeted skills
will be aligned with the test’s specific purpose and with the theories associated with the
test’s content domain. Care must be taken when determining the granularity of skills, so that
a similar grain size is assigned to each skill.
Test items are then assigned to the target skills or developed based on the
number and kind of skills to be measured, the relationship between skills, and their
difficulty level. The skills-by-items relationship can be conceptualized using an incidence
matrix, known as a Q-matrix (Tatsuoka, 1983). In the Q-matrix, an entry of 1 indicates
that a given test item measures a particular skill, while a 0 indicates that it does not.
The construction of the Q-matrix requires both theoretical
consideration of the test domain and empirical statistical results because the quality of
the Q-matrix determines the quality of the estimated diagnostic model. A poorly created
Q-matrix provides less informative diagnostic or classification indices.
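As a concrete illustration, a Q-matrix for a hypothetical five-item, three-skill test might be encoded as follows (the items and skill assignments are invented):

```python
# Rows = items, columns = skills; an entry of 1 means the item
# measures that skill. Items and skill assignments are hypothetical.
Q = [
    [1, 0, 0],  # item 1 measures skill 1 only
    [1, 1, 0],  # item 2 measures skills 1 and 2
    [0, 1, 0],  # item 3 measures skill 2 only
    [0, 1, 1],  # item 4 measures skills 2 and 3
    [1, 0, 1],  # item 5 measures skills 1 and 3
]

def skills_required(q_matrix, item):
    """Return the (1-indexed) skills that a given item measures."""
    return [k + 1 for k, q in enumerate(q_matrix[item - 1]) if q == 1]

print(skills_required(Q, 4))  # -> [2, 3]
```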
Once the Q-matrix is completed, it should be determined whether the
relationship among skills for a given item is conjunctive or compensatory. Conjunctive
interaction assumes that the successful completion of an item requires all the necessary
skills and that lack of competence on any one skill will result in failure on the item.
Conversely, compensatory interaction assumes that lack of competence on one skill is
compensated for by the mastery of others. If an appropriate diagnostic model has been
selected considering the skill relationship, it is calibrated and evaluated. Simple models
(involving a small number of skills per item) are preferable because they improve
parameter identification and model calibration and evaluation (DiBello et al., 2007).
Diagnostic results are then yielded, primarily focusing on the diagnostic function of a test
or item as well as the skill profiles of individual students. A user-friendly diagnostic
report is finally constructed and issued to students, teachers, and parents.
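The conjunctive and compensatory interaction types described above can be contrasted in a deterministic sketch, leaving aside the slip and guess parameters that full models such as DINA and DINO add:

```python
def conjunctive_success(mastery, required):
    """DINA-style ideal response: all required skills must be mastered."""
    return all(mastery[k] for k in required)

def compensatory_success(mastery, required):
    """DINO-style ideal response: mastering any one required skill suffices."""
    return any(mastery[k] for k in required)

# A student who has mastered skill 0 but not skill 1, on an item
# requiring both skills (skill indices are hypothetical):
mastery = {0: True, 1: False}
print(conjunctive_success(mastery, [0, 1]))   # False: skill 1 is missing
print(compensatory_success(mastery, [0, 1]))  # True: skill 0 compensates
```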
Psychometric Diagnostic Assessment Models
Recent advancements in psychometric CDA models have led to their
proliferation, further underscoring the educational drive toward diagnostic assessment.
Although Fischer‟s (1973, 1983) Linear Logistic Test Model (LLTM) failed to model the
ability parameter onto multidimensional space, it is considered the cornerstone of
multidimensionality-based diagnostic assessment models in that it represented the skills-
by-items relationship on an incidence matrix. Tatsuoka‟s (1983, 1990, 1993, 1995) rule-
space model is another groundbreaking work that has operationalized knowledge states
based on item response patterns. Many other psychometric models have been generated
to represent students’ knowledge structures and to determine their mastery standing on
each skill.
Of the many variables that define the various psychometric CDA models, the
skill mastery scale is one determining factor. When the student knowledge structure is
represented as either mastery or non-mastery, latent class models such as Deterministic-
Input, Noise-And ([DINA], Haertel, 1989), Deterministic-Input, Noise-Or ([DINO],
Templin & Henson, 2006), Noise-Input, Deterministic-And ([NIDA], Junker & Sijtsma,
2001), or Reparameterized Unified Model ([RUM], DiBello, Stout, & Roussos, 1995;
Hartz, 2002; Hartz et al., 2002) are appropriate. On the other hand, if examinees are to be
scaled onto a continuous ability continuum, latent trait models such as the compensatory
multidimensional IRT model ([MIRT-C], Reckase & McKinley, 1991) and the
noncompensatory multidimensional IRT model ([MIRT-NC], Sympson, 1977) can better
structure the knowledge state. The ways in which skills interact with each other in an
item also characterize the nature of models. Conjunctive diagnostic models (e.g., DINA,
NIDA, RUM, MIRT-NC) require all necessary skills to be utilized to get an item correct,
while compensatory diagnostic models (e.g., DINO, MIRT-C, RUM) allow
compensation for low competence in one skill with high competence in others. The
completeness of the Q-matrix can also distinguish one psychometric model from another.
Some diagnostic approaches take mastery of non-Q skills into consideration (e.g., RUM),
while others do not.
The Reduced Reparameterized Unified Model (Reduced RUM) is a latent class
conjunctive model because it assumes that students’ latent ability space can be
dichotomized into mastery and non-mastery and that students must master all required
skills to get an item correct (Roussos et al., 2007b). In a Q-matrix representation, items
$i = 1, \ldots, I$ are associated with skills $k = 1, \ldots, K$, with $q_{ik} = 1$
indicating that skill k is required by item i and $q_{ik} = 0$ indicating that it is not.
Examinees’ ability parameters are thus modeled as
$\alpha_{jk} = \begin{cases} 1 & \text{if examinee } j \text{ has mastered skill } k \\ 0 & \text{otherwise} \end{cases}$
In the Reduced RUM, the probability of a correct response is modeled as

$P(X_{ij} = 1 \mid \alpha_j) = \pi_i^{*} \prod_{k=1}^{K} (r_{ik}^{*})^{\,q_{ik}(1 - \alpha_{jk})}$
The parameter $\pi_i^{*}$ is the probability of correctly applying all of the Q-specified
skills to solve item i, assuming that a student has mastered all of these skills. It can be
understood as item difficulty, and values of $\pi_i^{*}$ less than 0.6 suggest that items assigned
to the skills are too difficult (Roussos et al., 2007b). The parameter $r_{ik}^{*}$ is the ratio of
the probability of a correct response for non-masters of skill k to that for masters. It is
analogous to an inverse indicator of how well an item discriminates on Q-specified skills.
Values of $r_{ik}^{*}$ less than 0.5 indicate that an item has strong discriminant power, whereas
values greater than 0.9 suggest that the item does not discriminate for skill k. When items are
found not to have strong discriminant power, the corresponding “1” entries in the Q-matrix
should be eliminated (Roussos et al., 2007b).
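The item response function can be computed directly from these definitions; the parameter values below are invented solely for illustration:

```python
def reduced_rum_prob(pi_star, r_star, q_row, alpha):
    """P(X_ij = 1 | alpha_j) = pi*_i * prod_k r*_ik ** (q_ik * (1 - alpha_jk))."""
    p = pi_star
    for r_ik, q_ik, a_jk in zip(r_star, q_row, alpha):
        p *= r_ik ** (q_ik * (1 - a_jk))  # penalty applies only to required, unmastered skills
    return p

# Hypothetical item requiring skills 1 and 3 (of three skills):
pi_star = 0.9               # difficulty: P(correct) for full masters
r_star = [0.4, 0.7, 0.3]    # discrimination penalties, one per skill
q_row = [1, 0, 1]
full_master = [1, 1, 1]
partial = [1, 0, 0]         # has skill 1, lacks skill 3

print(reduced_rum_prob(pi_star, r_star, q_row, full_master))  # 0.9
print(reduced_rum_prob(pi_star, r_star, q_row, partial))      # 0.9 * 0.3
```

Note how mastery of an unrequired skill (skill 2 here) never affects the probability, which is exactly what the $q_{ik}$ exponent encodes.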
Arpeggio (DiBello & Stout, 2008) is estimation software for the Reduced RUM
that employs a Markov Chain Monte Carlo (MCMC) algorithm within a Bayesian
modeling framework. MCMC convergence can be examined by visually inspecting chain
plots, distributions of the estimated posteriors, and autocorrelations of the chain estimates,
and by computing the Gelman and Rubin R̂ across multiple chains (Roussos et al., 2007b). If chains or
posteriors are stably distributed or if autocorrelations are low after the burn-in phase,
convergence has occurred. When multiple chains are employed, R̂ values less than 1.2
are also indicative of convergence (for more details, see Gelman, Carlin, Stern, & Rubin,
1995; Gelman & Rubin, 1992).
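A minimal sketch of the Gelman and Rubin R̂ computation, omitting the refinements of the full published estimator and any Arpeggio-specific details (the chain values are invented):

```python
import statistics

def gelman_rubin(chains):
    """Compute R-hat from several equal-length MCMC chains of one parameter."""
    m = len(chains)            # number of chains
    n = len(chains[0])         # draws per chain
    chain_means = [statistics.mean(c) for c in chains]
    grand_mean = statistics.mean(chain_means)
    # Between-chain variance B and mean within-chain variance W.
    b = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    w = statistics.mean(statistics.variance(c) for c in chains)
    # Pooled posterior variance estimate, then the potential scale reduction factor.
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5

# Two well-mixed hypothetical chains give an R-hat below the 1.2 threshold.
chains = [[0.1, 0.2, 0.15, 0.18, 0.12], [0.14, 0.19, 0.11, 0.16, 0.2]]
print(gelman_rubin(chains))
```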
After convergence has been achieved, parameter estimates are evaluated in order
to enhance statistical power. The estimates for the $\pi_i^{*}$ and $r_{ik}^{*}$ parameters are the
critical factors determining the diagnostic capacity of test items, and should be carefully
examined. If they do not contribute useful diagnostic information to the item response
function in relation to Q-specified skills, the elimination of Q entries is considered in a
stepwise manner. Dropping non-influential item parameters from the Q-matrix should be
undertaken carefully based on both substantive and statistical grounds (Roussos et al.,
2007b). Once the parameter estimates are evaluated, model fit needs to be examined in
various ways. If the model fit is satisfactory, the diagnostic quality of the model is
examined and a skill mastery profile is constructed using the item and student statistics
generated by Arpeggio.
Applications of Diagnostic Assessment Models to L2 Assessment
Only a handful of studies have explored potential applications of psychometric CDA models in L2 assessment and testing, possibly because such models are relatively new and consequently less explored. Buck and Tatsuoka (1998) were the first to apply CDA models to L2 language assessment. Utilizing Tatsuoka's (1983, 1990, 1993, 1995)
revolutionary work on rule-space methodology, they identified the cognitive and
linguistic attributes underlying an L2 listening comprehension test and classified
examinees into specific knowledge states. The rule-space methodology deconstructs
items that assess a targeted ability into several attributes or skills representing the
underlying knowledge structure, and estimates the probability that each examinee has
mastered each attribute based on correct or incorrect response patterns. Buck and
Tatsuoka analyzed the responses of 412 Japanese students on 35 dichotomously-scored
L2 listening comprehension test items, and identified 71 attributes representing the L2
listening construct. Using visual inspection, correlations with item difficulty, and
multiple regression, they reduced the number of attribute candidates to 17. An incidence
Q-matrix was then constructed using these 17 attributes and analyzed using the rule-
space procedure. Fourteen interactions among attributes were identified, and a total of 31
attributes (17 prime attributes and 14 interactions) classified 91% of examinees into
specific knowledge states. The prime attributes set was modified to fully explain the
response patterns of the remaining 9%, resulting in the reduction of the number of prime
attributes. In the second run of the rule-space procedure, 15 prime attributes and 14
interactions classified 96% of the examinees into specific knowledge states.
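The classification step Buck and Tatsuoka describe can be illustrated with a deliberately tiny sketch. The Q-matrix, attribute labels, and deterministic nearest-pattern rule below are hypothetical simplifications: real rule-space analysis works in an IRT-based residual space and treats slips probabilistically, rather than by simple Hamming distance.

```python
from itertools import product

# Toy conjunctive Q-matrix: each item requires the attributes marked "1",
# and an examinee is assigned the knowledge state whose ideal response
# pattern lies closest (Hamming distance) to the observed responses.
Q = [(1, 0),  # item 1 requires attribute A only
     (0, 1),  # item 2 requires attribute B only
     (1, 1)]  # item 3 requires both attributes

def ideal_pattern(state):
    """Expected item responses if an examinee masters exactly `state`."""
    return tuple(int(all(s >= q for s, q in zip(state, item))) for item in Q)

def classify(responses):
    """Assign the knowledge state whose ideal pattern best matches."""
    states = list(product([0, 1], repeat=2))
    return min(states, key=lambda st: sum(a != b for a, b in
                                          zip(ideal_pattern(st), responses)))

print(classify((1, 0, 0)))  # solved item 1 only → state (1, 0): mastered A
```

In the actual study, 31 attributes (prime attributes plus interactions) played the role of the two toy attributes here, which conveys why communicating such results to teachers and students is non-trivial.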
Although the rule-space methodology successfully classified examinees with
different ability levels into appropriate knowledge states, it had several limitations. Most
significantly, Buck and Tatsuoka noted that the use of multiple regression could cause
useful attributes to be rejected. They also pointed out that their attributes set did not
include variables related to vocabulary or syntactic complexity, and expressed
reservations about the extent to which the identified attributes could be generalized to
other L2 listening tests. Finally, even though they believed the results from the rule-space
methodology could be used to develop a diagnostic report, they called for further
research into the ways in which complicated attribute-based rule-space results could be
easily communicated to teachers and students.
Jang's (2005, 2009a, 2009b) study on L2 reading is the most comprehensive and
thorough example of how a series of CDA techniques can be utilized. Two forms of the
reading subtest in the LanguEdge English Language Learning Assessment were used to
examine the effects of applying a diagnostic assessment approach to a large-scale L2 reading comprehension test on teaching and learning practices. A three-phase study was designed
involving multiple data sources and procedures. In the first phase, substantive and
statistical analyses were used to identify the knowledge structure of the reading test; 12
ESL students provided verbal reports describing their reading processes and strategies,
and nonparametric latent dimensionality analyses utilizing CCPROX/HCA (Roussos,
Stout, & Marden, 1998), DIMTEST (Stout, Froelich, & Gao, 2001), and DETECT
(Zhang & Stout, 1999) evaluated the proposed skills-by-items dimensional structure.
Nine reading subskills were substantively and statistically identifiable, and the resultant skill set was used to construct the Q-matrix.11
Jang (2005, 2009a, 2009b) then used the Reduced RUM (Hartz et al., 2002) to
evaluate the quality of skill profiles, with special attention to model calibration, skill
homogeneity, and performance differences between masters and non-masters. Six and
seven entries of the Q-matrix had to be eliminated from the two forms of the reading
subtest, respectively, due to low item discrimination power for an assigned skill, and
approximately 20% of the items were determined to be diagnostically less informative.
In an attempt to examine the application of the skill diagnostic approach in a real L2
11 The nine reading subskills include (a) deducing word meaning from context, (b) determining word
meaning out of context, (c) recognizing syntactic elements/discourse markers and integrating syntactic and
semantic links, (d) processing explicit information, (e) paraphrasing implicit information, (f) processing
negative statements, (g) inferential comprehension process, (h) summarizing major ideas, and (i) mapping
contrasting ideas into a conceptual framework.
reading instruction context, ESL teachers and students were interviewed and surveyed about its usefulness for and effect on their teaching and learning practices. Score
differences among 27 students enrolled in two TOEFL preparation courses were also
examined in pre-instruction and post-instruction settings. The results showed that some
of the students' reading subskills improved after instruction, and that both teachers and
students viewed the diagnostic approach positively. Jang proposes that skills diagnostic
assessment can have a positive effect on both teaching and learning, and suggests that
when a diagnostic test is aligned with learning and cognition theories, it will contribute to
more meaningful diagnostic assessment.
Lee and Sawaki (2009a) examined the comparability of the General Diagnostic
Model ([GDM], von Davier, 2005), the Fusion Model ([FM], Hartz et al., 2002), and
Latent Class Analysis ([LCA], Yamamoto & Gitomer, 1993) on the reading and listening
subtests of the two field test forms of the TOEFL iBT. The two test forms consisted of 39
and 40 reading items, respectively, and 34 listening items. Two groups of TOEFL test-takers (2,720 and 419, respectively) each took one of the two forms, and a small subsample (374 test-takers) took both test forms. Four reading and four
listening skill categories were developed, and the items were coded and entered into a Q-
matrix.12
The results indicated that all three models appropriately classified test-takers
into a mastery or non-mastery state for most reading and listening skills, and that a
moderate degree of across-form consistency was achieved for most reading and listening
skills. When the skill profiles were examined across the three models, a great number of
test-takers were classified into flat profiles: “1111” (mastered all) and “0000” (mastered
none). Lee and Sawaki (2009a) speculated that the inability to identify the
multidimensional structure of the TOEFL reading and listening subtests might be because
the test was developed on a single latent continuum, thus providing some support for the
test‟s unidimensionality. However, the level of granularity might have been inappropriate,
requiring validation from empirical evidence such as students' think-aloud verbal reports
(Sawaki, Kim, & Gentile, 2009). The Q-matrix also needs to be validated because
12 The reading skills include (a) understanding word meaning, (b) understanding specific information, (c)
connecting information, and (d) synthesizing and organizing information. The listening skills were (e)
understanding general information, (f) understanding specific information, (g) understanding text structure
and speaker intention, and (h) connecting ideas.
poorly-defined skills-by-items relationships often result in flat profiles.
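As a quick illustration of the flat-profile diagnostic described above, one can compute the proportion of estimated mastery profiles that are all-masters or all-non-masters; the four-skill profile strings below are hypothetical:

```python
# Count how many estimated skill mastery profiles are "flat": the
# all-masters ("1111") or all-non-masters ("0000") patterns that signal
# the model may not be separating the skills.

def flat_profile_rate(profiles):
    """Proportion of profiles that are all '1's or all '0's."""
    k = len(profiles[0])
    flat = {"0" * k, "1" * k}
    return sum(p in flat for p in profiles) / len(profiles)

profiles = ["1111", "0000", "1010", "1111", "0110"]
print(flat_profile_rate(profiles))  # → 0.6
```

A high rate, as Lee and Sawaki observed, suggests either genuine unidimensionality or a poorly specified skills-by-items structure.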
Limitations of Diagnostic Assessment Models
Although parametrically complex CDA models have made considerable progress,
it has been argued that such models must be substantively validated using internal and
external criteria in a real assessment context (DiBello et al., 2007). Most applications tend to retrofit skills to pre-existing proficiency tests, so how such models would perform on a carefully designed diagnostic test remains unknown (DiBello et al., 2007; Jang, 2009a, 2009b; Lee & Sawaki, 2009b). The lack of guidelines by which to identify skills and systematically construct a Q-matrix is also problematic (Lee & Sawaki, 2009b). Theoretical and
empirical principles are needed to support well-defined skills-by-items representations.
Another challenge is finding an efficient means of communicating diagnostic results to
students, teachers, and other stakeholders (DiBello et al., 2007). CDA models compute a
large number of parameters, but little research (except for Jang's [2005, 2009a] DiagnOsis) has been conducted to develop an effective score reporting method.
Numerically overwhelming score reporting procedures threaten both usefulness and
practicality, and ultimately prevent easy communication with stakeholders. A final
limitation is the availability of computer software (Lee & Sawaki, 2009b). Technical
developments are still in their early stages, and promising in-house software has rarely
been made commercially available. Wider accessibility is needed so that model calibration and evaluation can be validated in diverse research areas.
CHAPTER 3
METHODOLOGY
Research Questions
The central research questions were formulated based upon the argument-based
approach to validity, as follows:
1) What empirically-derived diagnostic descriptors are relevant to the construct
of ESL academic writing?
2) How generalizable are the scores derived from the EDD checklist across
different teachers and essay prompts?
3) How is performance on the EDD checklist related to performance on other
measures of ESL academic writing?
4) What are the characteristics of the diagnostic ESL academic writing skill
profiles generated by the EDD checklist?
5) To what extent does the EDD checklist help teachers make appropriate
diagnostic decisions and have the potential to positively impact teaching and
learning ESL academic writing?
Research Design Overview
This is a two-phase study. Phase 1 concerns the development of the EDD
checklist, while Phase 2 pilots, models, and evaluates the EDD checklist. Given the
complex nature of argument-based validation inquiry, this study followed a mixed
methods research design. A mixed methods approach strives for knowledge claims grounded in pragmatism and incorporates quantitative and qualitative research methods
and techniques, either simultaneously or sequentially, into a single study (Creswell,
2003). The use of multiple methods has the potential to reduce biases and limitations
inherent in a single method while strengthening the validity of inquiry (Greene, Caracelli,
& Graham, 1989). A series of validity arguments and assumptions determined the types
of data to be collected, which were then analyzed and synthesized using both quantitative
and qualitative methods. Of many mixed methods designs, an expansion design (see
Greene et al., 1989 for a review of mixed methods evaluation designs) was particularly
well suited to this study because it offered a comprehensive understanding of the EDD
assessment, examining diverse aspects of the validity claims. A complementarity design
was also pertinent because it investigated overlapping but different aspects of the EDD
score-based interpretations and uses that different methods might have elicited. Equal weight was given to the quantitative and qualitative methods. Table 2 contains a
summary of the research questions, participants, instruments/data, and
procedures/analyses over the two phases.
Table 2
Research Design Summary

Phase 1
Research question 1: What empirically-derived diagnostic descriptors are relevant to the construct of ESL academic writing?
Participants: 9 ESL teachers; 4 ESL academic writing experts
Instruments/Data: 10 TOEFL essays (5 proficiency levels × 2 prompts); a think-aloud verbal protocol; EDD descriptors
Procedures/Analyses: Nine ESL teachers thought aloud while assessing and providing diagnostic feedback on 10 TOEFL essays. Four ESL academic writing experts then reviewed and sorted the EDD descriptors that would constitute the checklist.

Phase 2
Research question 2: How generalizable are the scores derived from the EDD checklist across different teachers and essay prompts?
Participants: 7 ESL teachers
Instruments/Data: 80 TOEFL essays (40 essays × 2 prompts); EDD checklist; teacher questionnaire I; interview protocol
Procedures/Analyses: In the pilot study, seven ESL teachers assessed 80 TOEFL essays using the EDD checklist. They were then asked to complete a questionnaire and were interviewed about the use of the checklist. The preliminary analysis was conducted using FACETS to examine score generalizability.

Research question 3: How is performance on the EDD checklist related to performance on other measures of ESL academic writing?
Participants: 7 (and 10) ESL teachers
Instruments/Data: Scores awarded using the EDD checklist on 80 (and 480) TOEFL essays; scores awarded by ETS raters on the same essays
Procedures/Analyses: A correlation analysis was conducted.

Research question 4: What are the characteristics of the diagnostic ESL academic writing skill profiles generated by the EDD checklist?
Participants: 10 ESL teachers
Instruments/Data: 480 TOEFL essays (240 essays × 2 prompts); EDD checklist; teacher questionnaire II; interview protocol
Procedures/Analyses: In the main study, 10 ESL teachers assessed 480 TOEFL essays using the EDD checklist. The scored data were analyzed to examine the dimensional structure of ESL writing, and the diagnostic quality of the estimated model was then examined using the Reduced RUM. The teachers also completed a questionnaire and were interviewed to evaluate the use of the EDD checklist.

Research question 5: To what extent does the EDD checklist help teachers make appropriate diagnostic decisions and have the potential to positively impact teaching and learning ESL academic writing?
Participants: 10 ESL teachers
Instruments/Data: Questionnaire and interview results
Procedures/Analyses: The teachers' questionnaire and interview results were analyzed for their positive or negative reactions to the use of the EDD checklist.
Phase 1
1) What empirically-derived diagnostic descriptors are relevant to the construct
of ESL academic writing?
The primary purpose of Phase 1 was to identify descriptors that are relevant to
the construct of ESL academic writing. Nine ESL teachers participated in a think-aloud
session to verbalize their thought processes while assessing and providing feedback on
10 TOEFL iBT independent essays. These verbal accounts provided rich descriptions of
ESL academic writing ability and served as the basis for constructing the pool of EDD descriptors. The recorded verbal data were fully transcribed and coded iteratively in order
to identify distinct ESL academic writing subskills and textual features. Four ESL
academic writing experts then reviewed the identified descriptors and sorted them into
dimensionally distinct writing skills. Based upon the experts' review and sorting outcomes, the EDD checklist was constructed.
Phase 2
2) How generalizable are the scores derived from the EDD checklist across
different teachers and essay prompts?
3) How is performance on the EDD checklist related to performance on other
measures of ESL academic writing?
4) What are the characteristics of the diagnostic ESL academic writing skill
profiles generated by the EDD checklist?
5) To what extent does the EDD checklist help teachers make appropriate
diagnostic decisions and have the potential to positively impact teaching and
learning ESL academic writing?
The primary purpose of Phase 2 was to pilot, model, and evaluate the EDD
checklist. Eleven ESL teachers participated in Phase 2, with seven participating in the
Phase 2 pilot study and ten participating in the Phase 2 main study. Six teachers
participated in both the pilot and main studies. The seven ESL teachers who participated
in the pilot study assessed 80 TOEFL iBT independent essays and preliminarily evaluated whether the checklist functioned as intended. Once the checklist's functionality was confirmed, 10 ESL teachers participated in the main study to assess 480
TOEFL iBT independent essays and to evaluate the use of the checklist. The validity assumptions underlying the four research questions in Phase 2 were critically examined from diverse perspectives using multiple data sources. In order to gain a
comprehensive view of the use of the EDD checklist, both quantitative and qualitative
data were collected and analyzed, and findings were integrated and synthesized in a
complementary manner.
Participants
TOEFL iBT Writing Test Participants
The TOEFL iBT writing test participants consisted of 480 ESL learners who took
the test at domestic (i.e., North American) or international test centers. Half of the test-
takers participated in the TOEFL iBT administration in the fall of 2006 (hereafter Form
1), and the other half participated in the spring of 2007 (hereafter Form 2). Test-takers
were 14 to 51 years of age (M=23.61, SD=6.40), and approximately the same percentage
of male and female test-takers participated in each test administration. Test-takers came from 76 different countries and spoke 52 different first languages. Chinese was the most common first language, followed by Korean, Spanish, and Japanese (see Table 3). When the distribution
of test-takers was examined according to language group, the number of test-takers who
spoke non-Indo-European languages (59.58%) was greater than the number of test-takers
who spoke Indo-European languages (40.42%; see Table 4). Test-takers' primary reason for taking the TOEFL was to enter a college or university as either an undergraduate student (18.13%) or a graduate student (21.04%).
Table 3
The Four Largest First Language Groups
First language Form 1 Form 2 Total
f % f % f %
Chinese 43 17.92 40 16.67 83 17.29
Korean 21 8.75 35 14.58 56 11.67
Spanish 29 12.08 23 9.58 52 10.83
Japanese 18 7.50 32 13.33 50 10.42
Total 111 46.25 130 54.17 241 50.21
Table 4
Distribution of Test-Takers by Language Groups
Language group Form 1 Form 2 Total
f % f % f %
Indo-European 106 44.17 88 36.67 194 40.42
Non-Indo-European 134 55.83 152 63.33 286 59.58
Total 240 100.00 240 100.00 480 100.00
ESL Academic Writing Teachers
Sixteen experienced ESL teachers were recruited from a college-level language
institute in Toronto, Canada. The recruitment process followed an ethics review protocol submitted to the ethics review board of the University of Toronto. All
ESL teachers were native English speakers with varying experience (2 to 25 years;
average 8.06 years) teaching ESL writing to adult learners. Eleven teachers held or were
pursuing a graduate degree in Applied Linguistics or Second Language Education, and
13 held a certificate in Teaching English as a Second Language (TESL). All teachers assessed themselves as familiar with and competent in assessing the written English of
non-native English speakers. Eleven teachers also reported that they had been trained to
assess ESL writing. Nine teachers participated in Phase 1, 11 teachers participated in
Phase 2, and four teachers participated in both Phases. Of the 11 teachers who
participated in Phase 2, seven participated in the pilot study, ten participated in the main
study, and six participated in both the pilot and main studies. Detailed background
information about the ESL teachers is presented in Appendix B.
ESL Academic Writing Experts
Four doctoral students with substantial knowledge and research experience in
ESL writing (hereafter referred to as ESL writing experts) were recruited from a Second
Language Education Program at a research-intensive university in Canada. The ESL
writing experts included three males and one female. Two of the experts were native
English speakers, while the other two were a native Korean and native Arabic speaker.
All of the ESL writing experts had extensive research experience related to teacher
feedback, motivation, writing conferencing, process, and assessment (see Table 5). They also had varying amounts of experience teaching ESL writing to non-native English speakers at the university level.
Table 5
Profile of ESL Writing Experts

Gary: male, aged 40-49, first language English. Teaching experience: 12 years at the university level. Research areas in ESL writing: teacher feedback and student revision; ESL writing curriculum design and program development.

Jane: female, aged 30-39, first language English. Teaching experience: 4 years at the university level. Research areas in ESL writing: motivation in ESL writers at the university level; writing pedagogy and assessment.

Anthony: male, aged 30-39, first language Korean. Teaching experience: 3 years at the university level. Research areas in ESL writing: writing conferencing via computer-mediated communication; feedback on ESL writing.

Alex: male, aged 20-29, first language Arabic. Teaching experience: 3 years at the university level. Research areas in ESL writing: collaborative writing; ESL writing process and assessment.

Note. Pseudonyms were used in order to obscure the experts' identities.
Instruments
TOEFL iBT Writing Samples
The writing samples used in this study were requested from the Educational
Testing Service (ETS) in New Jersey, U.S. ETS administered two forms of the retired
TOEFL iBT at various international and domestic test centers in the fall of 2006 and the
spring of 2007.13
The purpose of the TOEFL iBT is to assess test-takers' ability to
communicate effectively in English in an academic context, focusing on their language
skills in reading, listening, speaking, and writing. The test is delivered on computer via
the Internet, and its four sections take about four hours to complete. The TOEFL iBT
writing section consists of two tasks (one integrated and one independent), and responses
13 A retired test is one for which the operational form is no longer used and for which the items are
considered exposed.
must be typed into the computer. While the integrated task requires test-takers to write a
summary after reading and listening to a passage, the independent task requires test-
takers to write an essay based upon their knowledge and experience. The responses are
scored by two to four trained and certified ETS human raters according to a five-point
holistic rating scale.
ETS provided me with 480 TOEFL iBT independent essays written on two
different prompts (240 essays × 2 prompts) along with additional test-taker background
information. The two writing prompts are:
(a) Do you agree or disagree with the following statement? It is more important
to choose to study subjects you are interested in than to choose subjects to
prepare for a job or career. Use specific reasons and examples to support
your answer. (hereafter referred to as the subject prompt)
(b) Do you agree or disagree with the following statement? In today's world, the
ability to cooperate well with others is far more important than it was in the
past. Use specific reasons and examples to support your answer. (hereafter
referred to as the cooperation prompt)
Table 6 presents the score distribution of the 480 TOEFL iBT independent essays.
Each essay was rated by two ETS raters and the average was reported. Although few essays were awarded a score of 1 or 1.5, the score distribution approximated a bell curve. New four-digit code numbers were assigned to all essays: code numbers 1000 to 1240 indicate essays written on the subject prompt, and code numbers 2000 to 2240 indicate essays written on the cooperation prompt.
Table 6
Score Distribution of the TOEFL iBT Independent Essays
Score Subject prompt Cooperation prompt Total
f % f % f %
1.0 1 0.42 2 0.83 3 0.63
1.5 3 1.25 2 0.83 5 1.04
2.0 18 7.50 20 8.33 38 7.92
2.5 32 13.33 25 10.42 57 11.88
3.0 65 27.08 55 22.92 120 25.00
Table 6 (Continued)
Score Subject prompt Cooperation prompt Total
f % f % f %
3.5 40 16.67 45 18.75 85 17.71
4.0 32 13.33 38 15.83 70 14.58
4.5 35 14.58 28 11.67 63 13.13
5.0 14 5.83 25 10.42 39 8.13
Total 240 100.00 240 100.00 480 100.00
Think-Aloud Verbal Protocol
A think-aloud verbal protocol (Ericsson & Simon, 1993) was developed to elicit
teachers' thought processes while they were providing diagnostic feedback on the TOEFL essays. Appendix C outlines the procedures for each think-aloud session. The instructions were carefully scripted in detail, including intervention prompts and follow-up interview questions. The protocol also included a teacher background
questionnaire.
Teacher Questionnaire and Interview Protocols
A two-part questionnaire was developed to examine how teachers evaluated the
EDD checklist (see Appendix D). The first part asked about teachers' (a) personal background, (b) teaching experience, and (c) assessment experience, and the second focused on teachers' (d) evaluation of the EDD checklist. In the evaluation section, teachers were asked to determine whether each EDD descriptor was clear/not clear, redundant/non-redundant, useful/useless, and relevant/irrelevant to ESL academic writing.
Open-ended questions were also included to further investigate the strengths and
weaknesses of the EDD checklist as well as the most or least important descriptors in
assessing ESL academic writing skills. An interview protocol was also developed to
discuss how teachers felt the use of the EDD checklist would impact their teaching and
assessment practices. The guiding interview questions are outlined in Appendix E.
Data Collection and Analysis Procedures
Phase 1
Think-Aloud Verbal Protocol Procedure
An individual meeting was set up with each teacher, guided by the think-aloud verbal protocol (see Appendix C). Nine ESL teachers participated in
individual think-aloud sessions in which they verbalized their thought processes while
providing diagnostic feedback on 10 essays. In order to capture all instances of different
essay characteristics, three sets of 10 essays were carefully selected according to the
prompt on which they were written, the scores awarded by ETS raters, and the essay
length. Each essay set contained ten essays written on two different prompts covering a
wide range of score levels (5 essays × 2 prompts). Due to the small number of essays at score level 1 (n=1 for the subject prompt and n=2 for the cooperation prompt; see
Table 6), only Essay Set 3 contained essays awarded a score of 1. Essays awarded a score
of 1.5, 2.5, 3.5, or 4.5 were not selected because these essays revealed score
disagreement among ETS raters. Table 7 presents the score distribution of the 10 essays
in each set.
Table 7
Score Distribution of the Three Essay Sets
Score Essay Set 1 Essay Set 2 Essay Set 3
f % f % f %
1 0 0 0 0 2 20
2 2 20 2 20 2 20
3 4 40 2 20 2 20
4 2 20 4 40 2 20
5 2 20 2 20 2 20
Total 10 100 10 100 10 100
A textual analysis was conducted in order to examine the characteristics of these
essays. The VocabProfile English version 3.0 (Cobb, 2006) was used to calculate the
number of words, number of word types, percentages of K1 (the most frequent 1,000
word families), K2 (the most frequent 2,000 word families), and AWL (Academic Word
List) words, and lexical density. Spelling errors were corrected before the essays were
run through the program, so that misspelled words were not considered off-list. As
Appendix F shows, essays at the different score levels exhibited drastically different
profiles in terms of essay length and vocabulary sophistication. This result confirmed that
the three essay sets represented a wide range of ESL academic writing characteristics.
Three teachers were assigned to assess each essay set. The essays were randomly ordered
in order to counterbalance potential order effects. Appendix G provides the order of the
essays presented to each teacher.
When a think-aloud session was held, considerable attention was paid to the
timing with which teachers would verbalize their thought processes. Two possible verbal
reporting methods (i.e., concurrent and immediate retrospective) were introduced to and
chosen by the teachers. The concurrent think-aloud method required teachers to verbalize
their thought processes while reading and providing diagnostic feedback on essays with
no time delay and was thought to be effective in minimizing memory loss. The
immediate retrospective think-aloud method allowed teachers to read an essay first either
silently or aloud, and then to speak their thoughts aloud. Although the retrospective
method increased the potential for memory loss, it would improve concentration by
allowing teachers to read the essays without interruption. Teachers were provided with an
explanation of the two think-aloud methods, and were allowed to choose the method they
thought would work best for them. After trying both methods, three ESL teachers
selected the concurrent method, and six chose the retrospective method. The teachers
who preferred the retrospective method reported that the concurrent method interfered
with the reading process and did not effectively elicit their natural cognitive responses.
Each time the teachers completed a think-aloud report, they were interviewed in
order to clarify any unclear statements or ambiguous comments they had made. The role
of interviewer was minimized as much as possible so as not to unduly influence the
feedback. After they had completed the think-aloud process for all 10 essays, teachers
were asked to assign a mark to each using the TOEFL iBT independent writing rating
scale. A comparison of the scores awarded by ETS raters and those awarded by the
teachers was made in order to determine whether the teachers' assessments were
consistent with the ETS assessments and whether their verbal report data were a reliable
enough source to be used in creating an assessment tool. In the follow-up interview,
teachers were also asked what skills or strategies they thought should be diagnosed in
ESL academic writing.
Each think-aloud and follow-up interview session lasted two to three hours. With
the permission of the teachers, all verbal reports and interviews were tape recorded and
immediately transcribed. Teachers' background information was collected, along with the
scores that they awarded to the 10 essays. When each session was over, all of the
assessment materials were collected for security purposes. The verbal data were
transcribed using Microsoft® Word, and the score data were entered into Microsoft®
Excel spreadsheets. When data from each tape-recorded think-aloud session were
transcribed, any text read directly from an essay was italicized. Appendix H presents
excerpts from teachers' think-aloud verbal transcripts.
Analysis of Teachers' Think-Aloud Verbal Protocols
Teachers' verbal accounts and interview reports provided rich descriptions of
ESL academic writing ability and served as the base for constructing the pool of EDD
descriptors. Recorded verbal data were transcribed in full and then reviewed iteratively to
identify the distinct ESL academic writing subskills and textual features that would
constitute the EDD descriptors. Grounded theory (Glaser & Strauss, 1967) was the
principal methodology used to identify the emerging descriptors, with their varied
properties and dimensions. The analysis was done in several steps: first, transcripts of
each teacher's think-aloud verbal protocols were grouped by essay set. The
transcripts were then divided according to essay score levels, with those describing high-
scored essays analyzed separately from those of low-scored essays. Finally, the
transcripts from the follow-up interviews were referenced when necessary for accurate
analysis.
Transcripts ranged from 5,422 to 8,873 words (i.e., from 9 to 17 single-spaced
typed pages) per teacher (see Table 8). Each transcript was read through, categorized, and
segmented into meaningful units using the computer program NVivo 8 (QSR, 2008). The
unit of analysis was one distinct evaluation theme that characterized ESL writing
subskills and textual features, and each evaluation theme represented one distinct EDD
descriptor. Ambiguous or hard-to-interpret evaluative comments were excluded from
analysis, while comments that were too general, such as “good language,” “good
introduction,” or “I love the thought,” were disregarded because the analysis focused on
identifying fine-grained diagnostic evaluation themes. The transcripts were thus coded at
micro-level ESL writing skills in order to come up with specific EDD descriptors. In
assessing a writer‟s vocabulary knowledge, for example, several different aspects were
identified and coded (e.g., word sophistication, word variety, word choice, collocation,
etc.) instead of having one general evaluation criterion called “vocabulary”.
Table 8
Volume of the Teachers' Think-Aloud Transcripts
Teacher No. of words Length of transcripts (pages)
Ann 8,504 17
Beth 7,192 13
Esther 5,422 9
George 7,915 13
James 6,597 11
Judy 7,382 12
Sarah 6,910 12
Shelley 8,873 15
Tim 6,252 10
Mean 7,227 12
Total 65,047 112
Note. Pseudonyms were used in order to obscure the teachers' identities.
The analysis of the transcripts resulted in a final total of 1,715 segments
representing 39 EDD descriptors. Each of the 39 EDD evaluative themes was then
reviewed based upon theories of ESL writing and a variety of existing ESL writing
assessment schemes developed by Jacobs et al. (1981), Hamp-Lyons and Henning (1991),
Brown and Bailey (1984), ETS (2007), and University of Cambridge, British Council,
and IELTS Australia (2007). Descriptors in these schemes were considered the
preliminary theoretical and practical guidelines that could be used to justify the EDD
descriptors. Along with this preliminary analysis, a more in-depth theoretical review of
the descriptors was conducted later by four ESL academic writing experts.
I coded all 1,715 segments, and a second coder independently coded the original,
uncoded, segmented transcripts of each teacher's think-aloud reports on two essays (515
segments; approximately 30% of all segments) in order to examine inter-coder
reliability. The second coder was a PhD student specializing in Second Language
Education with substantial knowledge of ESL writing. She was provided with a coding
scheme consisting of 39 descriptors and agreed definitions prior to beginning work.
When discrepancies occurred between the second coder and me, the areas of
disagreement were revisited and discussed in order to facilitate resolution.
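Inter-coder agreement of this kind is commonly summarized as simple percent agreement, sometimes supplemented by a chance-corrected index such as Cohen's kappa. The sketch below illustrates both computations; the kappa supplement and the descriptor labels in the example are this sketch's additions, not statistics reported in the study.

```python
from collections import Counter

def percent_agreement(codes1, codes2):
    """Percentage of segments assigned the same descriptor by both coders."""
    assert len(codes1) == len(codes2)
    matches = sum(a == b for a, b in zip(codes1, codes2))
    return 100.0 * matches / len(codes1)

def cohens_kappa(codes1, codes2):
    """Chance-corrected agreement between two coders."""
    n = len(codes1)
    p_obs = sum(a == b for a, b in zip(codes1, codes2)) / n
    f1, f2 = Counter(codes1), Counter(codes2)
    # Expected agreement if both coders assigned labels independently.
    p_exp = sum(f1[label] * f2[label] for label in f1) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)
```

Disagreements identified this way would then be revisited and discussed, as described above.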
ESL Academic Writing Experts' Descriptor Review and Sorting
Four ESL writing experts participated in a focus group meeting to review the
EDD descriptors elicited from the teachers' think-aloud verbal data. They were provided
with six TOEFL essays (3 essays × 2 prompts) along with 39 EDD descriptors. Once the
experts had read the essays and had a general understanding of the writing context, they
were asked to review each descriptor and to discuss whether it was clear/not clear,
redundant/non-redundant, useful/useless, or relevant/irrelevant to ESL academic writing.
When necessary, the teachers' think-aloud transcripts were made available to them so that
they could better understand the ways in which the EDD descriptors were elicited. The
writing experts were also asked to determine whether each descriptor was independent of
the others and conducive to making a binary (yes or no) choice or a four-point Likert
(strongly agree, somewhat agree, somewhat disagree, or strongly disagree) choice. When
the wording of a descriptor was not clear, the experts edited it. After examining
each descriptor, they were also asked whether the descriptor pool was comprehensive
enough to cover all aspects of ESL academic writing. Any missing theoretical aspects
were added to the descriptor pool based upon existing theories of ESL academic writing.
The meeting lasted approximately two hours and was tape recorded in its entirety.
One month later, the same four ESL academic writing experts were invited to
individual meetings where they sorted the reviewed EDD descriptors into dimensionally
distinct ESL writing skills. The sorting activity proceeded in two phases: first, each
writing expert was asked to come up with his or her own skill identification scheme
while sorting the descriptors; then he or she was asked to sort the descriptors using the
predetermined sorting categories. The purpose of the first sorting activity was to examine
how ESL writing experts conceptualize the underlying structure of ESL writing ability, while
the second identified the skills-by-descriptors relationship needed to construct a Q-matrix.
The predetermined sorting categories were developed based upon both empirical and
theoretical grounds; the teachers' think-aloud verbal protocols were used as guiding
empirical sources, complemented by theories that define and assess the construct of ESL
academic writing. I read each descriptor iteratively in order to identify the writing skills
that best represented the characteristics of the descriptors. When the skills were
empirically identified, they were sequentially compared and confirmed according to
theories of ESL writing and a variety of existing ESL writing assessment schemes.
During the sorting category finalization process, the following was taken into
consideration: 1) writing skills should be conceptually distinguishable from each other; 2)
each writing skill should have a minimum number of descriptors in order to be considered
for the statistical testing of dimensionality structures; and 3) writing skills should be
comparable to those specified in existing ESL writing assessment schemes for cross-
validation purposes. The sorting categories created through this process included five
skills: content fulfillment (CON), organizational effectiveness (ORG), grammatical
knowledge (GRM), vocabulary use (VOC), and mechanics (MCH). These writing skills
were consistent with the assessment components discussed in Chapter 2 (see Table 1) and
consistent with the assessment criteria described in Jacobs et al.'s (1981) scale.
Each writing expert received a set of index cards on which the reviewed EDD
descriptors were reproduced. The experts were first asked to skim through these cards,
and then to sort them into piles that they thought represented distinct ESL writing skills.
This first sorting activity was conducted based solely on the experts' own skill
configuration. When they thought a descriptor was associated with multiple writing skills,
they labeled the skills as primary or secondary. In the second phase, the experts used the
predetermined sorting categories to assign the descriptors to appropriate skills. The
purpose of this second sorting activity was to construct a Q-matrix. Before they sorted
the descriptors, the sorting categories were explained and the experts were asked whether
they thought the five writing skills (content fulfillment, organizational effectiveness,
grammatical knowledge, vocabulary use, and mechanics) were comprehensive enough to
represent the content of all of the descriptors. Detailed definitions or descriptions of each
skill category were not provided, so that the experts' mapping of the descriptors onto
skills was not restricted. They were further asked to mark those descriptors that matched
with multiple or none of the skill categories. The sorting activity lasted approximately
one hour for each expert, and their verbal accounts were tape recorded with their
permission.
The EDD checklist was constructed based upon the experts' review outcomes
derived from the focus group meeting. Overlapping concerns or suggestions were taken
into consideration when the descriptors were refined and finalized to constitute the EDD
checklist. The refinement process was iterative, and careful attention was paid to each
descriptor's wording. When the refinement was completed, two marking boxes, labeled
yes or no, were attached to each descriptor to create a checklist form (see Appendix I).
The experts' sorting outcomes were also reviewed carefully in order to identify areas of
substantial agreement or disagreement. The results of the second sorting activity identified
the skills-by-descriptors relationships that ultimately formed the Q-matrix.
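The resulting Q-matrix is, in effect, a binary descriptors-by-skills incidence matrix. A minimal sketch follows; the descriptor names and entries below are hypothetical illustrations, not actual EDD descriptors or mappings.

```python
# Columns follow the five finalized skills:
# content fulfillment (CON), organizational effectiveness (ORG),
# grammatical knowledge (GRM), vocabulary use (VOC), mechanics (MCH).
SKILLS = ["CON", "ORG", "GRM", "VOC", "MCH"]

# Hypothetical descriptor entries; 1 means the descriptor taps that skill.
Q_MATRIX = {
    "uses sophisticated vocabulary": [0, 0, 0, 1, 0],
    "supports the thesis with relevant examples": [1, 0, 0, 0, 0],
    "maintains subject-verb agreement": [0, 0, 1, 0, 0],
}

def skills_for(descriptor):
    """Return the skills a descriptor is mapped onto in the Q-matrix."""
    return [s for s, flag in zip(SKILLS, Q_MATRIX[descriptor]) if flag]
```

In diagnostic modeling, each row of this matrix determines which skill parameters contribute to a student's probability of receiving a yes on that descriptor.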
Phase 2
Pilot Study
ESL academic writing teachers' essay assessment
Seven ESL writing teachers participated in the pilot study to assess 80 TOEFL
iBT independent essays. Forty TOEFL iBT independent essays were selected from each
of two essay pools using a stratified sampling procedure (40 essays × 2 prompts) and
formed into essay batches. Each essay batch consisted of 10 essays representing all
proficiency levels on the two prompts (5 levels × 2 prompts).14 Table 9 shows the
distribution of the essay batches assigned to the teachers. Each teacher assessed three
essay batches, with one essay batch (Batch 03) assessed by all seven teachers and the
remaining seven batches assessed by two different teachers.15 The three essay batches
assigned to each teacher were ordered by the prompt and counterbalanced. Three teachers
assessed essays that were written on the subject prompt first, and the other four teachers
14 The five levels were roughly determined because each level did not have an equal number of essays (see Table 6 for the score distribution of the essays).
15 Batch 03 functioned as an anchor set linking all assessment facets in the FACETS analysis.
assessed essays that were written on the cooperation prompt first.
Rater training was held prior to the teachers' essay assessment. The purpose of
the training was to orient the teachers to the EDD checklist, not to clone them to achieve
high inter-rater reliability. An individual meeting was set up with each of the teachers to
explain the purpose of the study and to outline the checklist‟s development procedure in
greater detail.16 Each descriptor was explained using concrete examples; the yes or no
option was also explained, and the difficulty of determining a cut-off for yes or no was
acknowledged. The general rule of thumb was that if a teacher thought that a writer
generally met the criteria of the descriptor, it was considered a yes; otherwise it was
considered a no. The term generally referred to a state in which the teacher was neither
distracted nor hindered in comprehension by a student's mistakes on the skill being
assessed. The training was informal in order to minimize potential
psychological pressure that might affect the teachers‟ assessment.
Table 9
Distribution of Essay Batches in the Pilot Study
Teacher Essay batch
Angelina Batch 03 Batch 01 Batch 07
Ann Batch 03 Batch 05 Batch 08
Beth Batch 03 Batch 01 Batch 05
Brad Batch 03 Batch 04 Batch 06
Esther Batch 03 Batch 02 Batch 08
Susan Batch 03 Batch 04 Batch 07
Tom Batch 03 Batch 02 Batch 06
In order to ensure that the training had been successful, the teachers were asked
to assess one essay sample using the EDD checklist. Specifically, they were asked to (a)
make a yes or no decision for each descriptor, and (b) indicate their confidence levels on
each descriptor. The rationale for requiring teacher confidence levels was to identify the
descriptors that the teachers felt were difficult to use. They were asked to indicate their
confidence levels anywhere along the continuum between 0% and 100%, with 0%
indicating the lowest confidence level and 100% indicating the highest. While the
16 Training with one teacher was delivered via email because she could not attend on-site training.
teachers were marking the essay, they were left alone in a quiet room. The assessment
took approximately 15 minutes, after which a debrief session was held. Teachers were
asked to report any concerns with or suggestions for using the EDD checklist; retraining
took place, if necessary, depending on the gravity of these concerns. The entire training
session lasted approximately one hour.
Upon completing their training, the teachers were provided with an assessment
package containing (a) 30 essays (15 essays × 2 prompts), (b) the EDD checklist (see
Appendix J), (c) the assessment guidelines (see Appendix K),17 and (d) the Teacher
Questionnaire I. They were asked to assess 30 essays, but to indicate their confidence
levels on just 10 essays (5 essays × 2 prompts) in order to save time and to let them
focus on the assessment itself. The turnaround time for assessment results was within two
weeks of the training date. Once the teachers had completed their assessments, they were
asked to fill out the Teacher Questionnaire I and were interviewed for 30 to 45 minutes.
The interview focused on their evaluations of the checklist's quality and effectiveness.
All assessment materials were collected after the assessments for security purposes. The
score data were entered into Microsoft® Excel spreadsheets, and the questionnaire data,
including teacher background information, and the interview data were entered into
Microsoft® Word. When score data were entered, a yes response was treated as “1” and a
no response was treated as “0”.
Preliminary analyses of the EDD checklist
The data collected in the pilot study were analyzed in order to examine the
validity assumptions concerning the use of the EDD checklist and to further fine-tune the
methodology of the main study. Three validity assumptions were examined:
The scores derived from the EDD checklist are generalizable across different
teachers and essay prompts (Teacher and essay prompt effects).
Performance on the EDD checklist is related to performance on other measures
of ESL academic writing (Correlation between EDD scores and TOEFL scores).
The EDD checklist helps teachers make appropriate diagnostic decisions and has
the potential to positively impact teaching and learning ESL academic writing
(Teacher perceptions and evaluations).
17 The assessment guidelines were carefully scripted so that they could be used as a reference for the teachers.
a. Analysis of teacher and prompt effects
A Many-faceted Rasch Model (MFRM) was used to examine the extent to which
the student writing scores obtained from a sample of teachers on a sample of essay
prompts were generalizable beyond that specific set of teachers and essay prompts.
MFRM estimates the latent ability of a student while taking test conditions into
consideration. The rating probability for a particular student on a certain descriptor from
a particular teacher can be predicted mathematically from given facets, such as the ability
of the student, the difficulty of the descriptor, and the severity of the teacher. All facets
are placed on a single common logit scale, with the measurement units expressed as
logits.
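For dichotomous yes/no ratings, this model can be written as a logistic function of the additive facet measures. The sketch below uses illustrative parameter values, not estimates from this study:

```python
import math

def p_yes(ability, difficulty, severity, prompt=0.0):
    """Probability of a 'yes' rating under a dichotomous many-facet Rasch
    model: log-odds(yes) = ability - difficulty - severity - prompt,
    with every facet measure expressed in logits."""
    logit = ability - difficulty - severity - prompt
    return 1.0 / (1.0 + math.exp(-logit))
```

A student whose ability exactly matches the combined descriptor difficulty and teacher severity has a 0.5 probability of a yes; each additional logit of ability raises the odds by a factor of e.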
The 7,326 valid ratings awarded by the seven teachers using the 35 descriptors
on the 80 essays were entered into the MFRM computer software, FACETS version
3.66.0 (Linacre, 2009).18 Ten essays were assessed by all seven teachers, and the
remaining 70 essays were assessed by two different teachers so that the data matrix was
partially crossed. In the model specification, four facets were specified: student, prompt,
teacher, and descriptor. While the student and descriptor facets were centered by
anchoring logit means at zero, the teachers were allowed to float because the analysis of
interest was focused on the teacher behavior in using the EDD checklist. The prompt
facet was entered as a dummy facet and anchored at preset values of 0.03 logits (for the
cooperation prompt) and -0.03 logits (for the subject prompt), respectively.19 Anchoring
was necessary in order to connect the two separate essay subsets in which each student
wrote a single essay on one prompt only. The preset values of 0.03 and -0.03 logits were
derived from a preliminary analysis that showed the subject prompt (difficulty measure =
0.03 logits) was more difficult than the cooperation prompt (difficulty measure = -0.03
logits).
The analysis of the teacher and prompt effects was conducted using multiple
methods. First, teacher internal consistency was examined: teachers who exhibited
misfitting or overfitting rating patterns were detected based on infit and outfit mean
square values. In addition, inter-teacher reliability was examined in order to explore the
18 Each teacher assessed 30 essays using the 35 descriptors, and there were 24 missing responses ([7 teachers × 30 essays × 35 descriptors] - 24 ratings = 7,326 ratings).
19 Dummy facets are intended to investigate interactions without affecting main effects (Linacre, 2009).
degree to which one teacher agreed with others when using the EDD checklist. Three
reliability indices were computed: (a) the percentage of exact agreement, (b) point-
biserial correlation, and (c) the percentage of the teachers' ratings that agreed on each
descriptor. Finally, a bias analysis was carried out in order to further investigate the ways
in which the descriptors interacted with the teachers and the prompts. Score
generalizability could not be examined across different prompts because each student
wrote only one essay, on one prompt.
Instead, the extent to which the EDD descriptors are biased for or against the prompts
was examined to determine whether the descriptors functioned consistently across
different essay prompts.
b. Analysis of correlation between EDD scores and TOEFL scores
A correlation analysis was conducted in order to examine the extent to which
scores awarded using the EDD checklist were consistent with those awarded using the
TOEFL iBT independent writing rating scale. Specifically, a Pearson product-moment
correlation coefficient was computed to estimate the strength of the association between
the logit scores elicited from the MFRM analysis and the original TOEFL iBT
independent writing scores awarded by ETS raters on the 80 essays.
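The Pearson coefficient itself is straightforward to compute. The sketch below shows the computation on made-up score vectors, not the study's actual logit and TOEFL scores:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

A coefficient near 1.0 would indicate that the EDD-derived logit scores rank the essays in much the same order as the ETS raters' TOEFL scores.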
c. Analysis of teacher perceptions and evaluations
The examination of teacher perceptions and evaluations of the use of the EDD
checklist focused on their reported confidence levels and their responses to the
questionnaire and in the interviews. The extent to which the teachers felt confident using
the EDD checklist was examined using descriptive statistics. A mean was calculated in
order to examine the degree to which the teachers felt confident in their assessments
across the 35 descriptors on 10 essays (5 essays × 2 prompts). The descriptors with the
highest and lowest confidence levels were also identified. To further explore the
relationship between teacher confidence and agreement, the two sets of scores were
plotted on the same graph. Teacher responses to the questionnaire and interviews
were also analyzed in order to examine how they judged the use of the EDD checklist.
The responses to the Likert-scale items were analyzed according to frequency, and the
responses to the open-ended items were analyzed descriptively. The teachers' written
comments and interview transcripts were read iteratively in order to identify positive and
negative reactions to the use of the EDD checklist. The interview results were then
integrated with those collected in the main study in order to develop a more
comprehensive picture of the teachers‟ evaluations.
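The descriptive confidence summary described above amounts to a per-descriptor mean with the extremes flagged. A minimal sketch, using hypothetical descriptor names and ratings:

```python
def confidence_summary(confidence):
    """Summarize 0-100 confidence ratings collected per descriptor.

    confidence maps each descriptor to the list of ratings it received;
    the names and values in the example below are hypothetical."""
    means = {d: sum(vals) / len(vals) for d, vals in confidence.items()}
    highest = max(means, key=means.get)
    lowest = min(means, key=means.get)
    return means, highest, lowest
```

Descriptors with low mean confidence would be candidates for rewording or removal in the checklist refinement.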
Main Study
ESL academic writing teachers' essay assessment
The main study was carried out two months after the pilot study was conducted.
Ten ESL teachers assessed 480 TOEFL iBT independent essays using the EDD checklist.
These essays were divided into 40 essay batches, with each essay batch consisting of 12
essays representing all proficiency levels on the two prompts (6 essays × 2 prompts).
Table 10 shows the distribution of the essay batches assigned to the teachers. Unlike the
pilot study, it was not necessary to include a linking essay subset because the analytic
technique employed in the main study did not require a crossed data matrix. The teachers
were assigned four essay batches which were further divided into two assessment
packages. Each of the two assessment packages included 24 essays written on the two
prompts (12 essays × 2 prompts) and was counterbalanced. Five teachers assessed essays
that were written on the subject prompt first, while the other five teachers assessed essays
that were written on the cooperation prompt first.
Table 10
Distribution of Essay Batches in the Main Study
Teacher Essay batch
Angelina Batch 01 Batch 02 Batch 03 Batch 04
Ann Batch 05 Batch 06 Batch 07 Batch 08
Beth Batch 09 Batch 10 Batch 11 Batch 12
Brad Batch 13 Batch 14 Batch 15 Batch 16
Erin Batch 17 Batch 18 Batch 19 Batch 20
Greg Batch 21 Batch 22 Batch 23 Batch 24
Kara Batch 25 Batch 26 Batch 27 Batch 28
Sarah Batch 29 Batch 30 Batch 31 Batch 32
Table 10 (Continued)
Teacher Essay batch
Susan Batch 33 Batch 34 Batch 35 Batch 36
Tom Batch 37 Batch 38 Batch 39 Batch 40
Training took place in the same manner as in the pilot study. The teachers
engaged in individual meetings to discuss the purpose of the study and the assessment
procedure. Training with the four teachers who did not participate in the pilot study was
intensive, while training with the six teachers who participated in the pilot study focused
primarily on their questions and concerns about using the checklist according to the
revised assessment guidelines (see Appendix L).20 Upon completion of training, teachers
were given the first assessment package containing (a) 24 essays (12 essays × 2 prompts),
(b) the EDD checklist, and (c) the assessment guidelines. The teachers were asked to
assess 24 essays, but to indicate their confidence levels on just 10 essays (5 essays × 2
prompts). The turnaround for the first assessment results was within two weeks of the
training date. When the teachers returned their first assessment outcomes, they were
interviewed for 30 to 45 minutes to discuss the effectiveness of the checklist.
The second assessment took place two weeks after the first. Of the 10 teachers
who participated in the first assessment, eight went on to participate in the second
assessment. Two teachers were unable to participate for personal reasons, and the essays
assigned to them were scored by other participating teachers based upon availability.
Four teachers each marked a set of 24 essays (12 essays × 2 prompts); to make up the
remaining assessments, one teacher marked 48 essays (24 essays × 2 prompts) and three
teachers each marked 32 essays (16 essays × 2 prompts). The second assessment package
contained (a) essays written on two prompts, (b) the EDD checklist, (c) the assessment
guidelines, and (d) the Teacher Questionnaire II, and was distributed to the teachers with
a reminder that there should be at least a two-week interval between the first and the
second assessment; this was done to examine whether the teachers could use the EDD
checklist reliably, and to determine how their perceptions of the EDD checklist changed
over time. After the teachers completed the second assessment, they were administered
20 The assessment guidelines used in the main study were slightly revised based upon the teachers' comments in the pilot study in order to enhance the clarity of the descriptors.
the Teacher Questionnaire II and were interviewed. Unlike the first interview, this second
interview focused specifically on the extent to which the teachers thought the use of the
checklist would have a positive impact on classroom instruction and assessment. The two
teachers who were unable to participate in the second assessment round completed
Teacher Questionnaire II after the first assessment round. All interviews were tape
recorded with the permission of the teachers. When the entire assessment was completed,
all of the assessment materials were collected for security purposes. The score data and
the questionnaire data, including teacher background information, were entered into
Microsoft® Excel spreadsheets, and the interview data were transcribed using Microsoft®
Word.
Main analyses of the EDD checklist
The data collected in the main study were analyzed in order to examine the
validity assumptions concerning the use of the EDD checklist. Three validity
assumptions were examined:
(a) The EDD checklist provides a useful diagnostic skill profile for ESL academic
writing (Characteristics of the diagnostic writing skill profiles).
(b) Performance on the EDD checklist is related to performance on other measures
of ESL academic writing (Correlation between EDD scores and TOEFL scores).
(c) The EDD checklist helps teachers make appropriate diagnostic decisions and has
the potential to positively impact teaching and learning ESL academic writing
(Teacher perceptions and evaluations).
Each facet of the assumptions provided valuable information used to justify the
validity claims for use of the EDD checklist. The results derived from the justification
process were integrated and synthesized in a complementary manner.
a. Characteristics of the diagnostic writing skill profiles
The fundamental assumption of diagnosis modeling is that the test construct of
interest is multidimensional rather than unidimensional. Under this assumption, the
ability parameter is placed onto multidimensional space representing the skills-by-items
relationship. Before estimating several parameters using diagnosis modeling, both
substantive and statistical dimensionality analyses were conducted to ensure that the
construct of ESL academic writing is multi-divisible and the diagnostic approach is well
grounded. The substantive analysis was carried out based upon the outcomes of the ESL
academic writing experts' descriptor sorting activity, in which the refined EDD
descriptors were sorted into dimensionally distinct ESL writing skills. The statistical
analysis was also conducted using a series of conditional covariance-based
nonparametric dimensionality techniques. The ratings awarded by 10 teachers on 480
TOEFL iBT independent essays constituted the primary dataset for dimensionality
analysis.
Three nonparametric dimensionality tests were implemented in this study: (a)
DIMTEST (Stout, Froelich, & Gao, 2001), (b) CCPROX/HCA (Roussos, Stout, &
Marden, 1998), and (c) DETECT (Zhang & Stout, 1999). DIMTEST is a statistical
significance test that evaluates the null hypothesis that two sets of items taken by the
same examinees, AT (assessment subtest) and PT (partitioning subtest), are
dimensionally similar to each other. AT items are selected either in an exploratory or
confirmatory manner based upon theoretical considerations including expert review or
empirical data analysis, such as cluster analysis. When the null hypothesis is rejected, the
dimensionality test statistic, T, is consulted in order to estimate the magnitude of the
AT set's dimensional distinctiveness. A greater T value indicates a greater departure from
unidimensionality.
CCPROX/HCA is an exploratory item cluster analysis that neither conducts a
significance test nor provides the magnitude of multidimensionality. Instead, it presents
the dimensional structure of a test visually: each item begins as its own cluster, and pairs
of clusters judged to be dimensionally homogeneous are successively combined until all
of the items are joined into one large cluster. Of the many methods for determining
proximity between clusters, the unweighted pair group method of averages
(UPGMA; Sokal & Michener, 1958) has been known to provide the most accurate item
classification (Douglas, Kim, Roussos, Stout, & Zhang, 1999). In order to achieve the
best cluster solution, it has been recommended that other dimensionality procedures
(such as DIMTEST, DETECT, and content review) be used in conjunction with the
CCPROX/HCA analysis (Douglas et al., 1999).
DETECT is an exploratory or confirmatory nonparametric dimensionality
technique that estimates the number of dimensions present in a test and the magnitude of
multidimensionality. It also identifies dimensionally homogeneous clusters by calculating
the mean conditional covariance between all possible pairs of items in a test. The output
of DETECT analysis presents three useful indices including (a) DETECT index (or effect
size), (b) IDN index, and (c) r index. DETECT index is an overall conditional-covariance
estimator that indicates the magnitude of multidimensionality. According to Douglas et al.
(1999), when a DETECT index is less than 0.1, the test can be considered
unidimensional; an index between 0.1 and 0.5 indicates a weak degree of
multidimensionality; an index between 0.5 and 1.0 indicates a moderate degree of
multidimensionality; and an index between 1.0 and 1.5 indicates a strong degree of
multidimensionality. The other two indices, the IDN index and the r index, reflect
the extent to which the data approximate simple structure, with values closer to 1
indicating a closer approximation to simple structure.
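The DETECT effect-size bands reported by Douglas et al. (1999) can be summarized as a small lookup. This is an illustrative sketch only; the function name is mine and is not part of the DETECT software.

```python
def interpret_detect_index(d):
    """Map a DETECT effect size onto the multidimensionality bands
    reported by Douglas et al. (1999)."""
    if d < 0.1:
        return "unidimensional"
    elif d < 0.5:
        return "weak multidimensionality"
    elif d < 1.0:
        return "moderate multidimensionality"
    else:
        return "strong multidimensionality"

print(interpret_detect_index(0.7))  # moderate multidimensionality
```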
The latent dimensional structure of ESL academic writing ability was examined
in both exploratory and confirmatory manners. In an exploratory DIMTEST analysis, AT
items were selected using DIMTEST's built-in program, ATFIND, and were tested against the
remaining PT items several times until DIMTEST failed to reject the null hypothesis.
Each time the null hypothesis was rejected, the initial AT items were removed from the
next run. The magnitude of multidimensionality was also examined in an exploratory
DETECT analysis. An exploratory CCPROX/HCA procedure further informed the
dimensional structure of the data; its results suggested a dimensionality hypothesis
and identified items that could be used as an AT set.
DIMTEST was then conducted iteratively with varying AT sets in a confirmatory manner.
The findings from the three methods were combined in a complementary manner in order
to determine the dimensional structure of ESL academic writing ability.
Diagnosis modeling was then carried out using the Reduced RUM. The Q-matrix
developed by the ESL writing experts and the ratings awarded by the 10 teachers were
entered into the Reduced RUM computer software, Arpeggio version 3.1 (DiBello &
Stout, 2008). After the first Arpeggio run, model parameters were estimated using a
Markov Chain Monte Carlo (MCMC) algorithm, and convergence to the desired posterior
distribution was assessed after discarding the burn-in steps. Three
different types of plots were visually inspected to determine whether the Markov Chain
for each model parameter converged to a stationary solution. These plots included (a)
estimated posterior distributions, (b) chain plots, and (c) autocorrelations of the chain
estimates.
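One of these convergence checks, the autocorrelation of the chain estimates, can be sketched as follows. The chain here is simulated white noise standing in for an Arpeggio parameter chain (illustrative only); a well-mixed chain shows near-zero lag-1 autocorrelation, whereas a slowly mixing chain shows high autocorrelation.

```python
# Sketch: lag-1 autocorrelation of an MCMC chain after burn-in, one of the
# convergence checks described above (toy chain, not Arpeggio output).
import random

random.seed(0)
chain = [random.gauss(0, 1) for _ in range(5000)]  # stand-in for a parameter chain
burned = chain[1000:]                              # discard burn-in draws

n = len(burned)
mean = sum(burned) / n
var = sum((x - mean) ** 2 for x in burned) / n
lag1 = sum((burned[i] - mean) * (burned[i + 1] - mean) for i in range(n - 1)) / n
rho1 = lag1 / var
print(round(rho1, 3))  # near 0 for a well-mixed chain
```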
Given that the model estimation had converged, descriptor parameters and
estimates for the ability distribution for the skills were evaluated. The descriptor (d)
parameter estimates π*_d and r*_dk were examined in order to determine the quality of
each descriptor relative to its required skills. Specifically, the descriptor parameter π*_d
was inspected in order to estimate the probability that students had correctly executed all
skills required by a descriptor on the condition that they had mastered all required skills.
The other descriptor parameter r*_dk was inspected in order to determine the extent to
which a descriptor discriminated for the corresponding skill. When a descriptor was
found to not contribute much information for distinguishing masters from non-masters of
a given skill, the skill-by-descriptor entry was eliminated from the Q-matrix. Refinement
of the Q-matrix was carried out iteratively in a stepwise manner based upon both
substantive and statistical evidence. In addition, the skill parameter estimates p_k were
examined in order to determine whether the proportion of masters on each skill was
congruent with the skill hierarchy of ESL writing. When a skill turned out to be more
difficult or easier than suggested by ESL academic writing theories, the Q-matrix was
revised and the difficulty levels of the descriptors that were assigned to that particular
skill were examined. If necessary, the reassignment of descriptors to a skill was
considered.
After the parameters were estimated, model fit was evaluated using posterior
predictive model checking methods. A residual analysis was conducted to examine the
model fit. The mean absolute difference (MAD) between observed and predicted item
proportion-correct scores was computed, with a smaller MAD indicating a better model
fit. The fit between the observed and predicted score distributions was also visually
inspected, with the two score distributions plotted onto the same graph to facilitate
comparison. If a substantial discrepancy between the two plots was observed, further
analysis was conducted. In addition, the relationship between the number of mastered
skills and the observed total scores was examined. The monotonic relationship between
the two variables was assumed to support the claim for a good fit.
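As a minimal sketch with made-up numbers (not the study's data), the MAD statistic described above is simply the average absolute gap between the observed and model-predicted proportion-correct values.

```python
# Sketch (illustrative data): the mean absolute difference (MAD) between
# observed and model-predicted item proportion-correct scores, as used in
# posterior predictive fit checking. Smaller MAD indicates better fit.
observed  = [0.82, 0.64, 0.55, 0.71, 0.43]
predicted = [0.80, 0.66, 0.50, 0.73, 0.47]

mad = sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
print(round(mad, 4))  # → 0.03
```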
After model convergence and adequate fit were established, the quality of the diagnostic
model was examined by testing several hypotheses. The first hypothesis tested whether
the estimated diagnostic model resulted in a significant performance difference between
masters and non-masters; if the diagnostic model was well constructed, the
proportion-correct scores of masters were expected to be distinguishably higher than those
of non-masters across all the descriptors. Descriptors with weak diagnostic capacity were
identified and further analyzed. The second hypothesis tested whether the estimated
diagnostic model could accurately classify examinees into appropriate skill mastery state
categories. The number of skill masters, the skills probability distribution, and the most
common skill mastery patterns were checked to examine the accuracy of the
classification. The third hypothesis tested the consistency of the skill mastery
classification. Simulated examinee item response data were used to estimate several
reliability indices. The fourth hypothesis tested the extent to which the diagnostic model
was affected by method effect. The skill mastery profiles generated by the model were
compared across the two essay prompts. The fifth and final hypothesis tested whether the
estimated diagnostic model resulted in significantly different skill profiles across
different writing proficiency levels. The 480 students were categorized into beginner,
intermediate, and advanced groups according to their TOEFL independent writing scores,
and the characteristics of their writing skill profiles were compared. In addition to
evaluating the five hypotheses, a case analysis was conducted to closely examine the
quality of the estimated skill mastery profiles.
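The first of these hypotheses, for instance, amounts to a per-descriptor comparison of proportion-correct scores between the two groups. A toy sketch with hypothetical dichotomous responses (invented for illustration, not the study's data):

```python
# Sketch (toy data): on a well-functioning descriptor, masters' proportion-
# correct should sit clearly above non-masters'.
masters     = [1, 1, 1, 0, 1, 1, 1, 1]   # hypothetical item responses
non_masters = [0, 1, 0, 0, 1, 0, 0, 0]

p_m  = sum(masters) / len(masters)       # 0.875
p_nm = sum(non_masters) / len(non_masters)  # 0.25
print(p_m - p_nm)  # 0.625 gap: the descriptor discriminates
```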
b. Analysis of correlation between EDD scores and TOEFL scores
A correlation analysis was conducted to examine the extent to which scores
awarded using the EDD checklist were consistent with those awarded using the TOEFL
iBT independent writing rating scale. Specifically, a Pearson product-moment correlation
coefficient was computed on the observed scores awarded by the teachers using the EDD
checklist and the original TOEFL iBT independent writing scores awarded by ETS raters
on the 480 essays in order to estimate the strength of the association between the two sets
of scores.
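As a sketch of the computation (with hypothetical scores, not the study's data), the Pearson product-moment coefficient can be computed directly from the two score vectors:

```python
# Sketch: Pearson product-moment correlation between two sets of essay
# scores (toy values for illustration).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

edd_scores   = [12, 18, 25, 31, 36]   # hypothetical EDD checklist totals
toefl_scores = [1, 2, 3, 4, 5]        # hypothetical TOEFL iBT writing scores
print(round(pearson_r(edd_scores, toefl_scores), 3))  # → 0.999
```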
c. Analysis of teacher perceptions and evaluations
The examination of the teachers' perceptions and evaluations of the use of the
EDD checklist focused primarily on their responses to the questionnaires and
interviews. It should be noted that their confidence data could not be analyzed because of
too many missing responses. Teachers were reluctant to report their confidence levels in
the main study for two reasons: reporting confidence levels for all the descriptors
required too much time, and teachers felt that their ratings were affected by the act of
indicating their confidence levels. Some teachers also mentioned that this caused them to
feel monitored by the researcher. For these reasons, teachers' confidence levels were not
analyzed or reported in the main study; however, their questionnaire and interview
responses were usable. The responses to the Likert-scale items on the questionnaire were
analyzed descriptively according to frequency. The qualitative accounts from the
questionnaires and interviews were examined using a thematic data analytic method.
Teachers' questionnaire comments and interview transcripts were read iteratively, and
emerging themes associated with their evaluations were identified. Each theme was
constantly compared with others, with similar themes grouped together. The results from
the quantitative and qualitative analyses were then integrated and synthesized when the
study's findings were interpreted and discussed.
Summary
This chapter proposed five research questions formulated based upon the
reasoning process of validity arguments:
1) What empirically-derived diagnostic descriptors are relevant to the construct
of ESL academic writing?
2) How generalizable are the scores derived from the EDD checklist across
different teachers and essay prompts?
3) How is performance on the EDD checklist related to performance on other
measures of ESL academic writing?
4) What are the characteristics of the diagnostic ESL academic writing skill
profiles generated by the EDD checklist?
5) To what extent does the EDD checklist help teachers make appropriate
diagnostic decisions and have the potential to positively impact teaching and
learning ESL academic writing?
Each research question addressed one facet of the validity inferences for the score-
based interpretation and use of the EDD checklist in ESL academic writing. Together, the
questions guided a set of comprehensive procedures for the development of the checklist
and for the justification of its score-based interpretations and uses. A mixed methods research design
was chosen in order to build and support arguments that the EDD checklist assesses ESL
writing ability required in an academic context and provides fine-grained diagnostic
information about various writing skills. A series of validity assumptions determined the
types of data to be collected, which were then analyzed and synthesized using both
quantitative and qualitative methods. The next three chapters discuss the evaluation of a
series of validity claims for the use of the EDD checklist.
CHAPTER 4
DEVELOPMENT OF THE EDD CHECKLIST
Introduction
This chapter discusses the development of the EDD checklist conducted in Phase
1. One central validity claim was that the descriptors that constitute the checklist reflect
knowledge, processes, and strategies consistent with the construct of ESL writing
required in an academic context. In order to evaluate this assumption, fine-grained
descriptors representing ESL academic writing skills were empirically identified using
detailed verbal descriptions of ESL academic writing ability provided by nine ESL
teachers. The think-aloud verbal protocols were open-coded based upon grounded theory
and sequentially confirmed by theoretical accounts. Four ESL academic writing experts
reviewed and refined the identified descriptors to come up with the final EDD checklist.
In this chapter, the empirically-derived descriptors are systematically validated based
upon theories found in the ESL academic writing assessment literature in order to make a
theory-based inference about the checklist's characteristics.
Identification of EDD Descriptors
Writing in a second language (L2) is a cognitively complex communicative act,
involving the use of multi-faceted and complicated language skills and knowledge that
directly and indirectly affect writing performance. This multidimensional view of L2
writing was evident from the think-aloud verbal protocols collected and analyzed in
Phase 1 of this study. Initial review of these protocols indicated that ESL teachers
considered a variety of subcomponents of ESL writing skills and knowledge when
determining the quality of an essay. The sheer amount of data they provided also
confirmed the depth and comprehensiveness of the teachers' accounts. The coding of
these protocols resulted in the identification of 39 recurrent writing subskills that formed
the descriptors of the EDD checklist (see Table 11). These descriptors were empirically-
derived, concrete, and fine-grained, addressing all aspects of ESL writing skills such as
content fulfillment, organizational effectiveness, grammatical knowledge, vocabulary use,
and mechanics.
Table 11 lists all 39 descriptors of ESL academic writing and the number of
times they occurred during think-aloud verbalization. The total descriptor tally was 1,715,
of which spelling (D34²¹, 6.06%), essay structure (D9, 5.42%), verb tense (D22, 4.90%),
tone and register (D39, 4.78%), and essay clarity (D2, 4.72%) were the five most
frequently mentioned. By contrast, essay focus (D14, 0.87%), indentation (D37, 0.58%),
use of conditional verbs (D28, 0.52%), syntactic variety (D16, 0.35%), and paraphrasing
(D38, 0.29%) were the least frequently commented upon.
²¹ D34: Descriptor 34. Hereafter, the notation "D + number" will indicate "Descriptor + number."
Table 11
39 Descriptors of ESL Academic Writing Skills
Descriptor f %
1. This essay demonstrates an understanding of the topic and answers a specific question. 67 3.91
2. This essay is written clearly enough to be read without inferring or interpreting the meaning. 81 4.72
3. This essay is concise, containing few redundant ideas or linguistic expressions. 21 1.22
4. The beginning of the essay contains a clear thesis statement. 34 1.98
5. The main arguments in this essay are strong. 47 2.74
6. There are sufficient supporting ideas and examples in this essay. 24 1.40
7. The supporting ideas and examples in this essay are logical and appropriate. 64 3.73
8. The supporting ideas and examples in this essay are specific and detailed. 21 1.22
9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion. 93 5.42
10. Each paragraph is complete, with a clear topic sentence tied to its supporting sentences. 34 1.98
11. Each paragraph presents one distinct and unified idea in a coherent way. 28 1.63
12. Each paragraph links well to the rest of the essay. 18 1.05
13. Ideas are developed or expanded throughout each paragraph. 58 3.38
14. Ideas reflect the central focus of the essay, without digressing. 15 0.87
15. Transition devices are used effectively. 56 3.27
16. Syntactic variety is demonstrated in this essay. 6 0.35
17. Complex sentences are used effectively. 53 3.09
18. Normal word order is followed except in cases of special emphasis. 19 1.11
19. Sentences are well-formed and complete, and are not missing necessary components. 62 3.62
20. Independent clauses are joined properly, using a conjunction and punctuation, with no run-on sentences or
comma splices. 41 2.39
21. Major grammatical or linguistic errors impede comprehension. 42 2.45
22. Verb tenses are used appropriately. 84 4.90
23. There is agreement between subject and verb. 64 3.73
24. Singular and plural nouns are used appropriately. 40 2.33
Table 11 (Continued)
Descriptor f %
25. Prepositions are used appropriately. 44 2.57
26. Articles are used appropriately. 52 3.03
27. Anaphora (i.e., pronouns) reflects appropriate referents. 51 2.97
28. Conditional verb forms are used appropriately. 9 0.52
29. Sophisticated or advanced vocabulary is used. 48 2.80
30. A wide-range of vocabulary is used, with minimal repetition. 16 0.93
31. The meaning of vocabulary is understood correctly and used in the appropriate context. 53 3.09
32. The essay demonstrates facility with collocations, and does not contain unnatural word-by-word translations. 25 1.46
33. Words change their forms where necessary and appropriate. 59 3.44
34. Words are spelled correctly. 104 6.06
35. Punctuation marks are used correctly. 66 3.85
36. Capital letters are used appropriately. 19 1.11
37. The essay contains appropriate indentation. 10 0.58
38. The essay prompt is well-paraphrased, and is not replicated verbatim. 5 0.29
39. Appropriate tone and register are used throughout the essay. 82 4.78
Total 1,715 100.00
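The percentage column in Table 11 follows directly from the frequency column and the total tally of 1,715; for example, for the most frequent descriptor, spelling (D34):

```python
# Sketch: how Table 11's percentage column is derived from its frequency
# column (descriptor frequency / total descriptor tally of 1,715).
total = 1715
freq_spelling = 104   # D34, the most frequent descriptor
pct = 100 * freq_spelling / total
print(f"{pct:.2f}%")  # → 6.06%
```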
When inter-coder reliability was examined, satisfactory agreement (450/515
segments, 87.38%) was found. Agreement at the individual descriptor level was also
reasonable, ranging from 70% to 100% (see Table 12). The areas of least agreement (i.e.,
< 80%) were idea development (D13, 70%) and use of collocations (D32, 70%),
followed by syntactic variety (D16, 75%), word sophistication (D29, 75%), word choice
(D31, 76.92%), and use of punctuation marks (D35, 78.57%). The second coder and I re-
examined the areas of disagreement in order to resolve discrepancies. Most
disagreements were reconciled following discussion. When no agreement was reached, I
decided which code would be assigned.
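The agreement figures above are simple percent agreement between the two coders' segment codes. A toy sketch (the codes below are invented for illustration; the study's actual overall figure was 450/515 = 87.38%):

```python
# Sketch: percent agreement between two coders over think-aloud segments
# (toy codes for illustration).
coder1 = ["D34", "D09", "D22", "D02", "D13", "D32", "D34", "D09"]
coder2 = ["D34", "D09", "D22", "D02", "D31", "D32", "D34", "D15"]

agreed = sum(a == b for a, b in zip(coder1, coder2))
pct = 100 * agreed / len(coder1)
print(f"{agreed}/{len(coder1)} = {pct:.2f}%")  # → 6/8 = 75.00%
```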
Table 12
Inter-Coder Reliability for the 39 Descriptors
Descriptor No. of segments No. of agreed segments Agreement (%)
D01 20 17 85.00
D02 20 18 90.00
D03 0 0 –
D04 27 24 88.89
D05 26 22 84.62
D06 10 8 80.00
D07 20 18 90.00
D08 4 4 100.00
D09 35 29 82.86
D10 15 12 80.00
D11 13 12 92.31
D12 6 6 100.00
D13 10 7 70.00
D14 0 0 –
D15 14 12 85.71
D16 8 6 75.00
D17 24 20 83.33
D18 4 4 100.00
D19 14 12 85.71
D20 10 8 80.00
D21 16 14 87.50
D22 26 26 100.00
D23 10 8 80.00
D24 24 22 91.67
D25 20 20 100.00
D26 10 10 100.00
D27 16 14 87.50
D28 4 4 100.00
Table 12 (Continued)
Descriptor No. of segments No. of agreed segments Agreement (%)
D29 8 6 75.00
D30 4 4 100.00
D31 13 10 76.92
D32 20 14 70.00
D33 10 8 80.00
D34 30 30 100.00
D35 14 11 78.57
D36 0 0 –
D37 2 2 100.00
D38 0 0 –
D39 8 8 100.00
Total 515 450 87.38
Note. D03, D14, D36, and D38 have no agreement values because the selected 515 segments did not include
these descriptors at all. As indicated in Table 11, the low frequency of these descriptors might have caused
a sampling problem.
Table 13 tallies the frequency of the 39 descriptors across teachers and essay sets
in greater detail. Overall frequency counts differed greatly from teacher to teacher: Ann
provided the greatest number of comments, drawing on 34 descriptors, while Judy
provided the fewest, drawing on 24 descriptors. In addition, the teachers who
were assigned to Essay Set 1 produced more comments than those assigned to Essay Sets
2 and 3. When frequency patterns were closely examined, counts seemed to be affected
by both the teachers‟ teaching experience and the length of the essay. Ann, the most
experienced teacher with 25 years‟ ESL writing experience, was assigned to Essay Set 1,
which included longer essays. Judy, with eight years‟ teaching experience, was assigned
to Essay Set 3, which included shorter essays. This conjecture is tentative at this point,
but further research may identify such possible relationships accurately.
In order to ensure that the teachers' think-aloud accounts were reliable sources
for the EDD checklist, the essay scores they awarded were correlated with the original
TOEFL iBT independent writing scores awarded by ETS raters. Appendix M presents
the correlation matrices for the nine teachers. The magnitude of the association was
strong, with Pearson product-moment correlation coefficients between pairs of scores
ranging from r = .75 to r = .98, p < .05. This result confirmed that the verbal protocols
that the teachers generated were reliable sources from which to construct an assessment
tool for ESL academic writing.
Table 13
Frequency of Descriptors by Teachers and Essay Sets
Descriptor Ann Shelley Sarah James Beth George Judy Tim Esther Essay Set 1 Essay Set 2 Essay Set 3 Total
D01 20 13 0 9 14 1 8 2 0 33 24 10 67
D02 5 17 10 7 6 8 10 16 2 32 21 28 81
D03 2 0 2 0 10 7 0 0 0 4 17 0 21
D04 1 5 12 5 2 7 0 0 2 18 14 2 34
D05 2 12 8 9 5 4 4 0 3 22 18 7 47
D06 0 6 2 2 4 7 2 1 0 8 13 3 24
D07 8 22 4 10 2 5 2 10 1 34 17 13 64
D08 5 7 0 2 1 2 0 2 2 12 5 4 21
D09 19 17 8 7 16 9 4 6 7 44 32 17 93
D10 8 1 6 4 7 7 0 0 1 15 18 1 34
D11 5 2 6 6 4 2 0 3 0 13 12 3 28
D12 7 1 4 1 5 0 0 0 0 12 6 0 18
D13 4 8 0 9 14 12 0 2 9 12 35 11 58
D14 0 0 0 2 2 11 0 0 0 0 15 0 15
D15 3 1 0 4 20 6 4 13 5 4 30 22 56
D16 0 0 0 3 1 1 0 0 1 0 5 1 6
D17 12 0 0 3 26 4 0 1 7 12 33 8 53
D18 3 0 0 0 2 3 2 3 6 3 5 11 19
D19 9 9 6 4 5 10 2 9 8 24 19 19 62
D20 3 5 4 3 5 8 2 9 2 12 16 13 41
D21 11 7 2 4 8 0 2 4 4 20 12 10 42
D22 10 9 6 1 20 9 10 12 7 25 30 29 84
D23 12 5 10 0 10 13 0 8 6 27 23 14 64
D24 2 4 4 0 2 10 12 6 0 10 12 18 40
D25 6 8 4 1 4 4 2 7 8 18 9 17 44
D26 11 0 4 0 5 17 6 2 7 15 22 15 52
Table 13 (Continued)
Descriptor Ann Shelley Sarah James Beth George Judy Tim Esther Essay Set 1 Essay Set 2 Essay Set 3 Total
D27 18 3 2 2 4 5 4 9 4 23 11 17 51
D28 3 1 0 0 2 1 2 0 0 4 3 2 9
D29 4 7 4 1 8 15 2 2 5 15 24 9 48
D30 3 5 0 1 4 0 0 2 1 8 5 3 16
D31 7 12 0 6 7 6 6 7 2 19 19 15 53
D32 7 5 0 1 7 2 0 2 1 12 10 3 25
D33 9 7 6 2 3 24 2 5 1 22 29 8 59
D34 16 11 10 4 10 18 12 15 8 37 32 35 104
D35 14 2 6 2 7 5 10 8 12 22 14 30 66
D36 2 1 0 2 1 4 2 1 6 3 7 9 19
D37 0 2 2 1 0 2 0 3 0 4 3 3 10
D38 0 2 0 0 0 0 0 2 1 2 0 3 5
D39 26 11 14 4 1 13 8 0 5 51 18 13 82
Total 277 228 146 122 254 262 120 172 134 651 638 426 1715
Note. Essay Set 1 was assigned to Ann, Shelley, and Sarah.
Essay Set 2 was assigned to James, Beth, and George.
Essay Set 3 was assigned to Judy, Tim, and Esther.
The 39 descriptors were reviewed based upon theories of ESL writing and a
variety of existing ESL writing assessment schemes before they were subjected to the
academic writing experts' substantive review and refinement process. As discussed below,
it was theoretically and practically reasonable for each of the descriptors to be included
in the EDD checklist. The next section discusses each descriptor along with empirical
and theoretical accounts related to ESL academic writing, as well as the ways in which it
manifests the quality of ESL academic writing.
Descriptor 1: This essay demonstrates an understanding of the topic and
answers a specific question.
The first descriptor that drew teachers' attention was whether the writer
addressed the given topic. Milanovic et al. (1996) termed this task realization, in
reference to the extent to which an essay meets the criteria set forth in the essay question.
Although it would seem to be a rudimentary requirement, seven teachers mentioned topic
fulfillment across all three essay sets, with a frequency of 67 (see Table 13 for frequency
tallies). For example, Ann described an essay in which the writer simply changed the
topic:
Again, that's off topic, you've told him to decide and support it with reasons and
examples, and he's gone off into changing the topic. (Ann)
Beth and Judy also pointed out cases in which the writers attempted, but failed to
answer the question:
And this person starts talking about themselves and ends up talking about other
people, so again focus on the question, what is the question asking, and is the
question answered. (Beth)
And content… the content's fine. To me, he or she is staying with the same topic
about getting a good job, and so they explain about how it's important to study at
university, try to find what you're interested in, and hoping that if you do well in
your studies you get a good job, and then talking about how it's hard to negotiate
because if they choose a certain subject for the job they won't be interested in it
for the time, um, but then hoping they'll be able to change in the future. But as
far as, does it answer the question? I'm not sure it does. (Judy)
Another consistent focus was whether the writer answered the question
completely. Some essays written on the cooperation prompt provided an incomplete
answer to the topic. James's verbal report exemplifies this well:

This one is more on topic than the first one. This one, the task is to talk about "In
today's world, the ability to cooperate well with others is more important than it
was in the past," the previous one doesn't mention the past really and this one at
least answers the question about the past. (James)22
Descriptor 2: This essay is written clearly enough to be read without inferring or
interpreting the meaning.
The second descriptor was related to the overall clarity of an essay, assessing
whether it read easily, with no extra effort required on the part of the reader to
understand the writer's meaning. This evaluation criterion is similar to what Hamp-Lyons
and Henning (1991) called communicative quality. All nine teachers responded to
the overall clarity of the essays, with frequency counts reaching 81; indeed, overall
clarity was one of the five most frequently mentioned descriptors. For instance, Sarah and
Esther commented that it was necessary to reread the text to fully understand it:
A little bit… my first read-through of the sentences, yes I didn't understand, I
had to reread but after I reread it then I understood. (Sarah)

Okay, my global feeling on that one would be I need to go back and try to read it
without talking out loud. (Esther)
Similarly, Shelley and Ann reported that they had to guess the writer's intention:

A lot of kind of summaries of what this person thinks, people think, but no clear
sense of what the writer thinks. I mean in the conclusion… um… it becomes
clearer. So, you have to guess as you read what the writer's real opinion is in the
sense of what it might be. (Shelley)

I keep thinking. Maybe it is good and maybe I'm just not getting it. It's very
nebulous, but he needs to condense it, he needs to be clearer in his production.
When he says, the argument should perhaps lay with the importance of studying
in itself. (Ann)
22 Italicized transcripts indicate text read directly from an essay.
Descriptor 3: This essay is concise, containing few redundant ideas or linguistic
expressions.
The think-aloud verbal reports indicated that not all teachers considered
conciseness a primary concern when judging essays: just four teachers in two essay sets
pointed it out, with a total frequency of 21. Nonetheless, existing rating schemes do
recognize conciseness as an important essay quality. For example, Jacobs et al.'s (1981)
rating scheme describes succinctness as a key variable affecting the coherence of written
text. Despite the low frequency count, teachers in this study conceptualized conciseness
in two different ways: idea conciseness and linguistic conciseness. George's verbal report
illustrates the former:
So the ideas are expressed quite concisely in the first paragraph. (George)
On the other hand, Beth, Judy, and Tim pointed out redundant linguistic
expressions:

I would probably consider that redundant, and have them incorporate it into the
topic sentence of the first essay, or into the thesis statement. (Beth)

But when it comes to this question I think it is hard to say which one is important,
people should consider 'both' these 'two' things carefully and make their own
choose, redundant. (Judy)

But I would like to make myself clear, that's a bit of a redundant phrase in that
it's unnecessary because obviously by writing you're doing that. (Tim)
Descriptor 4: The beginning of the essay contains a clear thesis statement.
The teachers felt that a clear thesis statement was a necessary aspect of good
writing. A well-formulated thesis statement functioned as an essay's road map, guiding
readers to the central idea on which the rest of an essay was built. It usually appeared at
the end of the first paragraph of an essay to preview the essay's main idea. The
importance of a thesis is also described in Jacobs et al.'s (1981) ESL academic writing
profiles. Seven teachers commented on a thesis statement across all three essay sets, with
a frequency of 34:

I don't see any sort of overriding thesis statement or no main, um, statement,
outlining his or her argument as to what he's going to say, so I see that as a bit of
a weakness in this introductory paragraph. (George)
Um, what else, so I think in terms of the question, the person has not really taken
a side, has not stated that they agree or disagree, but they'd try to say that, it kind
of, what did they say, they said that it's hard to say which one is more important
and that people should consider both sides. I think the person could've spent a
little bit more time on their thesis. (James)

And my argument is that the ability to cooperate well with others is far less
important than it was in the past, so there's a clear statement of what the
argument is, what this person's opinion is. (Shelley)

It's got a topic or thesis statement, it doesn't have controlling ideas in the thesis
statement but still it has a thesis statement and then the body supports the thesis
statement so that's good. (Sarah)
Descriptor 5: The main arguments in this essay are strong.
The writer's ability to present a strong argument may be the most critical
content-related evaluation criterion because the argument will significantly enhance or
downgrade essay quality. This is why Hamp-Lyons and Henning (1991) included
argument as an independent evaluation criterion when constructing ESL communicative
writing profiles. Eight teachers in this study indicated that they considered argument to
be an important factor in the determination of essay quality, resulting in 47 comments
across all three essay sets. For example, Esther focused on an argument that remained on
the fence:
I think they're trying here to, um, they're trying to hedge their bets. It's not a
great argument, it's a bit wish-wash. It's not a great argument, not a
sophisticated argument. (Esther)

The argument is not strong, not at all. It's sitting on the fence argument. Again
the other one was saying why can't we do both, but it was sophisticated,
they're… I think they're trying to fuse the idea that these things could go
together… it's just not done in a sophisticated way… yet. (Esther)
Shelley also focused on the strength of the argument:
And it's not a general statement that's really accepted as usually true, so it's a
very weak argument with nothing to support it. (Shelley)

It gives an actual, a more objective reason for his opinion, so the first kind of
argument or the first reason is just my opinion, saying this is in my opinion. This
one has a little more objectivity to it, interests can become your career, if you
choose interest you can still be choosing a career you like, it's a stronger
argument and it uses an example. (Shelley)
Descriptor 6: There are sufficient supporting ideas and examples in this essay.
An essay's content features might also be measured through quantification. For
example, Kepner (1991) assessed the ideational quality of text by counting the number of
higher-level propositions. Similarly, Friedlander (1990) assessed the production of
content-related ideas based upon the number of specific details. In Hamp-Lyons and
Henning's (1991) rating scale, referencing was measured based on the number of
examples in an essay. Lumley's (2002) study also promoted the importance of this
assessment criterion, with raters focused on the quantity of ideas in writing even when
this was not specified in the rating scale. In this study, teachers found that sufficient ideas
and examples made the essays stronger. Seven teachers provided 24 comments on this
content feature across all three essay sets:
And just also giving more support behind your ideas because it's very minimal
so it's very brief. More support behind the ideas would be important. (George)

I'm just not sure how many examples they really use to support. (Judy)

The support, not enough support, so first reason, this person based it on their
experience, but there's no reasons or details, so it would've been helpful to say
for example, I want to be an engineer and I've studied math and science,
something like that. (Shelley)

There's really only one reason. A good one would have three reasons for some
details. (Shelley)
Descriptor 7: The supporting ideas and examples in this essay are logical and
appropriate.
Logical and appropriate ideas and examples have long been considered
important evaluation criteria in ESL academic writing (cf., Brown & Bailey, 1984).
According to Witte (1983b), low-quality text does not provide appropriate elaboration on
a topic and usually requires readers to infer the intended meaning. The teachers in this
study also focused on the extent to which writers presented logical or appropriate ideas
and examples, providing 64 comments across all three essay sets. As their verbal reports
demonstrate, they tended to point out illogical ideas or examples:
My biggest problem with it for me is that it's very illogical. They've also
supported what they've said with some really interesting details, but it isn't
logical to say that individual means that we're alone and we don't require other
people. That's not what individual means. So, a basic lack of logic. (Shelley)
The evolving society thing to me makes no sense either. One does not need
anyone to deliver mail, we have mail delivery every day. Has learned how to
make an argument but hasn't thought about how to make the argument logical,
but I wouldn't even write that on the paper until I talked to the writer and said
'tell me more about what you mean.' I wouldn't want to judge it harshly but to
me it's not logical. (Shelley)

For example, where was it?... the one about the bank accounts, we all have our
own bank accounts so that means we don't really need people, well… we do
kind of need people to help us build our bank accounts. Each of the points has
something like that. That's a little bit weak, but you know when you're writing
an essay in 30 minutes, that's to be expected. I wouldn't mark off for that really.
(Sarah)
There were also cases in which supporting ideas and examples were barely
connected to an essay's central question. James pointed out that a writer began well by
answering a given question, but deviated from the focus of the writing when providing
relevant support in the body of the paragraph:

So the person is clearly taking a stand in answering the question that the
importance of cooperation is um, more necessary today than ever, say that, the
adage of no man is an island is even more true today than ever and they break it
into two examples, a successful example of cooperation and a not so successful
example of cooperation, but these two examples don't really help to answer the
question of whether cooperation is more important today than it was in the past,
both examples are historical, they're about historical topics like the cold war and
the war in Iraq but they're not really… there's no comparison of past and present
in them, there's no sense of why cooperation is more important today than it
would've been back in some other time. The person starts to get that kind of an
idea in their conclusion so they start to say things like, um, because of economic
globalization, the interconnection of the economy, trade-banking and services
are all connected so today more than ever, cooperation is necessary but then the
two examples, everything in the body doesn't really connect to that. (James)
Descriptor 8: The supporting ideas and examples in this essay are specific and
detailed.
The presentation of specific details was also an important criterion that
determined the content-related aspects of an essay. Although seven teachers provided
only 21 comments about the necessity of concrete examples, detailed supporting ideas
and examples strengthened the writers' arguments and improved reading comprehension.
The teachers' think-aloud verbal reports describe this point well:

Again, he hasn't given specific reasons and examples. (Ann)
He provides a pretty sophisticated example that does support the topic. (George)

What's in a sense good about is that they are using some good examples.
They're giving concrete examples of what they're trying to prove. (Tim)

Um, one reason… gives the reason, gives an example with Toyota, another
reason. (Shelley)
Descriptor 9: The ideas are organized into paragraphs and include an
introduction, a body, and a conclusion.
It was evident that teachers in this study focused heavily on the overall design of
written text. All of the teachers provided a total of 93 comments across all three essay
sets, judging whether an essay followed the formal structure of an introduction, a body,
and a conclusion. Indeed, the ability to organize ideas into paragraphs was the second
most frequently mentioned evaluation criterion. Beth and Shelley commented that they
paid particular attention to whether writers were able to outline their ideas using a well-formulated
paragraph structure:

My feeling is they have the ideas, but they haven't been able to sort of organize
and have a beginning, a middle, and an end. (Beth)

In terms of organization, she put all her reasons, instead of introducing them, she
did all the explanations she does for them in the opening paragraphs. She needs
to learn that the introduction just introduces the reason that she has. I think this is
probably a new paragraph and I have a sister because she's…, hit enter there.
(Shelley)
In addition, Ann and Esther pointed out that less-skilled writers sometimes used
too many paragraphs or none at all:
What he has done, um, totally over-paragraphed, he has no idea of how to group
like ideas. (Ann)

So on this one the first thing I noticed was that there's no paragraphing. That's
just kind of a global thing as I glanced down, oh okay, this looks like a lot within
the paragraph. The first thing I'll probably do is start to read through the whole
thing so I get a global sense of it. Again no paragraphing. (Esther)
Descriptor 10: Each paragraph is complete, with a clear topic sentence tied to
its supporting sentences.
According to Scardamalia and Bereiter (1987), a proficient writer makes good
use of main ideas to guide and structure their writing processes. An advanced writer
employs a topic sentence skilfully to present main ideas or claims, previewing what the
supporting sentences will be like (Fournier, 2003). Seven teachers in this study provided
34 comments on this topic. Shelley and Sarah positively evaluated essays that included
topic sentences along with their supporting ideas:
Within each paragraph, it has an excellent topic sentence. This information
within each paragraph relates well to the topic sentence of the paragraph.
Everything in capitalism is about capitalism, everything in inherent human
nature is about inherent human nature. As a writer it flows fairly well, gives an
argument, an example in paragraph two, so capitalism is this, capitalist society
this, the word individual means this, then does the example. (Shelley)
And each body sentence has a topic sentence, um, that's supported with
supporting ideas, so organization is good. (Sarah)
Teachers also felt that reading comprehension was more difficult when essays
lacked topic sentences. Beth's comment exemplifies this well:

Okay, so again, hard to follow, lack of topic sentences. Very… weak
introduction, um, I think my greatest problem with this one is lack of
organization, I don't have topic sentence. I can't determine the supporting
sentences. (Beth)
Descriptor 11: Each paragraph presents one distinct and unified idea in a
coherent way.
In Jacobs et al.'s (1981) rating scale, a well-formulated paragraph contains a
single main idea presented in a coherent way, with each paragraph distinguished
conceptually from the others. If a paragraph contains more than one idea or the idea is
not distinct from those in other paragraphs, it will ruin coherence at the paragraph level.
Formulating cohesive paragraphs was also a concern in this study, with seven teachers
providing 28 comments across all three essay sets. Ann noted a case in which multiple
ideas were presented without enough links in a single paragraph:

Again he's…, there's no cohesion. Within this paragraph, the sentences started
off with the sister, not getting a job, now he's into being useful to your country,
and pay your life…, referring back to killing in the first paragraph. (Ann)
A single idea stretched across two paragraphs was considered confusing. Both
Sarah and James pointed out this problem:
Um, I would say, nowadays, love is different, having a wife and how you get her.
Um, with the technology you can go on websites to meet people. Maybe this
should be part of the first paragraph, if it wasn't set off on its own then it would
seem cohesive because the previous sentence, when he finds her he would need
to talk to the parents and the parents would have to see if he is convenient for the
girl. It's continuing along the same subject, so I suppose it is cohesive if it was
together but it's confusing because it's a new paragraph and I expect to see a
topic sentence and that's not one. (Sarah)
So each of the two body paragraphs doesn't really have a main idea to it either
because the second one the person will start talking again about being a soccer
player, so in the first paragraph they start it by talking about their dream to be a
soccer player. The second paragraph they continue talking with themselves as an
example being a soccer player. If I spend a lot of time looking at this, I can
probably figure out the difference between the two body paragraphs was, but it
doesn't immediately stand out. So, that's part of my impression that it's not
organized or not coherent, or based on a single main idea. (James)
Descriptor 12: Each paragraph links well to the rest of the essay.
In order to achieve coherence, each paragraph must relate logically to preceding
and successive paragraphs (Jacobs et al., 1981). Although not many teachers in this study
noted the issue of coherence between paragraphs, it did draw some attention: five
teachers commented across two essay sets, for a total frequency of 18. Ann pointed out
one case in which paragraphs were discrete, with no links between them:

His…there's no links between paragraphs, they're discrete, first second third…
(Ann)

Teachers commented positively when paragraphs were connected well. Sarah's
report illustrates this point well:

Okay, he's alluding back to the points he's made in the preceding paragraphs,
which is good, and simplifying them. Rephrasing, allows individuals to get on
with their lives. Good links. (Sarah)
Descriptor 13: Ideas are developed or expanded throughout each paragraph.
Idea or thesis development has long been regarded as an important criterion by
which to assess ESL writing. Brown and Bailey (1984) referred to it as logical
development of ideas, and considered it a subscale of their ESL academic writing scale.
Similarly, in their rating scale, Jacobs et al. (1981) noted that a well-written essay
develops and expands a thesis or main ideas into a paragraph unit to convey a sense of
completeness. In this study, seven teachers provided 58 comments on idea development
across all three essay sets. George commented that idea development was likely to be
associated with the length of writing; he specifically indicated 'little writing' and
'volume of writing' in relation to paragraph development:
Okay, what I can see here is there's little writing, so because I do look at the
volume of writing that one can do in 30 minutes, seeing it's a few sentences
broken in very small paragraphs suggests to me there's some issue with
proficiency level in the way it's written, because paragraphs don't usually just
have two sentences, but they need more to expand the writing or expand the
paragraph development because the paragraph development is really weak.
(George)

He needs to go. I can see the gems of it, people do not need to talk to each other,
you can see where he's going, but he just hasn't expanded it enough. (Ann)
Esther commented that linguistic resources are an essential tool to expand a
writer's argument:

They don't know how to develop their argument at all, they're repeating the
same sentence over and over again, there's a hint to me that they want to talk
about business and they want to compare it to the past… the way we can
cooperate in the past and future… but they don't have the language to do it and
they don't know how to develop it. (Esther)
Descriptor 14: Ideas reflect the central focus of the essay, without digressing.
Sperber and Wilson (1986) suggested that coherence is affected by the extent to
which information relevant to a particular context is provided. Similarly, Fischer (1984)
argued that pertinence is an index that determines the overall impression of an essay in
foreign language writing. The importance of keeping an essay focused was also
identified in this study. Although only three teachers provided a total of 15 comments on
the issue of digression across two essay sets, loss of focus in essays was considered
problematic.23 The excerpts below illustrate the teachers' thought processes on
digression:
All the sudden he's talking about insuring successful business, so we're kind of
losing focus. (Beth)

Sometimes there's digression, whereas in English we like the writing to be more
concise, this, I see, is a little too…it digresses into his personal experience which
is interesting but it doesn't sort of, keep the focus. It loses the focus a little bit.
(George)
23 The low frequency in this category could be attributed to the fact that the concept of digression overlaps
with other coherence features. This issue is revisited when ESL academic writing experts' reviews are
discussed in the next section.
That is why the FBI and CIA…there's a bit of digression in this thinking. A lot of
focus on the Second World War… yeah, and then really digressing completely
from the topic… (George)
Descriptor 15: Transition devices are used effectively.
Despite heated debate on the relationship between cohesive devices and overall
writing proficiency (e.g., Evola, Mamer, & Lentz, 1980; Grabe & Kaplan, 1996), the use
of transition devices is considered critical in written discourse. Transition devices are
words or phrases that bridge a thought from one sentence to another or from one
paragraph to another. Good transitions connect ideas smoothly, with no abrupt jumps or
breaks, helping to create a unified piece of writing. Several types of transition devices are
used to move readers in the writer's intended direction, including (a) example, (b)
addition, (c) emphasis, (d) comparison, (e) contrast, and (f) cause and effect. Eight
teachers in this study noted the use of transition devices, providing 56 comments across
all three essay sets. Tim, Ann, and George commented that the correct use of transition
devices signalled subsequent ideas effectively:
But at least they use 'thus,' so they're letting us, telegraphing to us that they're
now giving us the thesis. This is what they agree with it and what they're going
to prove in this essay is that is true, because they agree with that. (Tim)

As I said, he starts the last finally, where he's going to summarize and he's tried
to link it back to the points he's made. (Ann)

Then the framing of the second paragraph to begin with is great in the second
place, to sum up, so that shows the reader the different steps in the argument,
first, second, and the conclusion. (George)
Beth noted that when appropriate transition devices were not used, it was
difficult for readers to follow the text:
It's interesting because… first paragraph, personal reference, and then reference
to others, so that's okay… second paragraph is just personal reference… my
feeling is the first and second body paragraphs are almost contrasts but not able
to use that contrast, here I am, maybe that's why I'm saying it's too simple to
begin with, although I gave up my dream to be a soccer player, I still believe that
the answer should be in the first place, people should also consider about their
future job. I guess what I'd try to look for is on the other hand because they're
discussing two choices, making the comparison between the two potential
choices but comparison isn't coming out, it's a lack of grammar, that might just
be opinion based but it helps cohesion to see someone is actually making points
about the first choice, and making points about the second choice and then
maybe making their own conclusion, but that transition is inevident, so it's
harder to read. (Beth)
Descriptor 16: Syntactic variety is demonstrated in this essay.
Few existing rating scales exclusively measure syntactic variety in ESL writing.
Although the ETS (2007) writing rubrics and Jacobs et al.'s (1981) analytic scale
describe syntactic variety and varied sentence types as characteristics of good writing,
they are not assessed as an independent textual quality. This may be because syntactic
variety is interrelated with other syntactic features and is often satisfied automatically
once a writer achieves overall syntactic effectiveness in written discourse; syntactic
variety might therefore be co-constructed through the interplay of other essay
characteristics. In this study, only four teachers provided six comments on the extent to
which a writer demonstrated structural flexibility in the composition. The excerpts below
present the teachers' perceptions of syntactic variation:
I'm not seeing a lot of demonstration of a variety of syntax and grammar. (Beth)

So yeah there's some… there's a pretty good variety, um, there's no passive
structures, but they don't need to be necessarily. It's not a problem that I would
notice right away. (James)
Descriptor 17: Complex sentences are used effectively.
Although the validity of complexity measures has often been questioned (Polio,
2001),24 grammatical complexity is regarded as a critical measure in SLA studies that
determine the quality of ESL writing. It generally encompasses multiple dimensions of
variation, density, and sophistication, judging the presence of specific grammatical
features such as coordinate clauses, independent clauses, and dependent clauses. Fischer
(1984) referred to grammatical complexity as syntactic complexity in his written
communication rating subscale. The IELTS writing rubrics also assess complexity in
their grammatical range and accuracy subscale.
The teachers in this study identified the need to assess a writer's ability to use
complex sentence structures effectively, with six teachers providing 53 comments across

24 As Polio correctly pointed out, the ability to produce complex sentences does not necessarily imply
high proficiency in writing, since essays composed of overly complex sentences are not always good essays.
By the same token, more proficient learners who experiment with newly-acquired, complex linguistic
features can make more errors than less proficient learners (Fulcher, 1996b).
all three essay sets. Ann and George focused on sentence structure sophistication; George
in particular assumed that students' appropriate use of conjunctions and connecting
phrases was a hallmark of advanced writing skills:

The first reading I thought, my goodness, but then I looked at it, it had a lot of
complicated structures. They don't always come off but certainly they're used
appropriately, and they're very complicated. He's really trying for some
sophisticated language. (Ann)

Yeah, oh yeah, the writing is quite advanced because they have the conjunctions,
the connecting phrases that make the writing flow nicely, and sophisticated
sentence structure, you know, compound sentences. (George)
Tim noted a case in which the writer did not know how to combine ideas spread
across several simple sentences into one single complex sentence. He recommended
that the writer use a relative clause:

I have a sister. She older than me. She finish her school now. Should make that a
relative clause, and could've made one sentence out of those first three. (Tim)
Beth described a case in which the writer failed to create a complex sentence due
to an inability to distinguish a dependent clause from an independent clause:

When you study subjects interesting to you, a dependent clause, inability to
distinguish between dependent and independent clauses, that's what I refer to
them as complex sentences. Inability to properly form complex sentences. (Beth)
Descriptor 18: Normal word order is followed except in cases of special
emphasis.
Different languages have different word order systems. For example, the Korean
and Japanese languages follow an SOV (subject-object-verb) system, whereas Arabic and
Hebrew follow a VSO (verb-subject-object) system. Word order in English follows the
SVO (subject-verb-object) rule, which is fixed at the sentence level (Celce-Murcia &
Larsen-Freeman, 1999). English word order is considered relatively easy to teach
compared to other English grammar rules. In this study, six teachers provided 19
comments on the ways in which writers ordered their words. Specifically, Ann and
Esther discussed the basic word order in English:
Word order, it's not a serious problem but there's several instances. (Ann)
Some of the word order is okay, like the subject verb is in order. There's a
subject and a predicate, but not always. Not always. (Esther)
George focused specifically on the writer's word order within a phrase:
A little word order here with adjective placement, so one of Mexico's top
colleges, sort of focusing on adjective order is important. (George)
Esther commented on the influence of first language on word order, noting that
one writer was definitely translating L1 text to English with no awareness of the English
word order system:
Hah…, I would explore with the student, word order. I think such thing is now
both, both now and past, exist. Yeah. I wonder… I wonder if…, again I would
know more when I knew the student but I wonder if the word order… in their
mother tongue, I'd want to explore that with them. They're translating. I'd go
back but I'm wondering…, they're certainly translating. (Esther)
Descriptor 19: Sentences are well-formed and complete, and are not missing
necessary components.
When sentences do not contain all of their necessary constituents, they are called
sentence fragments. Although sentence fragments may be punctuated or capitalized like a
sentence, they are technically phrases or clauses. According to Hinkel (2004), separated
adverb clauses or prepositional phrases are the most common types of sentence
fragments found in student writing. All of the teachers in this study noted sentence
fragments, providing 62 comments across all three essay sets. Shelley reported a case in
which a sentence began with a conjunction:
Started with a conjunction so it made an incomplete sentence, needs to change it
to however, again another conjunction, so it's technically a sentence fragment,
but the fragment is created from starting a sentence with a conjunction. So I
would explain it that way, tell them that they could use however, and that would
make the sentence accurate, or make this clause part of the previous sentence.
(Shelley)
Another type of sentence fragment that occurred frequently in the essays was a
missing subject or verb. Ann and Sarah pointed out this problem:
One recurrent error seems to be the lack of a subject in phrases. 'It' is important,
um, my country, 'we' do not have courses. (Ann)
Well here the student forgot a verb, one reason why I think the ability to
cooperate well with others 'is' more important today, forgot the word 'is'. Just
wrote others more important today, so that's an issue. (Sarah)
Descriptor 20: Independent clauses are joined properly, using a conjunction and
punctuation, with no run-on sentences or comma splices.
At the sentence level, the two most common grammatical errors are run-on
sentences and comma splices. Run-on sentences occur when two or more independent
clauses are joined with no punctuation or conjunction, whereas comma splices occur
when two independent clauses are joined with a comma but lack a coordinating
conjunction. All of the teachers in this study noted run-on sentences and comma splices,
providing 41 comments across all three essay sets. George felt that readers could become
lost in run-on sentences:
What I probably suggest is that there was that bit of a run-on sentence in that
second paragraph, and this is where editing comes in to keep the focus clear,
once the reader starts getting lost in the writing in the form, then the message
disappears. So what I'd recommend, the writer put a colon here, anywhere, with
semi-colons and colons. I'll often see if there's a way to make it into two
sentences, make it more concise and efficient, and clear. (George)
Tim commented that too many thoughts in one sentence without appropriate
connectors interfered with reading comprehension:
Run-on sentences. Um, comma splices… they lack the ability to cut things short,
to get to the point. They're stringing too many thoughts together so it's really
hard to figure out what they're saying. (Tim)
That sentence also is a comma splice, trying to join two thoughts together with a
comma, which is not possible. They either need to create a new sentence, two
sentences or join it with a coordinator. (Tim)
Descriptor 21: Major grammatical or linguistic errors impede comprehension.
Hendrickson (1980) defined a global error as "a communicative error that causes
a proficient speaker of a foreign language either to misinterpret an oral or written
message or to consider the message incomprehensible within the textual context of the
error", and a local error as "a linguistic error that makes a form or structure in a sentence
appear awkward, but, nevertheless, causes a proficient speaker of a foreign language
little or no difficulty in understanding the intended meaning of a sentence, given its
contextual framework" (p. 159). The teachers in this study commented that global errors
tended to obscure meaning, whereas local errors did not interfere with their
comprehension. Eight teachers provided 42 comments on this grammatical feature across
all three essay sets. The excerpts below present the teachers' perceptions of global and
local errors:
Okay. Yeah, very weak grammatically in that it interferes with my
comprehension, the country that elects this kind of cars. (Ann)
Um, okay, so again a few grammatical errors do not infringe on meaning but
causes the reader to have to pause and read again, for example the first sentence
of the fifth paragraph, a good example would be the current Iraqi situation of
political chaos where the presented Iraqi government council is supposed to be
representing, minor but… (Beth)
Unfortunately, the grammar errors are obscuring the comprehension of that.
(Tim)
Descriptor 22: Verb tenses are used appropriately.
The English verb tense system seems insurmountably complex from a cross-
linguistic perspective, and it requires considerable effort for ESL learners to master the
12 tense-aspect combinations. Celce-Murcia and Larsen-Freeman (1999) noted that the
use of tense-aspect-modality (TAM) can be fully grasped “only when we consider their
discourse-pragmatic and interactional features as well as their formal and semantic
features. The challenge of the English TAM system...is on use” (p. 174). The complex
nature of the English verb tense system is also found in Vaughan (1991). She noted that
raters consistently focused on problems with verb tense in ESL student writing, making it
the third most-frequently mentioned evaluation criterion. The teachers in this study were
also concerned about verb tense issues. Nine teachers provided 84 comments across all
three essay sets, making verb tense the third most-frequently mentioned evaluation
criterion among the 39 descriptors. As the teachers' reports demonstrate, some writers
had difficulty using a consistent verb tense to indicate one time frame. James and Judy
described this point well:
Um, then there's some grammar things, sometimes a present, past tense, when I
was a child, my dream 'is' to be a soccer player. (James)
Last paragraph, for example, I was interested in chemistry university, and I finish
the bachelor with very good grades…, finish'ed' because it happened in the past,
university took place before. (Judy)
When a writer inappropriately expressed time references in describing
incidents that happened in different time frames, it seriously impeded the readers'
comprehension. Beth and Tim made this point well:
Um, attempt to use time phrases and cause and effect clause but not able to use
them to translate meaning, um, the war was still happen in the USA. So again
time references aren't correct and therefore the use of verb tenses are not correct,
and that example is, that means in the past two weeks, the war was still happen,
so in the past two weeks. This isn't about the past two weeks, and the war still
happen, again, incorrect use of grammar tense, jumping from grammar tenses,
jumping back and forth, inability to follow storylines, which is what they seem
to be trying to do here. That's, Thousand people might be unluckily. That sounds
horrendous. But it is true. And this, I'm assuming, is a reference to the British
war, so I'm assuming it's a reference to the past but an inability to express the
past, so inability to express ideas in the proper time is a great weakness. (Beth)
Sure, the first sentence, entirely from this thing, I think both the present and past.
We're talking about two time frames here. We need a verb that's going to agree
with both of them. We need a present and a past tense. (Tim)
Descriptor 23: There is agreement between subject and verb.
Subject-verb number agreement is a relatively easy concept for English learners
to master, although there are some exceptions to the rules. Because the agreement rule is
so straightforward, violations of it rarely impeded reading comprehension. Seven
teachers provided 64 comments across all three essay sets. As the excerpts
below show, most errors occurred when writers attempted to create third-person singular
present verbs:
That is why technology exist's', subject verb agreement, but from a content
perspective, I know what his opinion is and I'm assuming what's going to follow.
(Ann)
So that is a good sense of the importance of organization but there are some
grammatical issues, subject verb agreement, someone totally do'es' not pay
attention… If someone 'is' very interested, so again someone being singular,
seems a bit of a problem, the student has to know that someone is singular and
she/he decide's' so again subject verb agreement, so we have grammatical
accuracy issues, subject verb agreements. (George)
Again in the first paragraph, the interaction between human beings have
increased should be has increased, person hasn't recognized…, subject verb
agreement is the word interaction, not the word means, a common error, a noun
followed by a prepositional phrase, which ends in a plural, people sometimes
assume that the word closest to the verb is going to be the one that makes the
agreement but in this case it's not because the interaction makes the noun there.
(Tim)
Descriptor 24: Singular and plural nouns are used appropriately.
English nouns have two different forms: singular and plural. Uncountable nouns
take a singular form, whereas countable nouns take both singular and plural forms. As
with other local grammar errors, misuse of singular and plural noun forms did not cause
serious misinterpretations of text. While it could be a recurrent problem in student
writing, teachers were usually able to guess the writer's intended meaning. In this study,
seven teachers provided 40 comments on this topic across all three essay sets. The
teachers' thoughts on this local grammar feature are presented below:
… due to low transportation, American had transportation, Americans, plurals.
(George)
… finished the bachelor degree with very good grade's', plural for grades (Judy)
So the ability to cooperate well with others is generally necessary in everyday
lives, there's no need for a plural there. If you're going to say that in peoples'
everyday life, if they have put that as the possessive adjective, then they could've
used lives, but because of the way they've written, it has to be a singular. (Tim)
Started out really well, but by saying I consider in today's world the ability to
cooperate well with other… should've been others. (Shelley)
Descriptor 25: Prepositions are used appropriately.
Prepositions can have multiple meanings in different instances, and those
meanings are constructed in different ways (Celce-Murcia & Larsen-Freeman, 1999).
Taylor (1993) referred to this aspect of prepositions as “polysemous”. Their polysemous
quality renders prepositions difficult for even advanced ESL learners to master, although
their function is sometimes quite basic. Prepositions can also be combined with other
lexical units such as nouns, adjectives, and verbs to form a particular meaning. In this
study, nine teachers provided 44 comments across all three essay sets. As Tim noted,
students sometimes did not know the accurate meanings of “in” and “on”:
Problem with preposition. It's 'in' the labor market, not 'on' the labor market.
Prepositions are often difficult. It makes a huge difference because as soon as
you use the wrong one, you know something is quite off there. (Tim)
In Shelley's report, one writer combined two prepositions inappropriately, so
that neither functioned as intended:
In behind, you've got two prepositions here, and probably neither of them works.
(Shelley)
In some cases, the writer did not know which prepositions should be paired with
particular nouns, adjectives, and verbs. Ann's comments illustrate this point well:
When we graduated high school, any interests 'on,' it's more like, um, but
interest 'in,' the combinations rather than just prepositions in general, graduated
'from' high school. (Ann)
Descriptor 26: Articles are used appropriately.
The definite article “the” and indefinite articles “a” and “an” are part of an
English reference and determination system that is notoriously difficult for students with
non-English backgrounds to master (Celce-Murcia & Larsen-Freeman, 1999). For this
reason, some researchers have argued that English articles are unteachable (Dulay, Burt,
& Krashen, 1982). Seven teachers in this study provided 52 comments on article use
across all three essay sets. Ann and Judy discussed it as follows:
He stated his argument again, problem with articles, both the definite and
indefinite, but he stayed in his premise, he said. Then he goes on to say I will
extend my article in three points. Again, articles. (Ann)
But if I choose a subject related to job or career, I'll not interest in job at the
time, of course I'll be happy to get the job directly, but in future I'll try to change
the area of work or I can develop the area of work or because not interest… um,
there's a missing article, if I choose the subject related to 'a' job or 'a' career.
(Judy)
Sarah commented that a misused article did not cause a major comprehension
problem:
There are only a few issues with um, articles using 'the' when it's not necessary
but even that mistake is not huge because it doesn't impede communication and I
still clearly understand what the writer means. (Sarah)
Descriptor 27: Anaphora (i.e., pronouns) reflects appropriate referents.
Anaphora is a linguistic expression that refers back to another expression in
discourse; the expression it points back to is its antecedent. In English, an anaphora
typically takes pronoun form. Although the reference and pronoun system in English is
quite straightforward, misuse of an anaphora or omission of antecedents can make it
difficult for readers to understand a writer's intent. In this study, all of the teachers noted
the appropriate use of referencing, providing 51 comments across all three essay sets.
Esther pointed out that it was not clear what the pronoun "it" referred to:
Because I regret it, we don't know what it is and we don't know of course if they
regret answering the question or whether they regret that they didn't study
subjects they were interested in, so there's no reference. (Esther)
There was an occasional lack of consistency between pronouns. For example,
Ann and Tim commented that it was confusing when writers used multiple different
pronouns to refer to the same referent:
It's, um, consistency, like he'll talk about the people and then say he, or um, I
can't find the other one. There were a couple of other examples… I can't find it.
I'll just study economics even though he didn't have any interest on them. I mean,
that's hard to tell. (Ann)
Here we have a problem with this pronoun agreement, because he's talking
about studying what we're interested in. That's a singular, then when he goes
into the next one, he certainly brings the pronoun them, which what does it refer
back to? He doesn't give us anything. He is agreeing with that, then he goes on
to finish it, with an it. So we have started with a singular subject. We've
switched to a plural pronoun and then to a singular pronoun which totally
confuses us, because we don't know what the them refers to. The it for the study
perhaps but the them throws it off. (Tim)
Descriptor 28: Conditional verb forms are used appropriately.
As with the tense-aspect system, it is difficult for English learners to have a full
syntactic and semantic grasp of conditional sentences. This could be because conditional
sentences consist of two clauses and imply three different kinds of semantic relationships:
(a) factual conditional relationships, (b) future conditional relationships, and (c)
imaginative conditional relationships (Celce-Murcia & Larsen-Freeman, 1999). Although
frequency in this category was quite low (five teachers provided nine comments),25
teachers in this study pointed out the conditional verb errors that writers committed.
The teachers‟ perceptions of conditional verbs are as follows:
And again his verb sequencing there, he needs some sort of conditional, they
would, they could, they should. (Ann)
25 The low frequency might be because the knowledge of tense-aspect subsumes that of the conditional
verb. This issue is revisited when the ESL academic writing experts' reviews are discussed in the next
section.
A lot of modals are being used, um, and you know the conditional is being used.
(George)
Instead of I will not interest, I would… conditionals should be in there, I would
not be interested in that job. (Judy)
They could give up the job, a lot of this could thing, the use of could, um, that
conditional comes out in a lot of… especially in Korean writing and getting to
understand why you would not use could. (Shelley)
Descriptor 29: Sophisticated or advanced vocabulary is used.
Read (2000) defined word sophistication as “a selection of low-frequency words
that are appropriate to the topic and style of the writing, rather than just general,
everyday vocabulary,” which included “technical terms and jargon as well as the kind of
uncommon words that allow writers to express their meanings in a precise and
sophisticated manner” (p. 200). In general, word sophistication is measured as the
proportion of advanced words appearing in the text; however, there is a problem of
subjectivity with regard to what is actually considered advanced (Laufer & Nation, 1995).
The teachers in this study noted word sophistication in student writing, providing 48
comments across all three essay sets. As Beth's verbal report shows, teachers gave
positive evaluations to essays written using sophisticated words:
I like the attempt at using more sophisticated vocabulary, so, utterly, delved,
dazzling, um, in this regard, um, transparency. (Beth)
When writers used unsophisticated words, the teachers pointed them out
immediately. George and Shelley mentioned that general words such as "things," "bad,"
and "good" were too vague to effectively convey the intended meaning:
Instead of two things because things is so general, state specifically what you're
talking of and use phrases from the prompt. Because it just makes it more clear,
using a word like thing is so vague. (George)
Not sophisticated, probably you will be a bad professional, you will be a good
professional if you… we don't know what they mean by that, it's too general,
that to me is middle school, what would a bad professional be, what do you
mean by that. Not sophisticated. Needs a lot more depth but I think there's an
effort to go above the high school level. (Shelley)
I can hear like an 18 year old saying this stuff. It's not extremely sophisticated or
academic. (Judy)
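The proportion measure Read describes, and the subjectivity problem Laufer and Nation (1995) raise, can be illustrated with a short sketch. This is not part of the EDD study's procedure; the "advanced" word list below is purely hypothetical (echoing words Beth singled out), and supplying that list is exactly where the subjectivity enters:

```python
def sophistication_ratio(text, advanced_words):
    """Proportion of tokens that appear on an analyst-supplied list of
    'advanced' words, in the spirit of a lexical frequency profile."""
    # Deliberately naive tokenizer: split on whitespace, strip punctuation.
    tokens = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in advanced_words for t in tokens) / len(tokens)

# Hypothetical word list; a real analysis would use a frequency-based one.
advanced = {"utterly", "delved", "dazzling", "transparency"}
sample = "The dazzling argument utterly delved into transparency."
print(sophistication_ratio(sample, advanced))  # 4 advanced tokens out of 7
```

Changing the word list changes the score, which is the subjectivity problem in miniature: what counts as "advanced" is decided before any text is measured.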
Descriptor 30: A wide range of vocabulary is used, with minimal repetition.
Word variety refers to the degree to which a writer presents a range of diverse
words, making skillful use of synonyms, superordinates, and other related expressions
rather than repeating the same words within a limited range (Read, 2000). According to
Linnarud (1986) and McClure (1991), the measure of word variety is calculated as the
proportion of different words to the total number of words in the composition. Compared
with word sophistication, fewer teachers in this study focused on word variety: six
teachers provided 16 comments across all three essay sets. As their verbal reports
illustrate, most teachers noticed when writers repeated words:
He‟s not using any substitutions, he‟s not saying „work together‟ or „participate,‟
or anything to replace „cooperate.‟ (Ann)
A lot of repetition, coordination, coordination, coordination, and so often that
bounces off the page quickly. (Beth)
What it looks like to me is, you‟ve got interest, interested, and interesting. What
this person is trying to do, is to use all those words to make his essay interesting,
but has failed because it just goes on and on. Three words the same in one
sentence, what is the point? (Tim)
I think the choice of the word drive is interesting. Talk about being driven…, it‟s
not something most students are being aware of, being able to use that way. I
think this person understands what they‟re doing. But I wish they had used
another word or explained what they meant by that, just because they used the
word drive over and over. It‟s not like okay, because driven is very strong so I‟m
not convinced that they know what they meant, they needed to put some variety
in there to make the idea more clear. (Shelley)
Descriptor 31: The meaning of vocabulary is understood correctly and used in
the appropriate context.
Even when writers use a wide range of sophisticated words in their writing, word
choice that imposes an incorrect semantic meaning can be problematic. As Laufer and
Nation (1995) rightly argued, the issue is not a rich vocabulary but a well-used rich
vocabulary that has a positive impact on the written text. In this study, eight teachers
provided 53 comments on word choice across the three essay sets. In one case, a writer
used a word that was semantically unrelated to the intended meaning:
Hence it makes little sense to study extravagant subjects, I mean it starts off
good but extravagant, it is an inappropriate adjective. Doesn't tell me anything?
It's totally out of place. Extravagant, we talk about material things. (Tim)
There were also cases in which a writer used a word that was semantically close
to the intended meaning but did not convey it accurately:
The vocabulary is very good, it's got a lot of good phraseology, um, good choice
of verbs… cut and dried question, ability to inspire, ignite parents' debate. Um,
my problem is, he seems to be throwing them in without quite knowing what
they mean. So it looks very impressive at first, then you think, 'well…, wait a
minute, what does that actually mean?' So, I think the vocabulary, um, the good
phrases there, covers up problems with accuracy. (Ann)
This to me sounds like a student who's learned lists, okay, one there, one there,
without knowing the exact meaning of the vocabulary. Inspire students is good,
but inspire students' concern does not follow. A cut and dried question. No. No
matter knowledge or skills, there are obstacles awaiting us. As I said I think
reading, giving him substitution exercises where he has to put them in would
help. (Beth)
So I will bolster doesn't really work in this context, and it made it hard to figure
out what the opinion was. (Shelley)
Descriptor 32: The essay demonstrates facility with collocations, and does not
contain unnatural word-by-word translations.
Although collocations are notoriously difficult to define and different definitions
abound in the literature (Leśniewska, 2006), they are generally understood as word
combinations that occur more often than would be expected by chance. The
restrictiveness of collocation rules makes it challenging for ESL writers to use them
appropriately in written discourse. Indeed, Waller (1993, as cited in Leśniewska, 2006)
considered collocations the language feature that stigmatized “a foreign accent in writing.”
In this study, seven teachers provided 25 comments on the use of collocations across all
three essay sets. Ann and Shelley were particularly sensitive to semantically general
“high-utility” verbs (Leśniewska, 2006) such as “make” and “do” and their
accompanying nouns:
Do the opposite, make a career, those are the sort of verb-and-noun collocations.
That are often, you have to know them, it's very difficult to apply rule and work
it out, but the language sounds natural as well. (Ann)
I always point them out but I don't um… word choice errors, I did the best
choice rather than made the best choice, they need to understand it's like a
collocation, you make a choice not do a choice, or verb phrases, I often put them
as verb phrases. (Shelley)
Descriptor 33: Words change their forms where necessary and appropriate.
The correct use of word form was another criterion that determined overall
writing quality. Engber (1995) identified word errors focusing on derivations, verb forms,
phonetic and semantic associations, and spelling. Derivational errors occur when a writer
is not able to discern different word groups such as nouns and verbs, and errors in
phonetic and semantic association occur when a semantically unrelated word is
improperly used by analogy with phonetic similarity (Engber, 1995). Consistent with
Vaughan's (1991) findings that morphology errors are among the most frequent, the
teachers in this study all showed great concern with word forms, providing 59 comments
across all three essay sets. In particular, the teachers were most attentive to a writer's
knowledge of word groups:
Capitalism, capitalist, he knows his word groups. (Ann)
And make their own choose, here we have sort of a word form issue. (George)
… for instance I choiced my study, when your parents choiced your career,
that's actually the wrong form of the word. (Shelley)
Word form, for example, when any company needs to open any manufacture, the
word manufacture, maybe the students mean manufacturing company, some
other form but not that. Seems to me he's done that kind of mistake before, or
error, I don't know. Finally we can see safety, healthy, world, and happy life…
I'm not sure if he meant, we can see a safe and healthy world, but I'm not sure.
(Sarah)
Descriptor 34: Words are spelled correctly.
Despite its discrete and superficial nature, spelling was the most frequently-
mentioned evaluation criterion among the 39 descriptors. The teachers consistently
pointed out spelling problems, providing 104 comments across all three essay sets. As
the teachers' verbal reports illustrate, serious spelling errors often obscured the writer's
intended meaning to a great extent:
Again, the spelling totally threw me off there. (Ann)
Um, the one word I don't understand is dein or dyin themself. I think they mean
deny…, that is simply because in modern society, people deny themself…, oh,
sorry, it's define. The 'f' is missing. Then 'self' should be 'selves,' to a great
extend is wrong, it should be 'extent.' (Judy)
This is almost impossible to read because there's numerous spelling mistakes.
(Tim)
There're some spelling problems like the word example is spelled wrong. (Sarah)
Descriptor 35: Punctuation marks are used correctly.
Punctuation marks such as commas, full stops (periods), apostrophes, and colons
are the linguistic symbols that separate words into phrases, clauses, and sentences in
order to clarify meaning. Contrary to Milanovic et al.'s (1996) findings that punctuation
was of little interest to the raters on the First Certificate in English (FCE) examination,
all of the teachers in this study paid a great deal of attention to punctuation in written text,
for a total of 66 comments across all three essay sets:
Punctuation is a bit iffy as well. You don't need commas in places where he's
putting them. (Ann)
Yeah, with the help of the internet, period, we can communicate with each other
easily, period, but before the internet, period, people contact with others only by
letters. So, inability to use punctuation properly which then does impact meaning.
In my opinion, quite basic and leads to misunderstanding, improper use of
punctuation… (Beth)
One thing that puts me off right away is that this person has an annoying habit of
putting a comma. It doesn't really…, and you know they don't put a space
between the end of the sentence and the beginning of the next sentence. (James)
Punctuation is all over the place, either they're not sure they must use a comma
or suppose to use a period, or they think it's the same thing. That's really critical.
Punctuation is critical. Comma is really important piece of punctuation. If you
don't know where to use it in place of a period, like in the second paragraph, it
looks to me like, um, after the word success there, that should be a period,
starting a new sentence. But the way it's written, the capital goes before, which
beginning example which would not be a new sentence. (Tim)
Unlike other mechanical problems, punctuation use can be somewhat
complicated because of its interrelationship with syntactic structure. As Shelley and Tim
pointed out, the misuse of punctuation marks can cause problems at the sentence level:
I'd have to go back, the punctuation is fusing sentences, dependent clauses,
independent clauses but I just want to look at a couple of those to see if it's just
the punctuation, or if I ignore that punctuation, is the syntax right? (Shelley)
So I would have to go back on this one and I would have to go through these
sentences again, in order to see whether in fact, it is the punctuation that is so
bad. Of course, there are no sentences here because of how it's punctuated, but
I'd want to go back and see if there's a clear subject predicate for most of the
sentences. (Tim)
Descriptor 36: Capital letters are used appropriately.
Although it was not a major linguistic feature in the composition, capitalization
was noted by eight teachers who provided 19 comments across all three essay sets. As
James indicated, misuse of capitalization did not seriously distract from the rest of an
essay:
It's not much, even the second paragraph where he's talking about the master's,
it should be a capital. (Ann)
i prefer and i not being capitalized. (George)
the average american, american, capitalization, had left the of rest. (George)
So, I also notice there's like, you know the person doesn't capitalize I, so there's
some very basic conventions that they don't follow, um, so they should um, the
teacher giving feedback, because of such a relatively simple convention to
correct, I would say always capitalize your I. I make that a point with a student
because it's something that can be achieved rather easily. But it's not the most
important thing to be distracted about when you're marking. (James)
Descriptor 37: The essay contains appropriate indentation.
Relatively few teachers cited formatting issues, with five teachers providing 10
comments on indentation across all three essay sets. They noted that indentation would
have helped to create a better visual layout of the paragraph structure. Tim and Shelley
made this point well:
I think that's the first thing that just should be taught. Indenting every time you
have a thought, that's great but… (Tim)
I'll call this paragraph one, the first sentence. They haven't indented paragraphs
so it's hard to tell where paragraphs are. (Shelley)
Descriptor 38: The essay prompt is well-paraphrased, and is not replicated
verbatim.
Three teachers provided five comments on paraphrasing across two essay sets.
As Esther and Tim noted, some writers just replicated the prompt without rephrasing it in
their own words:
They were even able to take the prompt and they didn't just replicate the prompt.
(Esther)
I agree with the statement that the ability to cooperate well with others is far
more essential in today's world than it was in the past, of course that's taken
directly from your question. (Tim)
I think also everywhere in this society. To be specific…, okay, here we go, again
they're repeating what's in the prompt. (Tim)
Descriptor 39: Appropriate tone and register are used throughout the essay.
Researchers have suggested that writing is a productive endeavour that is
socially constructed between an individual writer and a particular context or culture.
Grabe and Kaplan (1996) incorporated genre knowledge and audience considerations
into their writing skills taxonomy to reflect the significant role of sociolinguistic
knowledge in writing. Swales (1990) also recognized the importance of genre knowledge
in academic writing. One component that constitutes sociolinguistic competence is the
knowledge of register (Bachman & Palmer, 1996). According to McCarthy (1990) and
Read (2000), register governs vocabulary choice and manifests the social dimension of
vocabulary. From an assessment perspective, Jacobs et al. (1981) and Brown and Bailey
(1984) argued that vocabulary knowledge subsumes knowledge of register, measuring
whether the vocabulary is appropriate to the audience or the tone of text.
The teachers in this study showed great interest in the appropriate use of tone
and register in written text: eight teachers provided 82 comments across all three essay
sets. As Ann pointed out, the use of first person “I” is not appropriate in academic
writing:
Um… too many 'I's, it's, um, every sentence is 'I,' 'I,' 'I,' so again tone and
register, totally inappropriate. (Ann)
In cases in which informal vocabulary was used, teachers commented that the use of
colloquialisms or casual words is not appropriate in an academic essay:
In a nutshell, it's too colloquial. Instead, to summarize, in summary, to conclude,
um, he could even, you know, to put it simply, any of those, but in a nutshell it's
too familiar. (Ann)
She wrote a lot, but writing doesn't necessarily mean anything. Yeah, she wrote
a lot and um, she also seems like somebody to me whose speaking ability is
better than writing ability, or academic writing. Because this is quite informal,
using informal…, she's writing the way she would speak to a friend. You know,
to put it simply, we need money, that's pretty informal. And she starts a next
sentence, but over the years, the business began growing, which is also informal.
You know she's telling a little story here, a narrative which isn't exactly an
appropriate, um, rhetorical structure for this type of essay. (Sarah)
George focused more on the use of punctuation marks in academic writing
conventions, particularly bullets and exclamation points:
Okay, and I would tell this student, having bullet points like this isn't really
appropriate for this register of academic writing, you need to just embed the
points in that last sentence because first of all they're brief bullet points, if they
were long it would be different. They're very brief, it looks good and also jars
the reader a little bit because they are so brief, it's not expected with this type of
writing. (George)
And also informal use… these exclamations and the um, multiple questions, I
don't mind a question or two at the beginning to interest a reader but there are
quite a few questions in it and sort of informal use like I bet you do and again
this use of 'you' in this academic context is a bit informal. (George)
Characteristics of EDD Descriptors
The descriptors showed that teachers paid considerable attention to the extent to
which writers satisfactorily addressed the given topic. Although content fulfillment was
not included in the research scope of traditional SLA studies examining ESL writing,
contemporary L1 and L2 writing theories do agree that it is a central component of good
writing. For example, Grabe and Kaplan (1996) considered topical knowledge or
knowledge of the world to be a parameter that determined writing performance. Similarly,
research on rater perceptions and behaviours has verified content fulfillment as an
important consideration. These empirical and theoretical accounts support the idea that
an important criterion of written text assessment is the extent to which an essay fulfills
content requirements.
Teachers also felt that organizational effectiveness determined the quality of an
essay. This finding was reasonable because the ability to coherently organize ideas has
long had a place in writing instruction and research; Halliday and Hasan (1976)
conceptualized cohesion and coherence as ways in which textual structure is tied together
in extended discourse. Similarly, Canale (1983) and Grabe and Kaplan (1996) suggested
that unified text can be attained through cohesion in form and coherence in meaning.
Most analytic rating scales also highlight its importance in written discourse; Hamp-Lyons
and Henning (1991) included organization as an independent evaluation criterion
in their ESL writing scale, and the IELTS writing rating scale considers coherence and
cohesion an important ESL writing subskill.
Teachers were also concerned about grammatical knowledge. Grammatical
accuracy has been one of the most-researched topics in SLA studies on writing
development and a central theme in ESL writing instruction and research. Achievement
of ESL writing skills has traditionally been defined as mastery of discrete grammar
knowledge and the ability to produce linguistically accurate written text (Kepner, 1991).
Grammatical errors are therefore a primary language-related concern for ESL
composition teachers, who often focus heavily on eradicating them in student writing.
This study confirmed that recurrent grammatical errors were teachers' primary concern
in student writing assessment. Specifically, teachers' attention focused on fine-grained aspects of
grammatical knowledge such as verb tense, article use, and preposition use. This finding
was noteworthy because most ESL writing scales measure learners‟ grammatical
competence at a macro-level, obscuring students‟ performance on specific grammar
components.
Teachers also showed considerable interest in various aspects of students'
vocabulary use. Their attention to the quality of written vocabulary (sophistication,
variety, choice, and collocation) echoed the idea that a good vocabulary leads to good
writing. The importance of vocabulary in written text is supported by theoretical
frameworks of L1 and L2 writing (e.g., Grabe & Kaplan, 1996) and empirical SLA
studies (e.g., Engber, 1995; Laufer, 1991; Laufer & Nation, 1995). It has also been
recognized by a variety of ESL writing scales: Brown and Bailey (1984) and Jacobs et al.
(1981) emphasized the close association between vocabulary and writing performance in
their ESL academic writing scales, as did the IELTS rating scale, which included lexical
resource as one constituent subscale. Similarly, Mullen (1977) found that vocabulary
appropriateness accounted for 84.4% of the variance in overall writing performance.
These research findings suggest that vocabulary is indeed an indispensable factor in
determining the quality of writing.
Writing mechanics was another area that drew teachers' attention; however, as
Polio (2001) rightly pointed out, mechanics has not been a central concern for language
researchers. There has been very little research examining writers‟ mechanical
proficiency in relation to their writing development. Indeed, Polio questioned whether
mechanics should even be considered part of the writing construct, since the various
aspects of mechanics (such as capitalization, spelling, indentation, and punctuation) are
not conceptually related to each other, making it difficult to form a unitary construct. Still,
mechanical knowledge does play a significant role in writing processes. Knowledge of
written code is achieved through the mastery of orthography, spelling, punctuation, and
formatting conventions (Grabe & Kaplan, 1996), and a writer's intended meaning would
be obscured or even lost without their appropriate use. The value of mechanics can also be
found in existing writing rating scales. Jacobs et al. (1981) and Brown and Bailey (1984)
considered mechanics a component of their academic writing scales.
In summary, the review suggested that five writing skills encompass all aspects
of the 39 descriptors: (a) content fulfillment, (b)
organizational effectiveness, (c) grammatical knowledge, (d) vocabulary use, and (e)
mechanics. This skill configuration was consistent with the theoretical discussions and
existing assessment schemes discussed in Chapter 2. The scale created by Jacobs et al.
(1981) was particularly relevant to this classification in that it described the five ESL
writing skills in a comprehensive manner based upon empirical data. The ways in which
the descriptors correspond to the five skills will be discussed in Chapter 6, which
presents the results from the ESL writing experts‟ sorting activity.
Refinement of EDD Descriptors
The 39 descriptors elicited from the teachers' think-aloud verbal reports were
subjected to review and refinement by ESL academic writing experts. Each descriptor
was examined in order to evaluate whether it was clear, non-redundant, useful, and
relevant to ESL academic writing. Three descriptors were identified as problematic: D14:
Ideas reflect the central focus of the essay, without digressing; D28: Conditional verb
forms are used appropriately; and D38: The essay prompt is well-paraphrased, and is not
replicated verbatim. The experts pointed out that D14 overlapped with D11-D13, and that
D28 and D38 addressed relevant, but too-specific, aspects of ESL writing. Indeed, these
descriptors were rarely mentioned in the ESL teachers' think-aloud verbal reports, with
total comments accounting for less than 1% of all verbal protocols. The experts also
suggested combining two descriptors (D16: Syntactic variety is demonstrated in this
essay and D17: Complex sentences are used effectively) to form one new descriptor,
such as “This essay demonstrates syntactic variety, including simple, compound, and
complex sentence structures”. The review and refinement process resulted in the
elimination of three descriptors altogether, and the combination of two other descriptors
into one, for a final total of 35 descriptors (see Table 14).
The clarity of the descriptors was also reviewed. The experts read each
descriptor iteratively and edited it so that it would be clear and easy for teachers to use.
Twenty-two descriptors were edited in this manner, with most editing focused on specific
wordings to minimize ambiguity. The descriptors were then examined for distinctiveness
and comprehensiveness. Each descriptor was confirmed to be independent of the others
and comprehensive enough to cover all aspects of ESL academic writing. No new
descriptors were added to the descriptor pool.
When the experts were asked whether the descriptors were conducive to making
a binary choice (yes or no), most commented that while choices on a four-point Likert
scale (strongly agree, somewhat agree, somewhat disagree, or strongly disagree) were
preferable for descriptors that required a subjective judgment (such as D26 and D27),
they were able to use the binary choice as well if necessary. The binary choice was used
for the EDD checklist for several reasons, but primarily because it is difficult to build a
diagnostic model using polytomous data due to technical limitations. Although recent
development in CDA has yielded a psychometric diagnostic model that can deal with
polytomous data, the model's stability when applied to real (not simulated) data was
unknown. It was also questionable how increasing the parameters in such a model would
affect the model convergence and parameter estimations, given the small sample size in
this study (n=480). Finally, while some descriptors (e.g., D26 and D27) asked about the
degree to which a student mastered a given skill, other descriptors relied on the absolute mastery
or non-mastery of a skill. For example, D01, D09, and D34 were more likely to be
answered with a yes (mastery) or a no (non-mastery) choice instead of a Likert-scaled
choice.
Table 14
Refined 35 EDD Descriptors
Descriptor
1. This essay answers the question.
2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.
3. This essay is concisely written and contains few redundant ideas or linguistic expressions.
4. This essay contains a clear thesis statement.
5. The main arguments of this essay are strong.
6. There are enough supporting ideas and examples in this essay.
7. The supporting ideas and examples in this essay are appropriate and logical.
8. The supporting ideas and examples in this essay are specific and detailed.
9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.
10. Each body paragraph has a clear topic sentence tied to supporting sentences.
11. Each paragraph presents one distinct and unified idea.
12. Each paragraph is connected to the rest of the essay.
13. Ideas are developed or expanded well throughout each paragraph.
14. Transition devices are used effectively.
15. This essay demonstrates syntactic variety, including simple, compound, and complex sentence structures.
16. This essay demonstrates an understanding of English word order.
17. This essay contains few sentence fragments.
18. This essay contains few run-on sentences or comma splices.
19. Grammatical or linguistic errors in this essay do not impede comprehension.
20. Verb tenses are used appropriately.
21. There is consistent subject-verb agreement.
22. Singular and plural nouns are used appropriately.
23. Prepositions are used appropriately.
24. Articles are used appropriately.
25. Pronouns agree with referents.
26. Sophisticated or advanced vocabulary is used.
27. A wide range of vocabulary is used.
28. Vocabulary choices are appropriate for conveying the intended meaning.
29. This essay demonstrates facility with appropriate collocations.
30. Word forms (noun, verb, adjective, adverb, etc.) are used appropriately.
31. Words are spelled correctly.
32. Punctuation marks are used appropriately.
33. Capital letters are used appropriately.
34. This essay contains appropriate indentation.
35. Appropriate tone and register are used throughout the essay.
Summary
This chapter has discussed the identification of the descriptors that make up the
EDD checklist. Think-aloud verbal protocols from nine ESL teachers were open-coded,
focusing on recurrent evaluative themes of ESL academic writing subskills and textual
features. Thirty-nine concrete, fine-grained descriptors were empirically identified and
subsequently confirmed through theoretical analysis. The descriptors addressed all aspects
of ESL writing skills, including content fulfillment, organizational effectiveness,
grammatical knowledge, vocabulary use, and mechanics. The descriptors were then
subjected to review by four ESL academic writing experts. The review and refinement
process eliminated three descriptors and merged two descriptors into one, resulting in a
final total of 35 EDD descriptors. These 35 descriptors appeared in the EDD checklist
accompanied by a yes or a no response option. The next chapter discusses the
preliminary evaluation of the EDD checklist.
CHAPTER 5
PRELIMINARY EVALUATION OF THE EDD CHECKLIST
Introduction
This chapter discusses the preliminary evaluation of the EDD checklist
conducted in Phase 2. Prior to proceeding to the main diagnosis modeling, the checklist's
basic functionality was examined from multiple perspectives. Seven ESL teachers piloted
the EDD checklist by assessing 80 TOEFL iBT independent essays in order to determine
its effectiveness. Both quantitative and qualitative data were collected and analyzed in order
to examine the three validity assumptions:
The scores derived from the EDD checklist are generalizable across different
teachers and essay prompts (Teacher and essay prompt effects).
Performance on the EDD checklist is related to performance on other measures
of ESL academic writing (Correlation between EDD scores and TOEFL scores).
The EDD checklist helps teachers make appropriate diagnostic decisions and has
the potential to positively impact teaching and learning ESL academic writing
(Teacher perceptions and evaluations).
The results derived from these assumptions informed the checklist‟s usability and
furthered the main study. The empirical evidence needed to justify each validity
assumption is presented below.
Teacher and Essay Prompt Effects
Facet Measures
Prior to estimating facet measures, model convergence was checked using a Joint
Maximum Likelihood Estimation (JMLE) algorithm. The convergence criteria were set
at 0.1 for the maximum size of the marginal score residual, and at 0.01 for the maximum
size of the logit change. These tight criteria were chosen to produce a result with high
precision. Convergence was reached after 43 iterations, resulting in 0.0923 for the largest
marginal score residual and -0.0002 for the largest logit change. These negligibly small
values indicated that the score difference and change are insignificant.
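The stopping rule described above can be sketched in a few lines of Python. This is an illustration of the convergence check only, not of FACETS itself: the `step` callable stands in for one JMLE estimation iteration, and the stand-in values simply replay the trajectory reported in the text (convergence at iteration 43, residual 0.0923, logit change -0.0002).

```python
def run_until_converged(step, max_residual=0.1, max_logit_change=0.01,
                        max_iterations=1000):
    """Iterate a JMLE-style estimation step until both convergence
    criteria are met. `step` returns (largest_marginal_score_residual,
    largest_logit_change) for the current iteration."""
    for iteration in range(1, max_iterations + 1):
        residual, logit_change = step()
        if abs(residual) < max_residual and abs(logit_change) < max_logit_change:
            return iteration, residual, logit_change
    raise RuntimeError("estimation did not converge")

# Stand-in step mimicking the study's trajectory: the criteria are first
# met at iteration 43, with residual 0.0923 and logit change -0.0002.
values = iter([(1.0, 0.5)] * 42 + [(0.0923, -0.0002)])
iterations, residual, change = run_until_converged(lambda: next(values))
```

Both thresholds must be satisfied simultaneously, which is why the relatively large residual of 0.0923 still counts as converged: it falls just inside the 0.1 criterion.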
The extent to which the data fit the model was then examined using two
approaches.26
The first method utilized the information derived from the FACETS
summary report. FACETS provides summary statistics that evaluate whether each facet
has been successfully estimated and whether the data fit the model. When the model-data
fit is satisfied, the mean of the standardized residuals (StRes) is close to 0 and the sample
standard deviation (SD) is close to 1 (Linacre, 2009). As Table 15 shows, the two
statistics support the model-data fit: the mean of the standardized residuals is near 0 (i.e.,
-0.01) and the standard deviation is near 1 (i.e., 1.01).
Table 15
FACETS Data Summary

                      Category   Score   Expected score   Residual   StRes
M (Count: 7,326)        0.59      0.59        0.59          0.00     -0.01
SD (Population)         0.49      0.49        0.23          0.44      1.01
SD (Sample)             0.49      0.49        0.23          0.44      1.01
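Linacre's summary heuristic can be expressed directly in code. The residual values below are illustrative placeholders, not the study's 7,326 actual residuals, which are produced internally by FACETS:

```python
import statistics

# Placeholder standardized residuals; the real values come from the
# FACETS analysis and are summarized in Table 15.
stres = [-0.8, 1.2, 0.3, -1.5, 0.9, -0.2, 1.8, -1.1, 0.1, -0.6]

mean = statistics.fmean(stres)
sd = statistics.stdev(stres)  # sample standard deviation

# Linacre's (2009) heuristic: model-data fit is supported when the mean
# of the standardized residuals is close to 0 and the sample SD is
# close to 1 (tolerances here are arbitrary illustrative choices).
fit_supported = abs(mean) < 0.1 and abs(sd - 1.0) < 0.1
```

Applied to the values in Table 15 (mean -0.01, sample SD 1.01), the same check supports the model-data fit.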
The second model-data fit evaluation method examined unexpected responses
with extreme standardized residual values. According to Linacre (2007), in order
for the data to fit the model, about 5% or less of the standardized residuals should lie
outside of the range of -2 to +2, and about 1% or less should lie outside of the range of -3
to +3. Of 7,326 valid responses, 257 (about 3.5%) had standardized residuals above +2
or below -2, and 57 (about 0.78%) had standardized residuals above +3 or below -3.
was roughly even across all teachers, ranging from 18 to 49 (see Table 16).
Table 16
Distribution of Unexpected Responses across Teachers
Teacher No. of ratings StRes < -2 or StRes > 2 StRes < -3 or StRes > 3
Angelina 1,047 35 2
Ann 1,050 34 12
Beth 1,041 18 2
26. In the Rasch model, the point of interest is whether the data fit the model, not whether the model fits the data.
Brad 1,050 41 6
Esther 1,043 48 17
Susan 1,046 49 14
Tom 1,049 32 4
Total 7,326 257 57
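The reported percentages can be verified arithmetically against Linacre's thresholds, using the totals from the text and Table 16:

```python
# Counts reported for this dataset: of 7,326 valid responses, 257 had
# standardized residuals outside +/-2 and 57 had residuals outside +/-3.
total, outside_2_count, outside_3_count = 7326, 257, 57

outside_2 = outside_2_count / total  # about 3.5%
outside_3 = outside_3_count / total  # about 0.78%

# Linacre's (2007) rule of thumb: acceptable model-data fit when at most
# about 5% of residuals lie outside +/-2 and about 1% outside +/-3.
acceptable = outside_2 <= 0.05 and outside_3 <= 0.01
```

Both proportions fall comfortably under their thresholds, which is what licenses proceeding to the facet-measure estimation.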
Once the model-data fit was satisfied, the estimation of each facet measure was
examined. The FACETS variable map displays all facets graphically on a common logit
scale, enabling comparisons within and between facets (see Figure 2). The first column
in the map shows a logit scale applied equally across facets. A higher logit value
indicates a more able examinee, a more difficult task, or a more severe rater, whereas a
lower logit value indicates a less able examinee, a less difficult task, or a less severe rater.
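This interpretation reflects the dichotomous many-facet Rasch model underlying the map, in which the log-odds of a "yes" rating equal the examinee's ability minus the descriptor's difficulty minus the teacher's severity. A minimal sketch with illustrative logit values (the function name and numbers are not taken from the study's FACETS specification):

```python
import math

def p_yes(ability, difficulty, severity):
    """Probability of a 'yes' rating under a dichotomous many-facet
    Rasch model: logit(P) = ability - difficulty - severity (all in logits)."""
    logit = ability - difficulty - severity
    return 1.0 / (1.0 + math.exp(-logit))

# A writer at 2.0 logits, on an easy descriptor (-1.0 logits), rated by
# a lenient teacher (-0.5 logits): higher ability, or lower difficulty
# or severity, raises the probability of a 'yes' rating.
p = p_yes(2.0, -1.0, -0.5)
```

When the three measures cancel out, the probability is exactly 0.5, which is why elements aligned at the same point on the common logit scale are directly comparable across facets.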
The second column shows writing proficiency measures for the 80 students.
Student proficiency measures ranged from 2.53 logits (Essay 2116) to -2.15 logits (Essay
1006), a spread of 4.68 logits: the student who wrote Essay 2116 was the most proficient,
displays the difficulty measures of the essay prompts. As the difficulty measures had
been adjusted to be the same (see Chapter 3), they were placed on the same point of the
logit scale. The fourth column presents the seven teachers' severity measures (see Table
17 for a more detailed discussion of teacher measures): Beth was the most severe in assessing
student essays, while Esther was the most lenient. The logit spread was 1.08, ranging from 0.15
logits (Beth) to -0.93 logits (Esther). Interestingly, three teachers (Angelina, Brad, and
Tom) exhibited almost the same severity measures. Finally, the fifth column presents
difficulty measures for the 35 descriptors, which ranged from 1.41 logits (D26) to -1.82
logits (D35), a spread of 3.23 logits. D26 (word sophistication) was the most difficult for
students to master, while D35 (tone and register) was the easiest. A close examination of
the difficulty measures revealed that descriptors related to vocabulary knowledge were
relatively more difficult than others. For example, D29 (collocation) and D27 (word
variety) were the third and eighth most difficult descriptors (see Appendix N for detailed
information about descriptor measures). On the other hand, descriptors associated with
grammatical knowledge were relatively easier. Most grammar-related descriptors (except
for D18 and D19) exhibited difficulty measures below the mean, suggesting relative
easiness. Descriptors measuring content fulfillment also drew attention: they were
relatively difficult, as is evident from their position near the top of the column (e.g., D3,
D5, D6, D7, and D8).
The overall pattern of the FACETS variable map suggests that the elements
comprising the teacher facet were the least varied. The range of teacher severity
measures (1.08 logits) was the narrowest, suggesting that the teachers exhibited
relatively homogeneous rating behaviours. On the other hand, substantial
variability was found in the descriptor difficulty. As the wide spread of the difficulty
measures (3.23 logits) indicates, the descriptors differed greatly in terms of difficulty.
This variation suggests that the descriptors tap different aspects of writing skill at
different levels of difficulty.
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|Measr|+ESSAY |-PROMPT GROUP |-TEACHER |-DESCRIPTOR |
|-----+------------------------------------------------------------+----------------------------+------------------------------+-------------------------|
| 3 + + + + |
| | | | | |
| | | | | |
| | | | | |
| | 2116 | | | |
| | | | | |
| | | | | |
| | | | | |
| 2 + + + + |
| | 2096 | | | |
| | | | | |
| | | | | |
| | 1070 1111 2119 | | | |
| | 1110 | | | D26 |
| | 1113 | | | |
| | 1104 2131 | | | |
| 1 + 1037 + + + D5 |
| | 1101 1112 | | | D10 D29 D6 |
| | 1134 2025 2120 2148 | | | D14 D8 |
| | 1109 2058 2107 | | | D19 D27 D3 D7 |
| | 1005 1055 1069 1088 1107 1117 2051 2076 2095 2104 | | | D31 |
| | 1114 2079 2097 2099 | | | |
| | 1146 2109 | | | D18 |
| | 1010 1056 1080 | | Beth | D1 D11 D2 D30 D34 |
* 0 * 1013 1020 1038 1053 2050 2070 * COOPERATION  SUBJECT * * D13 D32 D4 *
| | 1023 2032 2074 2077 2080 | | | D17 D24 |
| | 1002 1018 2005 2022 | | | D15 D20 |
| | 1004 1014 1074 2002 2018 2023 | | Angelina Brad Tom | D12 D23 D28 |
| | 1011 1081 2081 | | | |
| | 2067 | | Ann | D9 |
| | 1050 2001 2020 | | Susan | |
| | 1003 1008 | | Esther | |
| -1 + 2003 + + + |
| | 2004 | | | D21 D25 |
| | 1009 2029 | | | D22 D33 |
| | 1015 | | | |
| | 1007 | | | D16 |
| | 2011 | | | |
| | 2013 | | | |
| | 2006 2019 | | | D35 |
| -2 + + + + |
| | 1006 2015 | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| -3 + + + + |
|-----+------------------------------------------------------------+----------------------------+------------------------------+-------------------------|
|Measr|+ESSAY |-PROMPT GROUP |-TEACHER |-DESCRIPTOR |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
Figure 2. FACETS Variable Map
Teacher Internal Consistency
The extent to which the teachers were internally consistent was examined based
upon teacher fit statistics. Teacher fit statistics indicate the degree to which each teacher
is internally consistent in his or her ratings. Different rules of thumb are applied for
interpreting fit statistics and for setting upper and lower limits because they are more or
less context-dependent and require a targeted use of the test results (Myford & Wolfe,
2004a). When a test of interest is used to make a high-stakes decision, tight quality
control limits (such as mean squares of 0.8 to 1.2) are set; however, if the stakes are low,
looser limits are acceptable. Wright and Linacre (1994) proposed that the mean square
values of 0.6 to 1.4 are reasonable limits for data gathered using a rating scale.
In this study, the lower and upper quality control limits were set at 0.5 and 1.5,
respectively (Lunz & Stahl, 1990), since this study examines the rating behaviours of
teachers in a classroom setting rather than in a high-stakes test setting. An infit mean
square value less than 0.5 indicated overfit or a lack of variability in their scoring, while
an infit mean square value greater than 1.5 indicated significant misfit or a high degree of
inconsistency in the ratings. Table 17 presents several of the statistics associated with the
teacher facet; in particular, the fifth and sixth columns display the infit and outfit mean
squares for each teacher. All infit and outfit mean squares were within the range of 0.5
and 1.5, indicating that none of the teachers exhibited misfitting or overfitting rating
patterns and that all were internally consistent in their ratings.
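The quality-control screen just described can be sketched as a simple classification of each mean-square value. This is an illustrative sketch (the function name and dictionary are this sketch's own), seeded with the infit values reported in Table 17:

```python
# Flag raters whose mean-square fit statistics fall outside the chosen
# quality-control limits (0.5 and 1.5 here, following Lunz & Stahl, 1990).

def classify_fit(mnsq, lower=0.5, upper=1.5):
    """Classify a mean-square value as overfit, misfit, or acceptable."""
    if mnsq < lower:
        return "overfit"   # too little variability in the ratings
    if mnsq > upper:
        return "misfit"    # a high degree of inconsistency in the ratings
    return "acceptable"

# Infit mean squares from Table 17.
infit = {"Angelina": 1.02, "Ann": 1.01, "Beth": 0.89, "Brad": 1.01,
         "Esther": 1.04, "Susan": 1.02, "Tom": 1.00}

flags = {teacher: classify_fit(value) for teacher, value in infit.items()}
```

Under these limits every teacher is classified as acceptable, matching the conclusion above.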
Table 17
Teacher Measure Statistics
Teacher    Observed Average    Measure (logits)    Model S.E.    Infit MnSq    Outfit MnSq    Corr. PtBis    Exact Obs %    Agree. Exp %
Angelina 0.6 -0.35 0.07 1.02 1.02 0.20 65.9 59.7
Ann 0.6 -0.64 0.07 1.01 1.05 0.22 64.1 60.5
Beth 0.5 0.15 0.07 0.89 0.84 0.29 65.9 57.8
Brad 0.6 -0.37 0.07 1.01 1.01 0.25 63.7 60.9
Esther 0.6 -0.93 0.07 1.04 1.15 0.20 63.9 60.4
Susan 0.7 -0.71 0.07 1.02 1.07 0.23 64.7 61.4
Tom 0.6 -0.39 0.07 1.00 0.97 0.25 64.6 60.1
Table 17 (Continued)
Teacher    Observed Average    Measure (logits)    Model S.E.    Infit MnSq    Outfit MnSq    Corr. PtBis    Exact Obs %    Agree. Exp %
M 0.6 -0.46 0.07 1.00 1.01 0.23
SD 0.1 0.32 0.00 0.05 0.09 0.03
RMSE (Model) = 0.07; Adj. SD = 0.31; Separation = 4.38; Separation (not inter-rater) Reliability = 0.95
Fixed (all same) chi-square = 143.2; d.f. = 6; significance (probability) = .00
Inter-rater agreement opportunities = 9,701; Exact agreements = 6,275 (64.7%); Expected agreements = 5,832.1 (60.1%)
Note. Infit MnSq = Infit Mean Square; Outfit MnSq = Outfit Mean Square; Corr. PtBis = Point-Biserial
Correlation; Exact Obs % = Percentage of Exact Observed Agreement; Agree. Exp % = Percentage of
Expected Agreement; RMSE = Root Mean Square Standard Error; Adj. SD = Adjusted Standard Deviation.
A more detailed analysis using the rater effect criteria proposed by Wolfe, Chiu,
and Myford (1999) was conducted to further examine the teachers' internal consistency.
Wolfe et al. adopted tight quality control indices (mean squares of 0.7 and 1.3 for the
lower and upper limits) and determined rater effects using fit statistics and the
proportion of unexpected ratings for each rater (Zp).[27] The combinations of these
indices indicate accurate, random, halo/central, and extreme rating patterns,
respectively. According to
Myford and Wolfe (2004b), a random rating pattern occurs when raters use one or more
scales inconsistently compared to other raters, while an extreme rating pattern occurs
when raters assign ratings at the high or low ends of the scale. The halo effect occurs
when raters assign similar ratings to a distinctive trait, and the centrality effect occurs
when raters overuse the middle categories of a rating scale (Myford & Wolfe, 2004b).
Table 18 presents the ways in which the rater effect is determined based upon fit
statistics and Zp indices. The teachers' rating behaviour is summarized in the last column,
indicating that all of the teachers exhibited accurate rating patterns.
[27] For the discussion about how to compute a Zp index, see Myford and Wolfe (2000).
Table 18
Teacher Effect
Rater effect    Infit MnSq           Outfit MnSq          Zp           No. of teachers
Accurate        0.7 ≤ infit ≤ 1.3    0.7 ≤ outfit ≤ 1.3   Zp ≤ 2.00    7
Random          infit > 1.3          outfit > 1.3         Zp > 2.00    0
Halo/Central    infit < 0.7          outfit < 0.7         Zp > 2.00    0
Extreme         0.7 ≤ infit ≤ 1.3    outfit > 1.3         Zp > 2.00    0
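The decision rules in Table 18 can be read as a small classifier. The sketch below is a hedged rendering of those rules; the table does not say how a rater falling outside all four patterns should be handled, so the fall-through value is this sketch's own assumption:

```python
def rater_effect(infit, outfit, zp):
    """Classify a rater using the fit-statistic and Zp criteria of Table 18."""
    infit_ok = 0.7 <= infit <= 1.3
    outfit_ok = 0.7 <= outfit <= 1.3
    if infit_ok and outfit_ok and zp <= 2.00:
        return "accurate"
    if infit > 1.3 and outfit > 1.3 and zp > 2.00:
        return "random"
    if infit < 0.7 and outfit < 0.7 and zp > 2.00:
        return "halo/central"
    if infit_ok and outfit > 1.3 and zp > 2.00:
        return "extreme"
    return "unclassified"  # assumption: patterns outside Table 18 are left unlabelled
```

Applied to the fit statistics in Table 17 (with Zp at or below 2.00 for every teacher, as Table 18 reports), all seven teachers fall into the accurate pattern.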
Teacher Agreement
Two approaches were used in order to examine the degree of agreement between
teacher assessments. The first used a percentage of exact agreement, which indicated the
percentage of times that each teacher provided exactly the same ratings as another
teacher under identical circumstances. The agreement statistics and expected values were
provided by FACETS. As the eighth column of Table 17 shows, the exact observed
agreement of teachers ranged from 63.7% to 65.9% (M = 64.7%).[28] Although this range
does not seem to support the idea of substantial agreement among teachers, it is
reasonable considering that the teachers were not trained as professional raters of ESL
writing. A similar agreement pattern is found in other writing assessment research.
Barkaoui (2008) reported that teachers' agreement reached 22.4% when they used a nine-
point holistic rating scale and 23.1% when they used a nine-point analytic rating scale.
When his teacher group was examined further, novice teachers showed 20.0% agreement,
while experienced teachers exhibited 26.3% agreement. His findings seem to confirm the
difficulty of achieving high agreement among teachers who are not trained as
professional assessment raters, echoing this study's finding.
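The agreement figures above follow directly from the counts FACETS reports (see footnote 28). A minimal sketch of the computation, with an illustrative function name:

```python
def exact_agreement_pct(agreements, opportunities):
    """Percentage of exact agreements out of all agreement opportunities."""
    return agreements / opportunities * 100

# Counts from the FACETS output reported with Table 17.
observed = exact_agreement_pct(6275, 9701)    # observed exact agreements
expected = exact_agreement_pct(5832.1, 9701)  # model-expected agreements
print(round(observed, 1), round(expected, 1))  # 64.7 60.1
```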
[28] According to Linacre (2009), observed exact agreement is defined as “the proportion of times one
observation is exactly the same as one of the other observations for which there are the same circumstances”
(p. 237). The observed exact agreement of 64.7% was therefore computed as (6,275/9,701) × 100 = 64.7%.
An expected agreement, on the other hand, is defined as the “expected percent of exact agreements between
raters on ratings under identical conditions, based on Rasch measures” (p. 160).
However, when well-trained certified raters are involved in a high-stakes ESL
writing assessment, a fair amount of agreement can be achieved. Knoch (2007) examined
the functionality of two analytic ESL writing scales, the Diagnostic English Language Needs
Assessment (DELNA) and a newly developed diagnostic scale, and reported somewhat
fair, but still unsubstantial, agreement. The two rating scales were developed to assess
student writing skills and consisted of six levels (in the case of DELNA) and four to six
levels (in the case of the new diagnostic scale). When raters used the DELNA rating
scale, their agreement ranged from 33% to 41.7% (M = 37.92, SD = 2.49); when the new
diagnostic scale was used, agreement ranged from 36.1% (for a six-level scale) to 61.9%
(for a four-level scale) (M = 51.15, SD = 7.94). That the raters were well-trained certified
professionals must have contributed to this fair or moderate agreement, but it still
indicates that it is extremely difficult for raters to achieve substantial agreement on
writing assessments, possibly because of the inherently subjective nature of the task.
The second approach to examining inter-teacher reliability was a correlation
between a single rater and the rest of the raters (SR/ROR). SR/ROR correlation indicates
the degree to which one particular rater (i.e., the single rater) rank-orders examinees in a
manner consistent with all other raters. According to Myford and Wolfe (2004a),
SR/ROR correlations greater than 0.7 are considered high for an assessment in which a
multiple-level rating scale is involved, whereas SR/ROR correlations less than 0.3 are
thought to be somewhat low. Still, they caution that the control limit must be relaxed as
the number of scale categories decreases: for example, they report that SR/ROR
correlations as low as 0.2 are common in dichotomous ratings.[29] As the seventh column
of Table 17 illustrates, teachers' SR/ROR correlations in this study ranged from 0.20 to
0.29 (M = 0.23, SD = 0.03), suggesting that each teacher rank-ordered students in a
manner similar to that of the other teachers.[30]
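FACETS reports these point-biserial correlations directly. As a rough illustration of the underlying idea (not the program's exact computation), a single-rater/rest-of-raters index can be approximated by correlating one rater's scores with the mean of the remaining raters' scores on the same essays; the function names are this sketch's own:

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def sr_ror(ratings, rater):
    """ratings: {rater: [score per essay]}; correlate one rater with the rest."""
    single = ratings[rater]
    others = [scores for r, scores in ratings.items() if r != rater]
    rest = [mean(column) for column in zip(*others)]
    return pearson(single, rest)
```

For instance, a rater who rank-orders four essays exactly as the other raters do (even with an overall severity offset) obtains an SR/ROR of 1.0 under this sketch.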
[29] An SR/ROR correlation near or less than 0 indicates low inter-rater reliability.
[30] SR/ROR correlations are referred to as point-biserial correlations in the FACETS analysis.
Further analysis was conducted in order to examine the extent to which the
teachers agreed on each individual descriptor. The percentage of teachers' ratings that
agreed on each descriptor per essay was calculated, and the means and standard deviations
of the agreements on the 10 essays were examined. Ratings were derived from the 10 essays
in Batch 03 because these essays were assessed by all of the teachers. As Table 19 shows,
teachers had the highest agreement on D16 (word order; agreement = 90%) and exhibited
the lowest agreement on D13 (idea development; agreement = 61.43%). When the
descriptors that elicited high agreement (> 85%) were examined, most were related to
discrete grammar knowledge (e.g., D16, D23, D31, and D32). On the other hand, when
the descriptors that elicited low agreement (< 70%) were examined, they were found to
be associated with global content skills (e.g., D01, D05, D06, D07, and D13). These
results are consistent with Milanovic et al.'s (1996) finding that essay content is the
most subjective component to rate, since raters' personal reactions can significantly
affect their ratings.
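The text does not spell out the exact agreement formula, so the sketch below is a hedged reconstruction: it assumes binary (yes/no) ratings and takes, for each essay, the share of teachers in the majority, averaged over the essays.

```python
from statistics import mean

def descriptor_agreement(ratings_per_essay):
    """ratings_per_essay: one list of 0/1 teacher ratings per essay.

    Returns the mean percentage of teachers in the majority across essays
    (an assumed operationalization, not necessarily the study's own).
    """
    shares = []
    for ratings in ratings_per_essay:
        yes = sum(ratings)
        shares.append(max(yes, len(ratings) - yes) / len(ratings) * 100)
    return mean(shares)
```

Under this definition, unanimous ratings on every essay yield 100%, while a 5-2 split among seven teachers on a single essay yields about 71.4%.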
Table 19
Teacher Agreement on Descriptors
Descriptor Agreement (%) SD Descriptor Agreement (%) SD
D01 65.71 13.80 D19 77.14 13.80
D02 80.00 12.05 D20 82.86 17.56
D03 74.29 17.56 D21 80.00 21.51
D04 70.00 18.38 D22 77.14 18.07
D05 68.57 14.75 D23 85.71 15.06
D06 65.71 13.80 D24 78.57 15.43
D07 68.57 14.75 D25 84.29 18.38
D08 74.29 13.13 D26 77.14 13.80
D09 80.00 15.36 D27 81.43 15.13
D10 71.43 15.06 D28 77.14 15.36
D11 70.00 14.21 D29 82.86 14.75
D12 75.71 16.56 D30 78.57 13.88
D13 61.43 6.90 D31 87.14 18.38
D14 77.14 16.77 D32 85.71 15.06
D15 78.57 15.43 D33 84.29 15.72
D16 90.00 15.13 D34 72.86 18.38
D17 77.14 18.07 D35 85.71 9.52
D18 77.14 19.28
Bias Analysis
A bias analysis was carried out to further explore the interaction between the
teachers and the descriptors. The extent to which a teacher was biased for or against a
particular descriptor was standardized as a z-score. A teacher with a z-
score between -2 and +2 was considered to be using a descriptor without significant bias.
When the z-score was below -2, the teacher was using that particular descriptor in a
significantly lenient manner compared to how he or she used other descriptors. When the
z-score was greater than +2, the teacher was using that descriptor more severely than he
or she did other descriptors.
Table 20 presents the bias terms between the teachers and the descriptors. A
fixed chi-square test indicated an overall significant biased interaction between the
teachers and the descriptors, p = .00. When individual interaction effects were examined,
only a few such cases were found: Beth was particularly severe with D03 (conciseness; z =
2.90, p = .01) and D08 (specific ideas and examples; z = 2.75, p = .01), while Susan was
particularly lenient with D27 (word variety; z = -2.80, p < .05) compared to other
descriptors. Except for these specific cases, teachers were not positively or negatively
biased toward any particular descriptors.
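The screen just described amounts to a two-sided cut-off on the standardized bias. A minimal sketch, seeded with the three significant interactions named above (the function name is illustrative):

```python
def bias_flag(z, cutoff=2.0):
    """Flag a teacher-descriptor interaction by its standardized bias."""
    if z > cutoff:
        return "severe"
    if z < -cutoff:
        return "lenient"
    return "no significant bias"

# z-scores for the flagged teacher-descriptor pairs (from Table 20).
flags = {("Beth", "D03"): bias_flag(2.90),
         ("Beth", "D08"): bias_flag(2.75),
         ("Susan", "D27"): bias_flag(-2.80)}
```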
Table 20
Interactions between Teachers and Descriptors
Teacher    Measr    Des    Measr    Obsvd Score    Exp. Score    z-score    Model S.E.    t    d.f.    p
Ann -0.64 D16 -1.55 30 26.2 -2.63 1.73 -1.52 29 .14
Beth 0.15 D03 0.64 1 10.3 2.90 1.03 2.83 28 .01
Beth 0.15 D08 0.79 1 9.4 2.75 1.03 2.68 28 .01
Susan -0.71 D27 0.64 28 16.4 -2.80 0.76 -3.71 29 .00
M (Count: 245) 17.7 17.7 -0.02 0.46 0.01
SD (Population) 6.2 4.6 0.83 0.14 1.61
SD (Sample) 6.2 4.6 0.83 0.14 1.61
Fixed (all = 0) chi-square: 633.8 d.f.: 245 significance (probability): .00
When the possible interactions between the prompts and the descriptors were
examined, no bias effect was found for or against either prompt.
Correlation between EDD Scores and TOEFL Scores
The writing proficiency measures estimated by the MFRM analysis were
correlated with the scores awarded by ETS raters across the 80 essays. The magnitude of
the correlation was substantial: r = .77, p < .01 for the subject prompt and r = .78, p < .01
for the cooperation prompt. The overall correlation across all essays was similarly strong
(r = .77, p < .01). This result provides some convergent evidence that the EDD checklist
measures the same writing construct as the TOEFL iBT independent writing rating scale;
however, further evidence is needed, given that correlational evidence alone cannot fully
resolve construct questions.
Teacher Perceptions and Evaluations
Teacher Confidence Levels
The degree to which the teachers were confident about their assessments across
the 35 descriptors on 10 essays (5 essays × 2 prompts) is presented in Tables 21 and 22.
Teachers' confidence levels showed a similar pattern across the two prompts, with the
mean ranging from 76.57% (D06) to 98% (D33) on the subject prompt and from 78.29%
(D08) to 97% (D34) on the cooperation prompt. They were generally confident in
assessing writing subskills related to D22 (singular and plural nouns), D24 (article use),
D31 (spelling), D33 (capitalization), D34 (indentation), and D35 (tone and register).
These descriptors showed confidence levels greater than 90% across the two prompts.
Teachers appeared less confident in assessing content-related writing skills, with
confidence levels lower than 80% on D05 (strong argument), D06 (enough ideas and
examples), and D07 (logical ideas and examples) across the two prompts. This result
suggests that teachers were more confident using descriptors associated with surface-
level grammatical (D22 and D24) and mechanical features (D31, D33, and D34) than
they were using those related to global content areas (D05, D06, and D07), and that the
subjective nature of the content criteria might have affected their confidence levels.
When confidence levels were examined across teachers, Tom and Ann were consistently
the most confident in using the descriptors (mean confidence > 95%), whereas Brad was
consistently less confident (mean confidence < 80%) on the two prompts.
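The per-descriptor means in Tables 21 and 22 are simple averages over the seven teachers. A minimal sketch, using the subject-prompt values for D06 and D33 from Table 21 (the 80% threshold and the function name are this sketch's own choices):

```python
from statistics import mean

def low_confidence(confidence, threshold=80.0):
    """Return descriptors whose mean confidence falls below the threshold."""
    return sorted(d for d, values in confidence.items() if mean(values) < threshold)

# Per-teacher confidence (%) on the subject prompt, from Table 21.
subject = {"D06": [66, 94, 88, 60, 52, 84, 92],
           "D33": [86, 100, 100, 100, 100, 100, 100]}

print(round(mean(subject["D06"]), 2))  # 76.57, the lowest mean on this prompt
print(low_confidence(subject))         # ['D06']
```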
The teacher agreement and confidence level data points were plotted together on
the same graph in order to examine their associations closely (see Figure 3). When
overall trends were examined, teachers agreed notably less on the content-related
descriptors (D04, D05, D06, D07, and D08), just as they were less confident in using them.
Teacher agreement thus seems to reflect the teachers' confidence levels to some extent,
with one exception: although they expressed confidence in assessing the writing
skills associated with D34 (indentation), the teachers' agreement was not as high as
expected.[31] Care should nonetheless be taken in interpreting the relationship between
teacher agreement and confidence.
[31] One teacher commented on the follow-up questionnaire that it was not clear how many spaces were
considered appropriate indentation. This issue is revisited in the next section.
Table 21
Teacher Confidence (%) on the Subject Prompt
Teacher D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 D16 D17 D18
Angelina 100 70 78 78 82 66 74 66 86 82 72 68 60 84 60 68 62 70
Ann 94 96 100 94 90 94 94 94 100 100 98 100 98 100 100 98 100 100
Beth 92 95 94 90 86 88 88 100 95 95 82 80 68 86 95 72 96 100
Brad 100 65 80 72 60 60 70 70 60 70 80 74 80 85 60 60 60 62
Esther 94 80 74 58 48 52 48 64 92 64 68 66 56 84 66 72 85 85
Susan 100 90 100 70 90 84 84 84 90 100 82 84 100 90 100 90 100 100
Tom 100 100 88 96 94 92 96 94 90 94 100 96 92 100 98 100 97 98
M 97.14 85.14 87.71 79.71 78.57 76.57 79.14 81.71 87.57 86.43 83.14 81.14 79.14 89.86 82.71 80.00 85.71 87.86
Table 21 (Continued)
Teacher Confidence (%) on the Subject Prompt
Teacher D19 D20 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30 D31 D32 D33 D34 D35 M
Angelina 80 70 74 76 68 72 78 78 78 74 54 70 76 78 86 100 100 75.37
Ann 98 100 100 98 100 100 100 98 100 98 100 100 100 100 100 100 99 98.31
Beth 94 95 100 94 85 100 88 90 94 95 86 95 95 89 100 100 89 91.17
Brad 80 80 82 100 80 100 84 100 70 70 60 60 96 66 100 100 100 77.03
Esther 68 94 72 98 74 64 62 100 78 48 82 70 100 78 100 64 90 74.23
Susan 84 100 100 100 100 100 84 70 74 84 90 100 100 100 100 80 100 91.54
Tom 98 100 98 98 100 100 100 100 98 98 94 98 100 100 100 98 100 97.29
M 86.00 91.29 89.43 94.86 86.71 90.86 85.14 90.86 84.57 81.00 80.86 84.71 95.29 87.29 98.00 91.71 96.86 86.42
![Page 156: An Argument-Based Validity Inquiry into the Empirically ... · This study built and supported arguments for the use of diagnostic assessment in English as a second language (ESL)](https://reader031.vdocuments.net/reader031/viewer/2022011920/60241a5076274865387c78e5/html5/thumbnails/156.jpg)
144
Table 22
Teacher Confidence (%) on the Cooperation Prompt
Teacher D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 D16 D17 D18
Angelina 92 82 80 84 72 70 82 80 82 94 88 78 78 100 78 80 74 74
Ann 92 98 98 92 88 92 98 98 98 88 90 98 94 100 96 100 99 100
Beth 79 85 88 85 87 89 85 88 73 89 86 87 87 86 69 74 75 80
Brad 100 70 70 70 60 70 62 60 60 75 80 70 80 60 70 70 66 60
Esther 74 74 88 80 68 64 62 54 74 64 66 74 70 80 100 88 86 94
Susan 100 100 94 70 90 90 84 76 90 84 96 100 100 100 100 98 100 100
Tom 86 100 100 98 94 84 86 92 100 100 98 100 100 100 100 96 100 100
M 89.00 87.00 88.29 82.71 79.86 79.86 79.86 78.29 82.43 84.86 86.29 86.71 87.00 89.43 87.57 86.57 85.71 86.86
Table 22 (Continued)
Teacher Confidence (%) on the Cooperation Prompt
Teacher D19 D20 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30 D31 D32 D33 D34 D35 M
Angelina 80 78 86 92 64 74 80 76 80 74 60 76 64 80 92 92 92 80.23
Ann 100 100 98 100 100 100 100 98 98 100 96 100 100 100 100 100 100 97.40
Beth 74 78 80 75 80 96 78 90 93 88 100 74 77 80 87 87 77 83.03
Brad 80 80 80 80 80 98 80 70 78 90 70 80 100 80 94 100 100 76.94
Esther 100 66 90 92 90 78 82 100 88 78 88 82 100 76 90 100 66 80.74
Susan 100 92 100 96 100 100 92 84 74 84 100 96 100 90 100 100 100 93.71
Tom 98 98 100 98 100 100 100 100 96 98 100 100 100 100 100 100 100 97.77
M 90.29 84.57 90.57 90.43 87.71 92.29 87.43 88.29 86.71 87.43 87.71 86.86 91.57 86.57 94.71 97.00 90.71 87.12
Figure 3. The scatter plot for teacher agreement and confidence
Teacher Questionnaire Responses and Interviews
The teachers' perceptions when using the EDD checklist were examined. Their
positive and negative reactions as reported on the questionnaire were analyzed
descriptively. The EDD checklist evaluation focused primarily on its (a) clarity, (b)
redundancy, (c) relevance and usefulness, and (d) strengths and weaknesses. When the
teachers were asked how many times they read the essays as they marked them, two
teachers said “twice”, four teachers said “three times”, and one teacher said “more than
three times”. This result suggests that the EDD checklist prompted the teachers to read
essays carefully so that they could answer all 35 descriptors.
When it came to overall satisfaction with using the EDD checklist in their essay
assessments, two teachers said that they liked the checklist “a little bit”, one teacher liked
it “quite a lot”, and four teachers liked it “very much”. Specifically, five teachers said
that the EDD descriptors were clearly understood, whereas two teachers said that they
were not. Of these two teachers, one pointed out that the words “strong,” “clear,” and
“few” were too subjective to render a yes or no decision, while the other commented that
the descriptors “sophisticated or advanced vocabulary is used” (D26) and “a wide range
of vocabulary is used” (D27) were closely tied to the writer's educational
background and the specific test context, rendering judgment difficult. She also
commented that it was not clear how many spaces were considered appropriate
indentation (D34).
Three teachers reported some redundancy in the EDD descriptors: two noted that
a single descriptor could be created by combining two seemingly similar descriptors “this
essay is written clearly enough to be read without having to guess what the writer is
trying to say (D02)” and “grammatical or linguistic errors in this essay do not impede
comprehension (D19)”. Another teacher commented that there was considerable overlap
between “making run-on sentences or comma splices (D18)” and “misuse of punctuation
marks (D32)” because run-on sentences and comma splices naturally suggest a lack of
appropriate punctuation. These points were directly related to the multi-divisibility of ESL
writing skills, as shown in the Q-matrix construction. Along the same lines, an essay‟s
clarity can be partially achieved by writing error-free text, as the knowledge of
punctuation usage can partially prevent writers from creating run-on sentences or comma
splices.
Although all seven teachers agreed that the EDD checklist was useful and
relevant for assessing ESL academic writing ability, two teachers pointed out that the
EDD checklist was not comprehensive enough to capture all circumstances in ESL
academic writing. One teacher suggested that more descriptors related to content
development and argument presentation should be included in the checklist, and another
suggested that the ability to paraphrase or create pre-writing strategies should also be
assessed.
The teachers' evaluations of the EDD checklist were examined from a slightly
different perspective. They were asked to judge the relative importance of the descriptors
in developing students' ESL academic writing. The results indicated that most teachers
felt that the descriptors associated with content development and organization were much
more important than the others (such as punctuation) because the fundamental goal of
academic writing is to make an effective and persuasive argument. This argument echoed
the need for differential weighting on the descriptors, since certain descriptors might not
be as important as others that assess the core construct of ESL academic writing.
Of particular interest were the teachers' comments on the strengths and
weaknesses of the EDD checklist. Their open-ended responses highlighted a variety of
important issues. Overall, the teachers thought that the EDD checklist covered many
important elements of ESL academic writing and appreciated that the checklist enabled
them to view an essay in a comprehensive and detailed manner. One teacher commented
that “itemization of writer skills greatly helped to focus on what to look for during the
assessment.” Ironically, comprehensiveness was also considered a weakness: three
teachers said that the checklist was “too long” and “too time-consuming” to be
implemented in a classroom assessment. Conflicting opinions also existed with regard to
the use of binary choice: two teachers felt that this method was too limited to allow for
consistent decisions. By contrast, another teacher called binary choice the checklist's
genuine strength for its ability to facilitate accurate and fast decisions. Indeed, five
teachers reported on the Likert-scale questionnaire items that the EDD checklist was
conducive to making binary choices, while two did not agree.
In addition to the lengthy process and lack of scale, weighting was another
important issue raised by the teachers. Two rightly argued that certain descriptors must
be weighted more heavily than others to better reflect a student‟s overall writing
competence; Brad pointed out, for example, that the ability to make strong main
arguments is a more important writing skill than the ability to use capital letters and thus
deserves greater weighting:
I think the descriptors sometimes don't give an accurate reflection. For example,
some essays I graded were poor, but scored well, because the capitalizations
were fine or punctuations were okay. But, they missed the fundamental areas of
academic writing highlighted in descriptors 1-14, for example. (Brad)
Angelina also correctly noted that a writer might be unfairly penalized or
rewarded simply on the grounds that he or she did not employ a specific writing device:
I also found descriptor #29 slightly problematic in that in some cases I found it
hard to determine the test-taker's grasp of collocations and idiomatic expressions
because they rarely or simply did not use them in their answer. (Angelina)
She further questioned whether the EDD checklist considers both the frequency and the
nature of errors. As she rightly argued, consistent elementary spelling mistakes should be
treated differently from a single serious spelling mistake.
Despite these limitations, all of the teachers expressed positive views of the EDD
checklist‟s diagnostic function and appreciated its positive impact on student learning
and teacher instruction. The EDD checklist was thus determined to function as intended
and was confirmed for use in the main study. No revisions were made to the checklist,
not only because not all teachers raised the same problems, but also because their
opinions often contradicted one another. Even if the descriptors had been revised, it was
unknown how the revisions would affect the teachers' assessment behaviours or perceptions. In addition,
psychometric problems such as determining the relative weight of each descriptor could
not be approached without a precise procedure based upon empirical evidence. Creating
additional descriptors that took different facets of writing skills or language errors into
account was also a daunting task, given the already high number of descriptors. Instead,
in-depth rater training was held during the main study, using slightly revised assessment
guidelines in order to help teachers fully understand and effectively employ the checklist.
Summary
This chapter has examined three validity assumptions centered on the
preliminary evaluation of the EDD checklist. Each assumption was carefully examined
based upon multiple pieces of empirical evidence. The study's findings provided a
somewhat mixed picture of the generalizability of the scores derived from the checklist;
agreement rates among teachers were not substantially high in spite of high intra-teacher
reliability. The high correlation between EDD scores and TOEFL scores provided
convergent evidence for use of the checklist; however, this criterion-related validity
claim should be interpreted carefully because the two rating rubrics were developed for
different test purposes. Although the two sets of scores were highly correlated, divergent
evidence could indicate underlying differences between the two rubrics. Overall teacher
confidence and evaluation further justified the validity claims for the use of the EDD
checklist. Most teachers used it without much difficulty and valued its diagnostic
function. The chain of validity inquiries in this chapter provided evidence of the overall
usability of the EDD checklist and supported its suitability for use in the main study. The
next chapter discusses the primary evaluation of the EDD checklist.
CHAPTER 6
PRIMARY EVALUATION OF THE EDD CHECKLIST
Introduction
This chapter discusses the primary evaluation of the EDD checklist conducted in
Phase 2. The overall results of the pilot study were positive, allowing the checklist to be
used in modeling diagnostic writing skill profiles in the main study. Ten ESL teachers
assessed 480 TOEFL iBT independent essays using the checklist and then evaluated its
use through a questionnaire and interviews. Both quantitative and qualitative data were
collected and analyzed in order to examine three validity assumptions:
1. The EDD checklist provides a useful diagnostic skill profile of ESL academic
writing (Characteristics of the diagnostic ESL academic writing skill profiles).
2. Performance on the EDD checklist is related to performance on other measures
of ESL academic writing (Correlation between EDD scores and TOEFL scores).
3. The EDD checklist helps teachers make appropriate diagnostic decisions and has
the potential to positively impact the teaching and learning of ESL academic
writing (Teacher perceptions and evaluations).
Each assumption addressed a different aspect of the validity argument and provided
valuable information used to justify the score-based interpretation and use of the EDD
checklist. The empirical evidence gathered to examine each validity assumption is
discussed below.
Characteristics of the Diagnostic ESL Academic Writing Skill Profiles
Dimensional Structure of the EDD Checklist
The dimensional structure of the EDD checklist was analyzed both substantively
and statistically prior to the examination of its diagnostic capacity. The substantive
analysis was carried out based on the outcome of the descriptor sorting activity
conducted by the ESL experts. The results of the substantive analysis were used to
construct a Q-matrix. The statistical analysis was conducted using a series of conditional
covariance-based nonparametric dimensionality techniques. The results of both
substantive and statistical dimensionality analyses informed the extent to which the test
construct was multidimensional in relation to the assumption of the diagnostic
assessment model.
Substantive Dimensionality Analysis
Four ESL academic writing experts independently sorted the refined EDD
descriptors into dimensionally distinct ESL writing skills using their own skill
configuration. Each expert had a different conceptualization of ESL writing skills and
produced a different skill categorization scheme. Gary divided the descriptors into five
categories: (a) organization, (b) grammar, (c) vocabulary, (d) style, and (e) formatting,
commenting that vocabulary knowledge is closely related to writing style and that all
ESL writing skills are intertwined with each other. Gary also argued for a holistic
interpretation of writing, pointing out that it was difficult to analytically distinguish one
skill from the others, and that the whole is not always the sum of its parts.
Jane's categorization was particularly interesting, as she conceptualized ESL
writing skills from a hierarchical perspective, with a skill identification scheme layered
according to its (a) word, (b) sentence, (c) paragraph, and (d) essay components. From a
slightly different perspective, Anthony categorized descriptors into (a) idea development,
(b) organization, (c) language use, (d) vocabulary, and (e) punctuation, noting that idea
development is associated with the meaning of written text whereas organization is
focused on the form of written text. Anthony also suggested subdividing the language use
category into global and local levels. Alex‟s categorization scheme was similar to those
of Gary and Anthony, incorporating (a) content, (b) organization, (c) grammar, and (d)
mechanics. Overall, the experts‟ skill identification results indicated that the
predetermined sorting categories used in this study were comprehensive and compatible
with their empirical skill configurations.
After the experts had constructed their own skill schemes, they were asked to
identify skills-by-descriptors relationships using the predetermined skill categories
including (a) content fulfillment, (b) organizational effectiveness, (c) grammatical
knowledge, (d) vocabulary use, and (e) mechanics, so that a Q-matrix could be
constructed. Before beginning the sorting task, the four writing experts agreed that these
five writing skills described the characteristics of descriptors well, and represented the
construct of ESL academic writing. Table 23 shows the ways in which the experts related
the descriptors to a specific writing skill. It also indicates that while experts assigned a
single skill to most descriptors, multiple skills were assigned in some cases. Different
experts appeared to have different conceptualizations of content fulfillment and
organizational effectiveness (see the skill assignments to D01-D14) and grammatical
knowledge and mechanics (see the skill assignments to D17, D18, D31, and D33).
When the experts' agreement was examined, it was found that they had achieved
100% agreement on 20 of the descriptors. The most discrepancy was exhibited in D02,
D03, D14, and D35, descriptors which focused on a more holistic assessment of an
essay's general quality. There was considerable disagreement on D35,
which assessed the tone and register of an essay, because it could have been mastered by
appropriate use of vocabulary or the consistent interplay of all aspects of ESL writing
skills throughout the essay. It was interesting that Anthony suggested that the grammar
category be subdivided into lexical, sentence, and discourse levels or correctable and
non-correctable error aspects; this suggestion was convincing considering that the
grammar skill included the greatest number of descriptors. However, further analysis was
not conducted in order to keep the grain size of all the skills consistent.
A Q-matrix was finally constructed based on the outcomes of the sorting activity.
Each skill-by-descriptor correspondence was reviewed, taking all of the experts'
opinions into account. When the experts assigned different skills to a descriptor, all of
the noted skills were retained, and relevant ESL writing literature was consulted for the
final judgment call. The initial Q-matrix entries can be found in the last column of Table 23.
Twenty-one descriptors were matched to one single skill, and the remaining 14 were
matched to multiple skills. Grammatical knowledge contained the greatest number of
descriptors, while vocabulary use and mechanics contained a relatively smaller number
of descriptors. Since students greatly desire feedback on grammatical problems in their
writing (Cohen & Cavalcanti, 1990; Ferris, 1995; Hedgcock & Lefkowitz, 1994; Leki,
1991), the large number of descriptors in this area is reasonable; however, the relatively
small number of descriptors in vocabulary use and mechanics was somewhat problematic
because it can cause instability of parameter estimates. The initial Q-matrix subjected to
diagnosis modeling can be found in Appendix O.
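The union rule described above (retain every skill any expert noted, then adjudicate against the literature) can be sketched as follows. This is a minimal illustration using a three-descriptor subset of Table 23, not the procedure actually run in the study; the literature-based adjudication step is omitted.

```python
# Sketch: building a binary Q-matrix from expert skill assignments.
# A skill is entered for a descriptor if at least one expert assigned it.
# The descriptor data below are a small subset of Table 23.

SKILLS = ["CON", "ORG", "GRM", "VOC", "MCH"]

# expert -> {descriptor: skills assigned by that expert}
expert_assignments = {
    "Gary":    {"D01": {"CON"}, "D02": {"ORG"}, "D09": {"ORG"}},
    "Jane":    {"D01": {"CON"}, "D02": {"CON", "ORG"}, "D09": {"ORG"}},
    "Anthony": {"D01": {"CON"}, "D02": {"CON", "GRM", "ORG"}, "D09": {"ORG"}},
    "Alex":    {"D01": {"CON"}, "D02": {"ORG"}, "D09": {"ORG"}},
}

def build_q_matrix(assignments, skills):
    """Union the experts' skill assignments into a 0/1 Q-matrix."""
    descriptors = sorted({d for a in assignments.values() for d in a})
    q = {}
    for d in descriptors:
        assigned = set().union(*(a.get(d, set()) for a in assignments.values()))
        q[d] = [1 if s in assigned else 0 for s in skills]
    return q

q_matrix = build_q_matrix(expert_assignments, SKILLS)
# D02 receives CON, ORG, and GRM because at least one expert noted each.
print(q_matrix["D02"])  # [1, 1, 1, 0, 0]
```

In the study itself, a union entry such as D02's GRM could still be removed after consulting the literature; the sketch stops at the mechanical union step.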
Table 23
Experts' Descriptor Classification

Descriptor | Gary | Jane | Anthony | Alex | Q-matrix Entry
D01 | CON | CON | CON | CON | CON
D02 | ORG | CON, ORG | CON, GRM, ORG | ORG | CON, ORG
D03 | VOC | CON, VOC, ORG | CON | GRM, ORG, CON | CON, ORG, VOC
D04 | CON | CON, ORG | ORG | ORG, CON | CON, ORG
D05 | CON | CON | CON, ORG | CON | CON, ORG
D06 | CON | CON | CON | CON | CON
D07 | ORG | CON | CON | ORG | CON, ORG
D08 | CON | CON | CON | CON | CON
D09 | ORG | ORG | ORG | ORG | ORG
D10 | ORG | ORG | ORG | ORG | ORG
D11 | ORG | ORG | ORG, CON | ORG | ORG, CON
D12 | ORG | ORG | ORG | ORG | ORG
D13 | ORG | ORG | CON | CON | ORG, CON
D14 | ORG | ORG | ORG | ORG, GRM, VOC | ORG, GRM, VOC
D15 | GRM | GRM | GRM | GRM | GRM
D16 | GRM | GRM | GRM | GRM | GRM
D17 | GRM | GRM, MCH | GRM | GRM | GRM, MCH
D18 | GRM | GRM, MCH | GRM | GRM | GRM, MCH
D19 | GRM | GRM | GRM | GRM | GRM
D20 | GRM | GRM | GRM | GRM | GRM
D21 | GRM | GRM | GRM | GRM | GRM
D22 | GRM | GRM | GRM | GRM | GRM
D23 | GRM | GRM | GRM | GRM | GRM
D24 | GRM | GRM | GRM | GRM | GRM
D25 | GRM | GRM | GRM | GRM | GRM
D26 | VOC | VOC | VOC | VOC | VOC
D27 | VOC | VOC | VOC | VOC | VOC
D28 | VOC | VOC | VOC | VOC | VOC
D29 | GRM | VOC | VOC | GRM | VOC, GRM
D30 | VOC | GRM | GRM | GRM | GRM, VOC
D31 | MCH | MCH | GRM | MCH | MCH, GRM
D32 | MCH | MCH | MCH | MCH | MCH
D33 | MCH | MCH | GRM | MCH | MCH, GRM
D34 | MCH | MCH | MCH | MCH | MCH
D35 | VOC | CON, GRM, VOC | VOC, ORG | GRM, MCH | VOC, GRM, CON, ORG, MCH
Note. When multiple skills are assigned to a descriptor, a primary skill appears before a secondary skill.
For example, the notation of CON, ORG indicates that CON is a primary skill and ORG is a secondary
skill. CON=content fulfillment, ORG=organizational effectiveness, GRM=grammatical knowledge,
VOC=vocabulary use, and MCH=mechanics.
Statistical Dimensionality Analysis
An exploratory DIMTEST analysis resulted in the rejection of the null
hypothesis of unidimensionality with an extremely small p-value, T = 7.28, p < .001.
Twelve descriptors were selected as an initial AT set by the program, including six CON
descriptors (D01, D03, D04, D05, D07, and D08) and six ORG descriptors (D09, D10,
D11, D12, D13, and D14). The subsequent exploratory DIMTEST analysis failed to
reject the null hypothesis, suggesting that the CON and ORG skills are dimensionally
distinct from the GRM, VOC, and MCH skills.
DETECT was then performed in an exploratory manner in order to estimate the
number of dimensions present in the data and the magnitude of the multidimensionality.
As Table 24 shows, the descriptors were separated into four clusters maximizing the
DETECT index. Consistent with the results of the exploratory DIMTEST analysis, the
CON and ORG descriptors (D01-D14) constituted the first cluster. The DETECT value
was noticeably large (DETECT index = 1.25), indicating strong evidence of
multidimensionality. In addition, the IDN and r indices were close to 1 (IDN index =
0.82 and r index = 0.79), indicating that the approximate simple structure held true for
the data.
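DETECT itself is a dedicated program, but the conditional-covariance quantity it aggregates can be illustrated directly. The sketch below, with made-up 0/1 response data, estimates the covariance of an item pair conditional on the rest score; a positive value suggests the pair shares a secondary dimension, while a negative value suggests the items load on different dimensions.

```python
# Sketch of the quantity conditional-covariance methods (DIMTEST,
# DETECT, CCPROX) are built on: the covariance of an item pair given
# the rest score (total score on all other items). Toy data, not the
# DETECT program itself.
from collections import defaultdict

def conditional_covariance(responses, i, j):
    """Weighted average within-stratum covariance of items i and j,
    stratifying examinees by their rest score (excluding i and j)."""
    strata = defaultdict(list)
    for row in responses:
        rest = sum(row) - row[i] - row[j]
        strata[rest].append((row[i], row[j]))
    total, n = 0.0, 0
    for pairs in strata.values():
        if len(pairs) < 2:          # cannot estimate covariance from one case
            continue
        mi = sum(p[0] for p in pairs) / len(pairs)
        mj = sum(p[1] for p in pairs) / len(pairs)
        cov = sum((a - mi) * (b - mj) for a, b in pairs) / len(pairs)
        total += cov * len(pairs)
        n += len(pairs)
    return total / n if n else 0.0

# Items 0-1 pattern together, items 2-3 pattern together (two dimensions).
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(conditional_covariance(data, 0, 1) > 0)  # same dimension: True
print(conditional_covariance(data, 0, 2) < 0)  # different dimensions: True
```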
Table 24
Descriptor Clusters Identified by DETECT
Cluster Descriptor
1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
2 15, 17, 26, 27
3 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30, 31, 32, 33
4 34, 35
An exploratory CCPROX/HCA analysis was also performed to visually examine
the most interpretable cluster solution in the data. The primary skill of each descriptor
was analyzed in order to represent the skills-by-descriptors relationship. Figure 4
displays part of the CCPROX/HCA output from Levels 18 to 34. Each column illustrates
one level of cluster analysis, with descriptors within the different clusters separated by
asterisks (***). Visual inspection suggested that the five-cluster solution is likely to be
the most interpretable. From Level 22 onward, the CON descriptors formed one large
cluster without being split by descriptors of other skills. The ORG and VOC descriptors
were also found to form two distinct clusters from an early stage of the HCA solution.
Although the GRM and MCH descriptors showed some fuzzy areas within their clusters,
they also appeared to occupy separate dimensions.
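The CCPROX/HCA procedure is, at heart, agglomerative clustering over a proximity matrix. The following is a minimal average-linkage sketch on a hypothetical four-descriptor dissimilarity matrix (the actual CCPROX proximities are conditional-covariance based); the two-cluster level recovers the intended grouping.

```python
# Minimal average-linkage agglomerative clustering, echoing the HCA idea
# of merging the two closest clusters at each level. The four-descriptor
# dissimilarity matrix is invented for illustration: Da/Db behave like
# one skill, Dc/Dd like another.

def hca(dissim, labels):
    """Return the sequence of cluster partitions, one per merge level."""
    clusters = [[l] for l in labels]
    idx = {l: i for i, l in enumerate(labels)}
    levels = [[tuple(c) for c in clusters]]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise dissimilarity between the two clusters
                d = sum(dissim[idx[x]][idx[y]]
                        for x in clusters[a] for y in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best:
                    best, pair = d, (a, b)
        a, b = pair
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        levels.append([tuple(c) for c in clusters])
    return levels

labels = ["Da", "Db", "Dc", "Dd"]
dissim = [[0.0, 0.1, 0.9, 0.8],
          [0.1, 0.0, 0.85, 0.9],
          [0.9, 0.85, 0.0, 0.15],
          [0.8, 0.9, 0.15, 0.0]]
levels = hca(dissim, labels)
# The two-cluster level separates {Da, Db} from {Dc, Dd}.
print(sorted(levels[2]))  # [('Da', 'Db'), ('Dc', 'Dd')]
```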
Figure 4. CCPROX/HCA results
The dimensional hypothesis developed using the exploratory methods was further
tested with a confirmatory DIMTEST, which examined whether the data exhibited
multidimensionality corresponding to the five identified writing skills (CON, ORG,
GRM, VOC, and MCH). As Table 25 shows, the five DIMTEST runs produced five
rejections of the null hypothesis of unidimensionality, indicating that the five writing
skills are statistically distinct dimensions (p < .01). The dimensionality test statistic, T,
further indicated that CON, ORG, and GRM had a greater magnitude of
multidimensionality than VOC and MCH.
Table 25
Confirmatory DIMTEST Results
Writing skill No. of descriptors T p
CON 8 6.0456 0.00
ORG 6 8.5905 0.00
GRM 12 6.4971 0.00
VOC 5 3.3222 0.00
MCH 4 5.3902 0.00
The set of exploratory and confirmatory dimensionality analyses determined that
ESL academic writing ability comprises five distinct skills (CON, ORG, GRM, VOC,
and MCH) and showed that the underlying dimensional structure of CON and ORG was
distinctly different from that of GRM, VOC, and MCH. This result is consistent with
ESL academic writing theories that characterize writing ability as a constellation of
multiple skills. The dimensionality results further confirmed that the initial Q-matrix
specifying the relationship between the five writing skills and the descriptors was
reasonable, making it possible to begin estimating a diagnostic model.
Diagnostic Function of the EDD Checklist
The diagnostic model of the EDD checklist was examined from a variety of
perspectives, beginning with an examination of model convergence, then proceeding to
parameter estimation and model fit. Diagnostic skills mastery profiles were then
constructed in order to examine its diagnostic capacity.
Evaluation of Model Convergence
A Markov chain length of 20,000 was used with a burn-in length of 10,000.
Figures 5 and 6 show the three different types of plots that were visually inspected to
determine whether a steady state had been achieved in the Markov chain. Two model
parameters, pMCH (the pk parameter estimate for MCH; i.e., the proportion of masters for
MCH) and r*2,2 (the r* parameter estimate for ORG of D02), were used as examples
because these two parameters showed the most unstable and jumpy chains after the
burn-in phase. The first graphs in Figures 5 and 6 illustrate a density plot, in which one
dominant mode is indicative of model convergence. The second graphs are a time-series
plot, which signals non-convergence when the chains are jumpy or monotonic. The third
graphs are an autocorrelation plot, in which slow convergence is indicated by
autocorrelations greater than 0.2 for lags smaller than 200.
A visual inspection of the plots for pMCH in Figure 5 indicates that although a
slightly unstable and jumpy pattern was observed in the time-series plot, the overall skill
estimation was considered to have converged: a unimodal distribution was found in the
density plot, and the autocorrelations were low after the burn-in phase. Figure 6
illustrates another possible case of slow convergence, in r*2,2. Despite the slightly jumpy
pattern observed in the time-series plot, the overall descriptor estimation appeared to
have converged, as indicated by the other two plots.
Figure 5. Density, time-series, and autocorrelation plots for pMCH
Figure 6. Density, time-series, and autocorrelation plots for r*2,2
Each of the model parameter estimates was examined in this way in order to
determine convergence. Although a few parameters exhibited evidence of slow
convergence, as was the case with pMCH and r*2,2, the overall pattern of the Markov chain
plots suggests that convergence had occurred for most of the parameter estimates.
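The autocorrelation criterion described above (flag slow convergence when post-burn-in autocorrelations exceed 0.2 at lags below 200) can be computed as follows. The chain here is synthetic noise around a pMCH-like value, not actual Arpeggio output.

```python
# Sketch of the lag-k autocorrelation diagnostic applied to a
# post-burn-in MCMC chain. A synthetic, nearly independent chain
# stands in for a real parameter draw.
import random

def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0:
        return 0.0
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

random.seed(0)
# well-mixed draws fluctuating around 0.62 (a pMCH-like value)
chain = [0.62 + random.gauss(0, 0.02) for _ in range(2000)]
slow = any(abs(autocorrelation(chain, lag)) > 0.2 for lag in range(1, 200))
print("slow convergence suspected:", slow)
```

For a well-mixed chain like this one the flag stays off; a strongly autocorrelated chain (e.g., a random walk) would trip the 0.2 threshold at small lags.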
Evaluation of Parameter Estimation
As model convergence had been achieved, the descriptor parameter estimates
were evaluated substantively and statistically to determine the diagnostic quality of each
descriptor relative to its required skills. Table 26 presents the initial parameter
estimates, π* and r*, for the 35 descriptors. When the π* parameter was inspected, it
was found that D34 had a π* value less than 0.6, indicating that students were unlikely
to correctly execute MCH by appropriately indenting the first sentence of each
paragraph in their writing even if they had mastered the skill. Although the MCH skill's
weak association with the ability to indent was suspected of contributing to the low π*
value, the skill was not reassigned because of the lack of substantive evidence
supporting such a revision.
Table 26
Initial Descriptor Parameter Estimates

Descriptor | π* | r* (by skill)
D01 | 0.87 | CON 0.68
D02 | 0.94 | CON 0.47, ORG 0.81
D03 | 0.94 | CON 0.46, ORG 0.69, VOC 0.81
D04 | 0.88 | CON 0.74, ORG 0.61
D05 | 0.98 | CON 0.10, ORG 0.80
D06 | 0.77 | CON 0.46
D07 | 0.95 | CON 0.29, ORG 0.86
D08 | 0.86 | CON 0.28
D09 | 0.91 | ORG 0.52
D10 | 0.80 | ORG 0.06
D11 | 0.91 | CON 0.85, ORG 0.30
D12 | 0.91 | ORG 0.45
D13 | 0.83 | CON 0.43, ORG 0.35
D14 | 0.78 | ORG 0.37, GRM 0.82, VOC 0.88
D15 | 0.89 | GRM 0.62
D16 | 0.99 | GRM 0.65
D17 | 0.90 | GRM 0.78, MCH 0.66
D18 | 0.87 | GRM 0.77, MCH 0.48
D19 | 0.84 | GRM 0.29
D20 | 0.86 | GRM 0.46
D21 | 0.93 | GRM 0.74
D22 | 0.96 | GRM 0.73
D23 | 0.92 | GRM 0.56
D24 | 0.87 | GRM 0.51
D25 | 0.91 | GRM 0.76
D26 | 0.77 | VOC 0.12
D27 | 0.97 | VOC 0.30
D28 | 0.83 | VOC 0.67
D29 | 0.87 | GRM 0.27, VOC 0.28
D30 | 0.87 | GRM 0.41, VOC 0.97
D31 | 0.81 | GRM 0.60, MCH 0.48
D32 | 0.88 | MCH 0.27
D33 | 0.97 | GRM 0.89, MCH 0.59
D34 | 0.54 | ORG 0.82, MCH 0.67
D35 | 0.94 | CON 0.95, ORG 0.94, GRM 0.96, VOC 0.96, MCH 0.80
Of great interest were descriptor parameters with a high r* value. Six
parameters exhibited an r* value greater than 0.9, making them candidates for
elimination from the initial Q-matrix entries. These parameters were revisited in order to
inspect the model convergence, the descriptors-by-skills relationship (ratio), and the
importance of each skill to a particular descriptor, after which they were dropped from
the Q-matrix entries in a step-wise manner. Most parameters were removed from D35
because four of the skills, CON, ORG, GRM, and VOC, were found to be non-informative
in accurately estimating the ability to use appropriate tone and register. The insignificant
contribution of VOC was somewhat unexpected, because vocabulary knowledge has long
been considered, from a theoretical point of view, to be associated with tone and register.
The finalized descriptor parameter estimates are presented in Table 27. Most of
the π* values were close to 1 (except for D34), supporting the robustness of the skills
diagnosis modeling. The low r* values were also indicative of high diagnostic power,
suggesting that the parameters contribute much information for distinguishing masters
from non-masters of a particular skill.
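The step-wise pruning rule can be sketched with a few of the initial estimates from Table 26 (only the D30 and D35 rows are included, and the actual procedure also re-estimated the model after each removal):

```python
# Sketch of the pruning rule described above: r* parameters close to 1
# carry little diagnostic information, so Q-matrix entries with r* > 0.9
# become candidates for step-wise removal. Values are the D30 and D35
# rows of the initial estimates in Table 26.

initial_r = {
    ("D30", "GRM"): 0.41,
    ("D30", "VOC"): 0.97,
    ("D35", "CON"): 0.95,
    ("D35", "ORG"): 0.94,
    ("D35", "GRM"): 0.96,
    ("D35", "VOC"): 0.96,
    ("D35", "MCH"): 0.80,
}

candidates = sorted(k for k, v in initial_r.items() if v > 0.9)
print(candidates)
# [('D30', 'VOC'), ('D35', 'CON'), ('D35', 'GRM'), ('D35', 'ORG'), ('D35', 'VOC')]
```

This flags exactly the entries dropped from D30 and D35 in Table 27; the study's full procedure re-ran the estimation after each drop rather than removing all candidates at once.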
Table 27
The Final Descriptor Parameter Estimates

Descriptor | π* | r* (by skill)
D01 | 0.87 | CON 0.68
D02 | 0.93 | CON 0.48, ORG 0.80
D03 | 0.94 | CON 0.47, ORG 0.69, VOC 0.80
D04 | 0.88 | CON 0.72, ORG 0.64
D05 | 0.97 | CON 0.10, ORG 0.81
D06 | 0.77 | CON 0.46
D07 | 0.95 | CON 0.29, ORG 0.88
D08 | 0.86 | CON 0.28
D09 | 0.91 | ORG 0.51
D10 | 0.79 | ORG 0.05
D11 | 0.90 | CON 0.84, ORG 0.29
D12 | 0.90 | ORG 0.45
D13 | 0.82 | CON 0.43, ORG 0.34
D14 | 0.78 | ORG 0.36, GRM 0.81, VOC 0.88
D15 | 0.89 | GRM 0.62
D16 | 0.99 | GRM 0.65
D17 | 0.91 | GRM 0.80, MCH 0.68
D18 | 0.88 | GRM 0.80, MCH 0.53
D19 | 0.83 | GRM 0.30
D20 | 0.86 | GRM 0.46
D21 | 0.93 | GRM 0.74
D22 | 0.96 | GRM 0.72
D23 | 0.92 | GRM 0.54
D24 | 0.87 | GRM 0.51
D25 | 0.91 | GRM 0.76
D26 | 0.76 | VOC 0.12
D27 | 0.97 | VOC 0.30
D28 | 0.82 | VOC 0.69
D29 | 0.88 | GRM 0.24, VOC 0.28
D30 | 0.87 | GRM 0.39 (VOC entry dropped)
D31 | 0.82 | GRM 0.66, MCH 0.51
D32 | 0.90 | MCH 0.42
D33 | 0.96 | MCH 0.59 (GRM entry dropped)
D34 | 0.54 | ORG 0.84, MCH 0.74
D35 | 0.90 | MCH 0.76 (CON, ORG, GRM, and VOC entries dropped)
Note. Q-matrix entries dropped due to non-significance during the step-wise elimination are marked as dropped.
Once the descriptor parameter estimates were evaluated, the skill parameter
estimates were inspected. Figure 7 presents the proportion of masters (pk) across the five
skills in the examinee population. MCH had the highest proportion of masters
(pMCH=0.62), whereas VOC had the lowest proportion of masters (pVOC=0.46).
Since skills with a low pk are expected to be difficult and skills with a high pk are
expected to be easy, the results were interpreted as a difficulty hierarchy of the five
writing skills: VOC (pVOC=0.46) was the most difficult skill, followed by CON
(pCON=0.50), GRM (pGRM=0.53), ORG (pORG=0.58), and MCH (pMCH=0.62). This
hierarchy pattern was
consistent with research findings in ESL writing indicating that while ESL learners may
have to make a substantial effort to expand their vocabulary, they acquire mechanical
writing conventions relatively easily (Leki & Carson, 1994; Raimes, 1985; Silva, 1992).
It was also reasonable that content fulfillment was the second most difficult skill, given
that presenting ideas in a logical piece of writing is a cognitively demanding task.
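The difficulty hierarchy above follows directly from sorting the estimated pk values in ascending order, as this minimal sketch shows:

```python
# The proportion-of-masters (pk) estimates from Figure 7; a lower pk
# means fewer students mastered the skill, i.e., a harder skill.
pk = {"CON": 0.50, "ORG": 0.58, "GRM": 0.53, "VOC": 0.46, "MCH": 0.62}

# sort skills from hardest (lowest pk) to easiest (highest pk)
hierarchy = sorted(pk, key=pk.get)
print(hierarchy)  # ['VOC', 'CON', 'GRM', 'ORG', 'MCH']
```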
Figure 7. Proportion of skill masters (pk): CON = 0.50, ORG = 0.58, GRM = 0.53, VOC = 0.46, MCH = 0.62
Evaluation of Model Fit
As the parameter estimates were determined to be satisfactory, the model fit was
examined using posterior predictive model checking methods. Figure 8 compares the fit
between observed and predicted score distributions. While the predicted score
distributions approximated the observed score distributions, misfit was found at the
lowest and highest distributions, indicating that the model overestimated the low-level
students and underestimated high-level students. Nonetheless, the misfit was considered
negligible because the overestimated low-level students were still classified as non-
masters on all skills and the underestimated high-level students were still classified as
masters on all skills. The Mean Absolute Difference (MAD) between predicted and
observed item proportion-correct scores was also 0.0027, a negligibly small value
supporting the claim of good fit.
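As a minimal sketch, the MAD statistic is simply the mean absolute gap between the observed and model-predicted proportion-correct vectors; the four values below are illustrative, not the study data.

```python
# Sketch of the Mean Absolute Difference (MAD) fit statistic: the mean
# absolute gap between observed and model-predicted proportion-correct
# scores across descriptors. Illustrative values only.

def mean_absolute_difference(observed, predicted):
    assert len(observed) == len(predicted)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

observed = [0.73, 0.65, 0.74, 0.83]
predicted = [0.731, 0.648, 0.742, 0.829]
print(round(mean_absolute_difference(observed, predicted), 4))  # 0.0015
```

A value near zero, like the study's 0.0027, indicates the model reproduces the observed descriptor difficulties closely.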
Figure 8. Observed and predicted score distributions
The overall goodness of the model fit was also evaluated by examining whether
a monotonic relationship existed between the number of mastered skills and observed
total scores. A monotonic relationship was assumed to be an indication of good fit.
Figure 9 presents the relationship between the two variables for the 480 students. Each
data point represents a cluster of students, and a student was classified as a master of a
skill when his or her posterior probability of mastery (ppm) for that skill was greater than
0.6. The linear relationship between the two variables supported the claim of good fit, as
evidenced by the strong positive association, Pearson product-moment correlation
coefficient r = .915, p < .001.
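The monotonicity check can be sketched as follows, using the 0.6 ppm mastery cutoff from above and a hypothetical four-student sample (the study computed the same Pearson correlation over all 480 students):

```python
# Sketch of the model-fit monotonicity check: correlate the number of
# mastered skills (ppm > 0.6 counts as mastery) with the observed total
# score. The four students below are invented for illustration.

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# each entry: (ppm for the 5 skills, observed total score)
students = [
    ([0.1, 0.2, 0.3, 0.1, 0.5], 6),
    ([0.7, 0.3, 0.8, 0.2, 0.9], 18),
    ([0.9, 0.7, 0.8, 0.5, 0.9], 27),
    ([0.9, 0.9, 0.9, 0.8, 0.9], 34),
]
n_mastered = [sum(p > 0.6 for p in ppm) for ppm, _ in students]
totals = [t for _, t in students]
r = pearson(n_mastered, totals)
assert r > 0.9  # strongly monotonic in this toy sample
```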
Figure 9. The relationship between the number of mastered skills and the total scores
Evaluation of Diagnostic Quality
Performance difference between masters and non-masters
As model convergence and fit were achieved, diagnostic capacity was examined
by comparing the proportion-correct scores of masters and non-masters across the 35
descriptors. A drastic performance difference between masters and non-masters was
assumed to be strong evidence of the descriptors' good diagnostic capacity. Figure 10
shows that descriptor masters performed decidedly better than descriptor non-masters.
The proportion-correct score differences between masters and non-masters ranged from
0.14 to 0.82 across the 35 descriptors, with a mean of 0.49, suggesting that the
descriptors distinguished masters from non-masters well.
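The screening rule applied in the next step (flag descriptors whose master/non-master proportion-correct gap falls below 0.4) can be sketched as follows; the D34 values come from Table 28, while D99 is a hypothetical well-functioning descriptor added for contrast.

```python
# Sketch of the diagnostic-capacity screen: flag descriptors whose
# proportion-correct gap between masters and non-masters is under 0.4.
# D34 uses the Table 28 values; D99 is a made-up strong descriptor.

p_correct = {  # descriptor: (p for masters, p for non-masters)
    "D34": (0.52, 0.38),
    "D99": (0.90, 0.35),  # hypothetical, for contrast
}

weak = [d for d, (m, nm) in p_correct.items() if m - nm < 0.4]
print(weak)  # ['D34']
```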
Figure 10. Performance difference between descriptor masters and non-masters
Although the overall diagnostic capacity of the descriptors was satisfactory, an
in-depth analysis was conducted on individual descriptors suspected of having poor
diagnostic power: those with proportion-correct score differences between masters and
non-masters of less than 0.4. Table 28 lists these descriptors with their proportion-correct
scores (p-values), π*s, and r*s. Approximately 34% of the descriptors exhibited poor
diagnostic power. The descriptive analysis indicated that these descriptors were
relatively easy compared to the others, with proportion-correct scores greater than the
mean of 0.64 (see the second column for the p-values for all students). The relatively
high r* values were problematic, suggesting a low discriminant function in determining
masters and non-masters. D34 in particular exhibited the poorest diagnostic power, with
a proportion-correct score difference between masters and non-masters of only 0.14, an
extremely low π* value (0.54), and high r* values (0.84 for ORG and 0.74 for MCH).
Table 28
Descriptors with Poor Diagnostic Power

Descriptor | p (Total) | p (Masters) | p (Non-masters) | π* | r*
D01 | 0.73 | 0.87 | 0.59 | 0.87 | 0.68 (CON)
D04 | 0.65 | 0.88 | 0.50 | 0.88 | 0.72 (CON), 0.64 (ORG)
D15 | 0.74 | 0.90 | 0.54 | 0.89 | 0.62 (GRM)
D16 | 0.83 | 1.00 | 0.62 | 0.99 | 0.65 (GRM)
D17 | 0.73 | 0.92 | 0.60 | 0.91 | 0.80 (GRM), 0.68 (MCH)
D18 | 0.67 | 0.90 | 0.50 | 0.88 | 0.80 (GRM), 0.53 (MCH)
D21 | 0.82 | 0.94 | 0.68 | 0.93 | 0.74 (GRM)
D22 | 0.84 | 0.96 | 0.69 | 0.96 | 0.72 (GRM)
D25 | 0.81 | 0.92 | 0.69 | 0.91 | 0.76 (GRM)
D28 | 0.68 | 0.85 | 0.56 | 0.82 | 0.69 (VOC)
D34 | 0.46 | 0.52 | 0.38 | 0.54 | 0.84 (ORG), 0.74 (MCH)
D35 | 0.82 | 0.92 | 0.65 | 0.90 | 0.76 (MCH)
Accuracy of skill mastery classification
a. Number of skill masters
As the overall diagnostic capacity appeared to be satisfactory, writing skill
profiles were constructed by classifying students into master, non-master, or
undetermined groups. Students with a posterior probability of mastery (ppm) greater than
0.6 for a skill were classified as masters of that skill. Those with a ppm lower than 0.4
were classified as non-masters, and those with a ppm between 0.4 and 0.6 were
undetermined (i.e., neither masters nor non-masters). A mastery classification that
mirrored the proportion of skill masters (pk) was taken as support for the accuracy of the
diagnostic model. Figure 11 presents the skill mastery classifications for the 480 students.
The greatest number of students (n=290) mastered MCH, whereas the smallest number of
students (n=203) mastered VOC. Along the same lines, the smallest number of students
(n=156) did not master MCH, whereas the greatest number of students (n=244) did not
master VOC. This result echoes the findings of the proportion of skill masters (pk)
discussed in Figure 7: the highest probability of mastery was found in MCH (pMCH=0.62),
and the lowest probability of mastery was found in VOC (pVOC=0.46).
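The three-way classification rule can be stated compactly; the thresholds 0.4 and 0.6 are those used in the study, while the ppm values in the usage line are illustrative.

```python
# The three-way skill mastery classification used above: ppm > 0.6 means
# master, ppm < 0.4 means non-master, anything between is undetermined.

def classify(ppm):
    if ppm > 0.6:
        return "master"
    if ppm < 0.4:
        return "non-master"
    return "undetermined"

print([classify(p) for p in (0.85, 0.12, 0.5)])
# ['master', 'non-master', 'undetermined']
```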
Figure 11. Classification of skill mastery (masters / non-masters / undetermined): CON = 231/238/11, ORG = 272/194/14, GRM = 248/216/16, VOC = 203/244/33, MCH = 290/156/34
b. Skills probability distribution
The skills mastery classification was further examined on simulated examinee
item response data (n=100,000 simulees). If the estimated model had high diagnostic
function, it was assumed to generate various types of skills mastery profiles, reducing the
possibility of flat skill profiles. The built-in Arpeggio program, Simarpeggulator,
estimated a population probability distribution on the space of all possible 0 (non-
mastery) and 1 (mastery) skill mastery level profile vectors. As the number of skills was
K = 5, the joint population skills distribution consisted of 32 possible mastery profiles
with 0 and 1 vectors: (00000), (00001), (00010), (00100),…, (11111). Figure 12
summarizes the distribution of students across different numbers of mastered skills,
illustrating that an approximately similar proportion of students (ranging from 11.15% to
21.40%) were distributed across the skill categories. The difference between the zero-
skill mastery profile (in which the smallest proportion of students were assigned) and
three-skill mastery profile (in which the largest proportion of students were assigned)
was only 10.25%. It was also notable that the flat skill profiles (00000 and 11111) did
not dominate the other skill profiles. As the graph illustrates, students who fell into these
two flat categories accounted for only 11.15% and 19.29% of the total, respectively,
indicating that the estimated skills diagnostic model discriminates well.
Figure 12. Distribution of the number of mastered skills
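With K = 5 binary skills, the profile space described above can be enumerated directly. A sketch (the ordering here is lexicographic, which differs slightly from the order listed in the text):

```python
from itertools import product

K = 5  # CON, ORG, GRM, VOC, MCH
# All 2^K = 32 mastery profile vectors, from "00000" through "11111"
profiles = ["".join(str(b) for b in bits) for bits in product((0, 1), repeat=K)]
```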
c. Most common skill mastery patterns
Figure 13 presents the most common skill mastery pattern in each number of
skill mastery categories. When students mastered only one skill, it was most often
“00001”, rather than other one-skill profiles such as “10000”, “01000”, “00100”, or
“00010”. When students mastered two skills, “01001” was the most prevalent skill
mastery pattern. Inspecting the most common skill mastery patterns was expected either
to lend further support to the skill difficulty ordering estimated by the diagnostic model
or to call it into question. If the skill difficulty identified in the previous analyses did not hold, the
diagnostic accuracy of the model would have been suspect. Figure 13 shows that students
tended to master easy skills, such as MCH (00001) and ORG (01001), before they
mastered more difficult skills. Of the five skills, VOC was the last skill that students
mastered as indicated by the mastery pattern “11101”. This skill development was
consistent with the skill difficulty (pk) discussed in Figure 7, confirming that VOC is the
most difficult and MCH is the easiest skill. It is also interesting that the six skill mastery
patterns shown in Figure 13 are among the seven most frequent profiles, further
confirming the hierarchy of skill difficulty.
Figure 13. The most common skill mastery pattern in each number of skill mastery
categories
Consistency of skill mastery classification
The diagnostic quality of the estimated model was also evaluated by focusing on the
consistency of the skill classification. The built-in Arpeggio program, Tabulator, used
simulated examinee item response data (n=100,000 simulees) to calculate (a) the
proportion of times that each student was classified correctly on the test according to the
known true skill state (correct classification rate: CCR), (b) the proportion of times each
student was classified the same on the two parallel tests (test-retest consistency: TRC),
and (c) classification agreement adjusted for chance. Table 29 presents these reliability
indices of the skill classification. The overall CCR and TRC values were high (M = 0.94
for overall CCR and M = 0.89 for overall TRC), supporting the consistency of the skill
classification. Among the five skills, CON showed the highest reliability indices and
MCH the lowest. Cohen's kappa statistics echoed these results, with substantially high
agreement rates across the five skills.
Table 29
Consistency Indices of Skill Classification
| Skill | Overall CCR | CCR for masters | CCR for non-masters | Cohen's kappa | Overall TRC | TRC for masters | TRC for non-masters |
|-------|-------------|-----------------|---------------------|---------------|-------------|-----------------|---------------------|
| CON   | 0.97        | 0.97            | 0.97                | 0.94          | 0.94        | 0.94            | 0.95                |
| ORG   | 0.96        | 0.96            | 0.95                | 0.91          | 0.92        | 0.92            | 0.90                |
| GRM   | 0.96        | 0.97            | 0.95                | 0.92          | 0.92        | 0.94            | 0.91                |
| VOC   | 0.93        | 0.93            | 0.94                | 0.87          | 0.88        | 0.87            | 0.88                |
| MCH   | 0.88        | 0.93            | 0.81                | 0.75          | 0.80        | 0.87            | 0.69                |
| M     | 0.94        | 0.95            | 0.93                | 0.88          | 0.89        | 0.91            | 0.87                |
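Cohen's kappa corrects raw agreement for the agreement expected by chance. For a binary master/non-master classification it can be computed as sketched below; this is an illustrative formula, not the Tabulator output itself, and the marginal proportions used in the example are hypothetical:

```python
def cohens_kappa(p_observed, p_chance):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    return (p_observed - p_chance) / (1.0 - p_chance)

# For a binary classification, chance agreement follows from the marginals.
# Hypothetical example: both classifications call 60% of students masters.
p_master_a = p_master_b = 0.6
p_chance = p_master_a * p_master_b + (1 - p_master_a) * (1 - p_master_b)  # 0.52
kappa = cohens_kappa(0.94, p_chance)
```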
Tabulator also calculated the proportion of examinees whose estimated mastery
classification was correct; that is, the probability that an examinee whose true state is
mastery is estimated as a master, and that an examinee whose true state is non-mastery
is estimated as a non-master. Table 30 demonstrates that 96.6% of
the simulees had none or only one error in their estimated skill profiles, indicating that
having more than one incorrect skill mastery classification is very unlikely. The high
correct estimation rates thus further confirmed that the diagnostic skill profiles generated
by the model are reliable.
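The error counts in Table 30 amount to comparing each simulee's estimated profile with the known true profile, skill by skill. A sketch with made-up profiles (the function name and example profiles are ours):

```python
def n_misclassified(true_profile, estimated_profile):
    """Count skills whose estimated mastery state differs from the true state."""
    return sum(t != e for t, e in zip(true_profile, estimated_profile))

# Hypothetical simulee: true profile 11001, estimated profile 11101
errors = n_misclassified("11001", "11101")  # one skill misclassified
```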
Table 30
Proportion of Students by Number of Incorrectly Classified Skills

| No. of incorrectly classified skills | 0    | 1    | 2   | 3   | 4   | 5   |
|--------------------------------------|------|------|-----|-----|-----|-----|
| Proportion of students (%)           | 74.0 | 22.6 | 3.1 | 0.3 | 0.0 | 0.0 |
Skill mastery profiles across different essay prompts
a. Proportion of skill masters
The diagnostic quality of the estimated model was further examined by focusing
on the extent to which the skill mastery profiles were constructed differently across
different essay prompts. A diagnostically robust model was assumed to generate stable
skill mastery profiles unaffected by the method effect. Figure 14 shows the
proportion of masters across the five writing skills for the subject and cooperation
prompts. The graph clearly illustrates that the two prompts attained a similar proportion
of masters for all the skills but MCH. Although the cooperation prompt showed a
slightly higher proportion of masters for CON, GRM, and VOC, the difference appeared to be
negligible; however, the difference in the mastery proportion for MCH was
substantial (13.33%), indicating that skill's unstable functioning.
Figure 14. Proportion of masters for the subject and cooperation prompts
b. Most common skill mastery patterns
Fine-grained skill mastery profiles were thought to provide more specific
information about underlying performance differences across the two prompts. Figures
15 and 16 present the most common skills mastery patterns in the different numbers of
skills that students mastered. The high mastery probability of MCH on the cooperation
prompt suggested in Figure 14 was clearly manifested in the specific skills mastery
patterns of “00001”, “01001”, “11001”, “10111”, and “11111” in Figure 16, indicating
that MCH is the most basic skill and must be mastered before the other skills on that
prompt.
Figure 15. The most common skill mastery patterns for the subject prompt
Figure 16. The most common skill mastery patterns for the cooperation prompt
c. Number of mastered skills
The extent to which students are likely to master the same number of skills
across different prompts was also expected to provide valuable insights into the
diagnostic capacity of the model. If a significant discrepancy were found across the
prompts, the robustness of the diagnostic model would be suspect. Figure 17 compares
the proportion of masters across the different numbers of skills that students mastered on
the two prompts. Although almost the same proportion of students mastered zero, one,
three, or four skills, a notable difference was found in the proportion of students who
mastered two or five skills. The mastery probability for five skills was considerably higher
for the cooperation prompt, whereas the mastery probability for two skills was higher for
the subject prompt. This performance difference suggests that the diagnostic function of
the model must be carefully reexamined.
Figure 17. Number of mastered skills for the subject and cooperation prompts
Skill mastery profiles across different proficiency levels
a. Overall proportion of skill masters
Student diagnostic skill profiles were examined in order to focus on how skills
mastery profiles differ across different writing proficiency levels. A diagnostically well-
constructed model was assumed to produce skill profiles that had distinctively different
characteristics across different proficiency levels. The 480 students were divided into
three proficiency groups: beginner, intermediate, and advanced. The beginner group
comprised students whose TOEFL independent writing scores ranged from 1 to 2.5
(n=103), the intermediate group's scores ranged from 3 to 3.5 (n=205), and the advanced
group's scores ranged from 4 to 5 (n=172). Figure 18 shows the proportion of skill
masters across different proficiency levels, and indicates that the estimated diagnostic
model differentiates substantially among students at different writing proficiency levels.
The skill mastery probabilities for the intermediate and advanced groups were decidedly
higher than those for the beginner group across the five skills.
Figure 18. Proportion of masters across different proficiency levels
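The proficiency grouping described above can be sketched as a simple cut-score function, using the score bands from the text (the function name is ours, not part of the study's tooling):

```python
def proficiency_group(writing_score):
    """Assign a proficiency group from a TOEFL independent writing score (1-5)."""
    if writing_score <= 2.5:
        return "beginner"      # scores 1-2.5 (n=103)
    if writing_score <= 3.5:
        return "intermediate"  # scores 3-3.5 (n=205)
    return "advanced"          # scores 4-5 (n=172)
```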
Notably, the beginner and intermediate groups showed a very similar skills
mastery pattern that was distinctively different from that of the advanced group. These
two groups had a higher proportion of masters for ORG (34.95% for the beginner group
and 54.63% for the intermediate group) and MCH (29.13% for the beginner group and
60.00% for the intermediate group), and a lower proportion of masters for GRM (14.56%
for the beginner group and 40.00% for the intermediate group) and VOC (11.65% for the
beginner group and 32.20% for the intermediate group). Conversely, a substantially
higher proportion of students in the advanced group mastered GRM (87.79%) and VOC
(72.67%). These results were consistent with the skill difficulty discussed in the previous
analyses. Considering that VOC was the most difficult skill, it was reasonable that more
proficient students showed a higher probability of mastery for this skill.
b. Proportion of skill masters across different essay prompts
It was also worthwhile to examine whether the three groups maintained
distinctive skill mastery patterns without being affected by the prompt effect. The 240
students in each subject and cooperation prompt group were further divided into beginner,
intermediate, and advanced groups. The subject prompt group consisted of 54 beginner,
105 intermediate, and 81 advanced students, and the cooperation prompt group consisted
of 49 beginner, 100 intermediate, and 91 advanced students. Figures 19 and 20 compare
the skill mastery patterns for the three groups across the two essay prompts. The overall
skill mastery patterns for the beginner and advanced groups did not differ across the two
prompts: although a slightly higher proportion of students in the cooperation group
mastered MCH (32.65% for the beginner group and 81.32% for the advanced group) than
did students in the subject group (25.93% for the beginner group and 77.78% for the
advanced group), the general skill mastery patterns did not differ significantly. However,
a drastically different skill mastery pattern was observed in the intermediate group,
where the mastery rate for MCH increased remarkably (49.52% for the subject prompt
and 71.00% for the cooperation prompt) and the mastery rate for ORG slightly decreased
(59.05% for the subject prompt and 50.00% for the cooperation prompt) on the
cooperation prompt. The intermediate group's high proportion of MCH mastery on the
cooperation prompt might have contributed to the overall high mastery probability for
this skill, as shown in Figure 14.
Figure 19. Proportion of masters across different proficiency levels for the subject
prompt
Figure 20. Proportion of masters across different proficiency levels for the cooperation
prompt
c. Number of mastered skills
The association between the number of mastered skills and writing proficiency
levels was also examined, with a positive correlation between the two variables assumed
to be indicative of a good diagnostic model. As Figure 21 shows, the beginner group exhibited
a steadily decreasing proportion of masters as the number of mastered skills increased.
Although the pattern reversed between four (8.74%) and five (10.68%) mastered skills, a
negative association between the proportion of masters and the number of
mastered skills was apparent. The distribution of masters in the intermediate group
showed a typical bell-curve shape in which most students mastered two or three skills.
The advanced group was somewhat similar to the intermediate group in that most
students mastered two or three skills, but the advanced students showed a markedly
higher proportion of masters of four or five skills than those in the intermediate group.
The general association between the number of mastered skills and writing proficiency
levels thus supports the diagnostic power of the estimated model.
Figure 21. Number of mastered skills across different proficiency levels
Case analysis
A case analysis was conducted in order to further examine the quality of the
estimated skill profiles. Six cases were selected whose skill profiles were drastically
different in spite of similar observed scores. Table 31 presents background information
and skill profiles for these cases. The selected cases consisted of four male and two
female students who spoke a variety of native languages. They were awarded similar
observed scores ranging from 21 to 25, but mastered different numbers of skills.
Table 31
Case Profiles
| Case ID | Age | Gender | Native language | Observed score | ETS score | No. of mastered skills | Skill profile | Undetermined skill |
|---------|-----|--------|-----------------|----------------|-----------|------------------------|---------------|--------------------|
| 2207    | 16  | Female | Korean          | 21             | 3         | 0                      | 00000         | CON, ORG           |
| 1092    | 23  | Male   | Japanese        | 21             | 3         | 1                      | 00001         | ORG                |
| 1086    | 33  | Female | Turkish         | 21             | 3         | 2                      | 01001         | None               |
| 1133    | 18  | Male   | Korean          | 21             | 3.5       | 3                      | 01101         | None               |
| 2178    | 34  | Male   | Japanese        | 22             | 2         | 4                      | 01111         | None               |
| 2139    | 38  | Male   | Spanish         | 25             | 4         | 5                      | 11111         | None               |
Descriptive analysis indicated that while there was a moderately positive
association between observed scores and ETS scores, it was difficult to identify any
relationship with the skill profiles estimated in the diagnostic model. For example, the
observed score difference between Case 2207 and Case 2139 was only 4 points, but Case
2207 mastered no skills (with CON and ORG undetermined), while Case 2139 mastered
all five skills. Similarly, Case 1092 and Case 1133 had the same observed score of 21,
but their skill profiles differed drastically (“00001” versus “01101”).
This discrepancy raises more questions than it answers. Although Case 2207's
mastery of CON and ORG was undetermined rather than clearly absent, the student was
still considered a non-master of all skills. Conversely, Cases 1092 and 1133 were
considered masters of at least one skill. If a score report describing the discrepancy
between total observed score and skill profile were given to students, it is questionable
whether it would be useful for student learning, since students could be confused or
have different interpretations of their writing skill proficiency.
However, it is also possible that this discrepancy can be interpreted as
highlighting the need for diagnostic skill profiles. The case analysis clearly demonstrated
that students with the same observed scores did not necessarily have the same skill
profiles. Indeed, it generated many different skill profiles highlighting the various
strengths and weaknesses of students' ESL academic writing. If a single observed score
were provided to students, it could not really inform them about their writing strengths
and weaknesses, because it masks fine-grained, specific diagnostic information. Care
should therefore be taken when a diagnostic score report is created and given to students
who exhibit strikingly different observed scores and estimated skill profiles.
Correlation between EDD Scores and TOEFL Scores
The observed scores awarded by ESL teachers using the EDD checklist were
correlated with the original TOEFL iBT independent writing scores awarded by ETS
raters for the 480 essays. The correlations between the two sets of scores were moderate:
r = .61, p < .01 for the subject prompt and r = .70, p < .01 for the cooperation
prompt. The overall correlation for the 480 essays was also moderate, with r = .66, p
< .01. As was the case with the results discussed in Chapter 5, this moderate correlation
might indicate that, to some extent, the EDD checklist measures the same writing
construct that the TOEFL iBT independent writing rating scale measures. However, it is
also possible that the two measures tap different areas of the writing construct, because the
magnitude of the correlation was not substantially strong. Further evidence is needed to
support or reject the idea that the two measures yield convergent results.
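The prompt-level correlations reported above are Pearson product-moment correlations between the two score sets. The computation can be sketched as follows; the scores in the example are invented, not the study data:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented example: EDD observed scores vs. TOEFL iBT writing scores
edd = [21, 21, 22, 25, 30, 18]
toefl = [3, 3, 2, 4, 4.5, 2.5]
r = pearson_r(edd, toefl)
```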
Teacher Perceptions and Evaluations
Teacher Questionnaire Responses
Teachers' responses to the questionnaire were descriptively analyzed, focusing
on their evaluations of the EDD checklist. Reactions were generally positive,
with no extremes. When asked about their overall satisfaction with using the EDD
checklist in essay assessment, one teacher reported that she liked the checklist “a little
bit”, three teachers liked it “quite a lot”, four liked it “very much” and two liked it
“extremely”. With regard to the descriptors, two teachers reported that they were “quite
clear”, six said they were “very clear” and two said they were “extremely clear”.
When redundancy was examined, eight teachers felt that the descriptors were
“not redundant” and two felt they were “a little bit redundant”. The teachers thought
highly of the usefulness of the descriptors: two considered the checklist “quite useful”, six
thought it was “very useful”, and two thought it was “extremely useful”. The checklist's
comprehensiveness and relevance to ESL academic writing were also perceived
positively. Only two teachers reported that the checklist was “a little bit comprehensive”
or “quite comprehensive” in capturing all instances of ESL academic writing, and the
remaining eight teachers reported it to be either “very comprehensive” or “extremely
comprehensive”. A similar pattern was observed with regard to the perceived relevance
of the descriptors for ESL academic writing: eight teachers reported that the EDD
descriptors were “very relevant”, and two said they were “extremely relevant”.
Teachers' reactions to the binary system used in the checklist were somewhat
heterogeneous. Four teachers reported that the EDD checklist was “a little bit conducive”
or “quite conducive” to making a binary choice, while the remaining six reported it was
“very conducive” or “extremely conducive”. When asked the number of times they
read the given essays when marking them, five teachers said “twice”, three said “three
times”, and two said “more than three times”. When asked about the most or least important
descriptors in developing ESL academic writing, most of the teachers felt that the
descriptors related to content fulfillment (D01-D08) and organizational effectiveness
(D09-D14) were most important. A substantial consensus was not reached among the
teachers with regard to the least important descriptor, so that result could not be reported.
Teacher Interviews
The teachers' reactions to the EDD checklist were explored in greater detail
through analysis of their interview data collected during the pilot and main studies. This
reading of the transcripts identified a variety of evaluation themes, including the EDD
checklist's strengths and weaknesses and its diagnostic usefulness for classroom instruction
and assessment. It also showed how teachers' perceptions of the EDD checklist changed
over time.
What Are the Strengths of the EDD Checklist?
The checklist's comprehensiveness was considered an obvious strength. As
the interview excerpts illustrate, several teachers acknowledged that the checklist
covered multiple aspects of ESL writing:
Researcher: What do you think about the strength of the EDD checklist?

Angelina: I would say it's very comprehensive; it takes a lot of different aspects into account when it comes to student writing and assessing student writing. I think that is a strong point.

Tom: Absolutely, there is no question about that in my mind that the amount of information, feedback, and specifics of the feedback are very positive things for a feedback model. Absolutely, I have nothing to say about the way you have delineated things.
The checklist's fine-grained, specific writing subskills also appeared to
successfully guide the teachers' assessment. Both Brad and Esther commented that the
breakdown of writing skills helped them know what to look for while assessing essays:
Brad: I like it because it gives you a regular guideline that when you get an essay you need to look out for these things. It is useful for that; otherwise I may forget to check sections.

Esther: I thought that this is becoming so obvious because each one has a glaring thing that is coming out when you look at the list. You can see a lot of things are ok, but this one is missing the question, etc. What the essential element is does begin to emerge more clearly having done a number of them. I think that when using them in a repeated sense, the problem area does start to pop.
Esther commented that once she had internalized the checklist, the evaluation
criteria naturally emerged while she was marking essays, making the rating process go
smoothly (particularly with regard to basic descriptors such as capitalization):
Researcher: What do you think about the rating scale? Do you like it?

Esther: The yes/no scale? I do. The more I used it, the more I liked it. You had said in your directions to internalize it. Which, of course, you don't do after the first reading. You really only start to internalize after using it a few times. So after using it a few times, the taxonomy that you used, those items seemed to jump out of the paper more easily. Yes, some of the descriptors I really liked and found so easy to use. I actually really liked some of the basic descriptors, such as “are there capitals?” It's so easy. You know if they are there or if they are not there. It is internalized really quickly. Even things like the prepositions, even if the writing was great, but the prepositions were off.
A similar view was found in Erin's interview. She also found that the descriptors
assessing mechanical linguistic knowledge were easy to use, consistent with the findings
on overall teacher confidence:
Researcher: Was the EDD checklist easy to use?

Erin: Sometimes it was. 'Spelling,' 'punctuation,' 'capital letters' are easy to be confident as a clear yes or no. 'Indentation,' etc., 'articles,' some of this grammatical stuff also was quite easy. But, again, there were some areas that are a bit challenging. Yes, like “the paragraph is connected to the rest of the essay.” Statements like that are hard – it depends how you see “connected.” Are you reading it through logic of ideas, or is it vocabulary connection or transitional features? Stuff like that. Yes, but I thought it was quite good.
In addition, the teachers reported that the EDD checklist made them more
internally consistent raters. Brad commented that the checklist reduced the randomness in
his ratings:
Brad: What I like about this, I think my marking previously is much more erratic. It depends a lot on my personal feelings that day – if I'm in a good mood, my students will probably do better. If I'm in a bad mood, my students will probably do worse and I'll notice the errors. This at least provides some consistency and dampens down the effects.
Greg also said that the unique nature of the checklist helped him to maintain consistency:
Researcher: How about your consistency?

Greg: I thought it was very consistent. I thought there were only maybe 3 times out of 24 essays (35 descriptors each) when I could say, “Hmm, I'm not sure.” In that case, I wrote a little note so that you could think about it. I thought that with this it was very easy to be consistent.
In a heated debate, the teachers exhibited drastically different ideas with regard
to the effectiveness of the binary choice system. Angelina expressed her concern about
the lack of a continuum on which writing performance could be measured. She was
dubious about what a yes or no could indicate about writing competence. As she rightly
pointed out, yes does not imply absolute mastery of certain writing skills, and no does not
imply absolute non-mastery:
Researcher: Was the yes or no system easy to use?

Angelina: I think what made this so difficult is because there is no continuum. There is no medium. What does this mean exactly? Is this student competent in writing academically? Or not? There is no continuum. That is why I found that difficult. I think I would be. I think if there was a scale it would work. I think so. It was difficult because it was either a yes or a no.
Brad took a similar view, raising the issue of the lack of a scale. He reported that
having to make a binary choice increased his psychological load because dichotomizing
language competence into yes or no carried a huge weight. Brad suggested that a scale
would be psychologically more relaxing and would make it easier to make judgments
quickly; however, he also admitted that the binary option actually forced him to reread an
essay and deliberate more on its quality, which he felt was much fairer than reading an
essay only once, as he would if assessing on a scale. He even speculated that his criticism
of the yes or no system could be based on his own lack of confidence. Indeed, Brad had
the lowest confidence level of the seven teachers who participated in the pilot study:
Researcher: What made you so not sure about your decision?

Brad: I like to think there is a little bit of flexibility. Then, the student can see what they did, that they weren't all no or yes. I think because on some of them, if you have that 1-2-3-4 psychologically with a teacher, then they feel a bit more relaxing. Like if it's, okay, I'm not quite sure, but I think this is a 2 or 3 instead of making a definite 2 or 3. For a teacher, it wasn't such a convenient system. I found the whole yes and no, I didn't particularly enjoy that. I think it would've been better to have a little bit more of a grading, even 1-5 or 1-4.

Researcher: So, you still think that you might have felt more comfortable with a 4-point or 6-point rating scale?

Brad: Yes, definitely. I think for teacher and me as well it makes it easier. If you just say yes or no, it's a huge weight. For me, it's easier to put a 4 instead of 5, or a 2 instead of a no. Maybe it had more to do with my confidence style as a teacher. I don't know.

Researcher: Then, if you had used a scale, do you think it would have taken less time?

Brad: I think I would spend less time with the scale. You would still have yes or no, but you can break it up slightly. Yes and no is really a pass or fail. But, with the scale a no could be a 1 or 2 and when I put a 2, I would feel less like I was really saying no. I guess it's psychological really. But, it's quicker to think around a 1, 2, and 4. But here, sometimes I would deliberate for a long time. There are times that it's borderline and I reread it and speculate if it's yes or no. I think I would probably do it faster with a scale. But, maybe because it's yes or no I'm going back more often to read the essay. So, maybe it's fairer. If I was doing it with the scale I might read it one time only.

Researcher: Then, how many levels do you think would be appropriate to create a scale?

Brad: 5-point is not good because I could put a 3 too many times. If it is a 4-point scale it sort of makes me come down on either side. I know it's a 2 or 3 and I'm more relaxed to put a 2 or a 3. A more unsure teacher will still have it sort of work out as a yes or no. But, if you had let me put a 3 there, I would have been in a lot of danger. I would have put a lot of 3's.
Kara's stance on the binary system was slightly different. She took a middle-of-
the-road position, noting that the yes-or-no option was a fine system, although it
required a little practice. She also noted that raters tend to sit on the fence:
Researcher: What do you think about the binary choice, yes or no? Was it easy to assess the essays with this system?

Kara: I think you need to get your head into the right way of thinking about it, right? I mean there was a couple times where I would think about it and go back and change my mind. I would mark it a yes and then later when I got to another one and marked it a no, I would go back and change the other one. You realize that if it's a no here, it's a no there, but you need to be consistent. But, once you get it straight in your mind, it's not personal; it's not against the person. Either they have it or they don't. Honestly, I think it takes a little practice, but once you get into that mindset of yes or no, it's okay. Even with this it was a little challenging because you can easily fall into the 'not quite' category a lot. Then after you look at it and realize… You start to get a feel for it and you rethink it and start to think, "Maybe it wasn't as clear as it could have been, or as strong as it could have been, or maybe they didn't use transitions as much as I thought they did."
By contrast, Esther, Mark, and Greg all felt that the binary choice system was
both reliable and convenient. Contrary to what Brad believed, Esther thought that the
binary system actually increased her rating speed. She also pointed out that even if she
had been given a rating scale, she would have needed to distinguish between the middle
categories, a 2 or a 3:
Researcher: What do you think about the binary choice? Was it easy?

Esther: The yes or no aspect? It was quicker. I like that it was quicker.

Researcher: Is it user-friendly? Would you say it's friendly because it doesn't take much time?

Esther: At first I thought that it would be really hard. I know when we initially talked I say, yes or no, but when I looked at the list I thought, maybe there should be a Likert. And if the Likert were simple enough, like a 1-4, it would still be user-friendly. But, you would still be (humming and hawing) between a 2 or a 3. I didn't mind the yes or no. To be honest, I guess I didn't. I thought I would. But, once I started marking it was okay. In addition, a yes or no increased my speed. I hate to say it because it's so much easier to say yes or no. I loved that it was a yes or no because it makes it quick and I have tons of them to do. It's a quick thing. For instance, over 50%, you're good. However, I think a 4-point Likert might help. If there was a way of shrinking it and adding a scale, it might end up the same. Some of them you would be able to put a 4, but it is tough. Between a 2 and a 3 is still very tough. How would you distinguish between a 2 or a 3? You can easily distinguish between a 1 and a 4. That would be the challenge.
Tom and Greg provided deeper insight into the underlying mechanism of the
binary system. Both reported that they were able to answer a yes or a no question
confidently because the checklist had already broken down writing ability into specific
and distinct subskills. Tom went on to say that examining one aspect of writing at a time
helped him to focus, thereby enabling him to answer the descriptors more consistently:
Researcher: Or if you were given a 4-point rating scale, would you be more consistent?

Tom: No, because the questions are specific enough. If I had to deal with the 'language skill' lesson and reduce it to only 3 descriptors, then I would probably start to say I can't go yes or no, as there are too many variables. Once you delineate the variables like this, it becomes easy to say yes or no. I think the specificity is great because it tells people where to exactly focus their studies.
Along the same lines, Greg claimed that a holistic scale would have led him to
make vague assumptions about a writer's ability instead of focusing on the essay itself.
He further commented that because the EDD checklist broke down writing ability into 35
concrete descriptors, he did not struggle with uncertainty while rating the essays:
Greg: Specifically, I'm talking about the confidence rating. In every situation, I can honestly say that my confidence rating is 100%. Honestly. I have seen so many essays that I feel very confident in my evaluations. Furthermore, it is your 35 points that make sure the confidence levels are so high; everything is easily evaluated individually.

Consider a different situation: If I had two essays that were about the same overall, say something like...

Essay 1
Good length, good form, some good transitions
Basic vocabulary, accurate spelling and punctuation
But short uninteresting sentences, no flow
Basically a very formulaic essay, maybe even a little repetitive

Essay 2
A little short, with many spelling mistakes, mediocre form
Excellent advanced vocab, accurate collocations, generally easily readable
Well reasoned and supported with great examples, (in other words), a very thoughtful essay that probably indicates real mastery and comfort with the language.

These two essays might, overall, rate about the same. Let's say they both came in about a 4 or a 5 out of 6. Well, here a confidence measurement might be important. If I give Essay 1 a 5, I might be hesitant about that and say my confidence level was 50%, because maybe it should have only rated a 4. The content quality was not as good as the structural quality.

Similarly, with Essay 2, I might give it a 4 if I'm grading harshly, but I would not be confident about that because the overall quality of the essay might have deserved a better score, even if there were technical problems. If I gave it a 5, then I'm making assumptions about the person's ability, and not grading the essay itself. So I wouldn't feel confident about that either. But when you break it down into 35 parts and say, "This essay shows knowledge of English sentence structure." Yes or No? Well, it's easy to see if they understand the general format, even when there are occasional mistakes. Yes or No, in my mind, isn't complicated by maybes. My confidence will always be extremely high.

So.... is it going to be a problem that my confidence levels are so high? Would you like me to reconsider these?
What are the Weaknesses of the EDD Checklist?
Most teachers considered the checklist's length to be a serious problem. Brad
commented that the checklist was a time consuming and ineffective way to assess an
essay:
Researcher: How did you like the checklist?

Brad: I was, I like, in terms of like, I think I found it… it's kind of a time consuming way to look for an essay. So yeah, for that reason, I found it a little bit ineffective.
Esther also remarked on the length of the checklist, though she admitted that she
was not precise in making yes or no judgments for certain descriptors. If issues with an
evaluation criterion did not emerge automatically, she assumed that a student had met
that criterion:
Esther: In general, I would say, "I like the checklist," but there are areas that are a bit confusing for me. At first I thought it was too long. It's a lot. I found myself being lazy with some of them because I thought, "Well, it didn't jump out at me, so it is fine." You might be losing out in those areas. I'm fully fessing up to it since you are testing it, but I'm letting you know. I thought that was there, but I didn't go back because it was yes most of the time. I didn't have to be that precise, so I did find myself (not remembering). If any pronouns didn't match up with their pronouns, it did jump out.
Tom, on the other hand, thought that the lengthy process was an opportunity to
read an essay more thoroughly. This view is congruent with what teachers reported on
the questionnaire: although the checklist was time consuming, they appreciated its
comprehensiveness:
Tom: I found the whole process of 35 questions to be very easy. It's a lengthy process, that's the real problem. But, the fact that it's lengthy means that we are taking the time to look at it in a detailed way.
Another problem with the EDD checklist was associated with subjectivity. As
with other existing rating scales, the checklist was not free from the perils of subjective
judgment. Angelina questioned the meanings of the words “sophisticated” and “few”,
pointing out that a short essay is more likely to have few errors than a long essay, and
thus it is extremely difficult to define “few” without taking essay length into account:
Researcher: How is your general evaluation?

Angelina: When you start marking it, you see how some parts are not manageable. Certain words, what is considered "sophisticated" vocabulary? I think that becomes the question and therefore that leaves for interpretation and some people would interpret it differently than I would. So, that's what I thought. With "few," it implies frequency. But, how much is "few"? That might be where the confusion and difficulty lies in trying to judge. What is "few," what is yes, what is no?

Researcher: Actually, one teacher tried to count all the errors.

Angelina: I was actually thinking of that! I think that is a very systematic kind of approach. I think when you do it mathematically, it almost becomes very reliable. "It is a mathematical equation and this is how I use it." I was thinking about doing that, but there were so many other factors involved.

Researcher: Right! The problem with that is it really depends on the length of an essay. What if a student wrote a long essay?

Angelina: Exactly! So, it all goes back to the length. In that case, they will be penalized unfairly. It doesn't accurately reflect what the student has done. I felt that, too.
Esther's concern was somewhat different. She claimed that the adjective "clear"
was too vague to determine the quality of a thesis statement. However, despite this
limitation, she acknowledged that the checklist did not include too many subjective
indicators compared to other existing rating scales:
Esther: For some of these descriptors, the subjective things I found tough. Like, the thesis statement, almost everything I marked said, "Yes, there was a clear thesis statement." But, "Was it a good thesis statement?" "Absolutely not." But, "Do I know whether they agree or disagree?" "Yes." But, that isn't a thesis. It's an answer to a question, but I put yes because I wasn't sure. … But, subjective indicators weren't too many in there compared to other scales. There weren't too many in yours that I was confused with. The ones that I would have been confused with were like, the thesis. But, probably because "clear" and "good" are different for me. But, some of them were okay, such as "sophisticated" and "advanced" was subjective. But, I did know what you meant by that.
The issue of fairness was raised several times during the interviews. Both Brad
and Angelina expressed their concerns about whether student writing ability can be fairly
measured without taking essay length into consideration. As Angelina noted earlier, Brad
also felt that longer essays are generally judged more harshly because they are more
likely to contain mistakes. He also commented that a well-written essay was sometimes
scored worse than a poorly-written one because an advanced writer's risk-taking
strategies resulted in a loss of points stemming from additional mistakes. Along those
same lines, Angelina said that a short essay could be harder to judge because it might not
exhibit enough evidence to meet the evaluation criteria:
Researcher: What are other weaknesses of this checklist?

Brad: One of the weaknesses, (we discussed on the way here), I think there needs to be something about length of the piece. Is it a suitable length? Some of the short ones may not have the mistakes, but the longer ones are judged more harshly as they have more chance to making a mistake. But, maybe they are more of a correct length. Also, as some essays were much better than other essays, but came out worse because they made more mistakes. However, they are trying harder and trying to use interesting, more expressive language, but in doing so they lose marks on verb tense.

Angelina: I think, for example, if they don't employ whatever descriptor is there. For example, length, if it's too short to make a judgment on. I know that for #14, 'transitions,' if they didn't use it, you asked us not to give them a mark. But, if they used it once, is it appropriate or inappropriate? Have they employed 'transition devices'?

Researcher: Yes, I know what you mean.

Angelina: I thought it was easier and I felt more comfortable, but still difficult when you can't see evidence of the descriptor.
An in-depth discussion about fairness occurred in Angelina‟s second interview,
when she correctly argued that the EDD checklist was biased in favour of essays that do
not display risk-taking strategies:
Researcher: Do you think the EDD checklist fairly assesses student writing ability?

Angelina: For the binary choice, I thought, no. I thought it was difficult. I thought the test-taker was unfairly penalized or rewarded because they didn't employ a certain descriptor. If they didn't use a writing device, 'collocations,' 'transition devices,' etc. Even verb tenses were difficult. For example, it says the "verb tenses were used appropriately." If it was all in the present tense, then yes, sure, but if the student was using a variety of difference tenses, like, "When I was a child…" and used a flashback. You know, some anecdotal story about the past and they made an error… But I wasn't sure. The issue was that this student didn't make any errors, but only used present tense, whereas this student made errors, but used a variety of tenses and made errors. Obviously this person gets a no, but this person gets rewarded by sticking to just basic, present tense. Even the pronouns in reference, some are just not used. I thought some test-takers were penalized, whereas others weren't because they just didn't employ them. That's why I found it difficult to just say yes or no.
Angelina went on to suggest that differential weighting be placed on the
descriptors, commenting that because the descriptors differ in relative importance, two
essays with the same number of descriptors marked correct may reflect drastically
different writing abilities. She also felt that a poor essay could be awarded a high score
if simple, mechanical descriptors (such as punctuation and spelling) were correct, while
a good essay might be awarded a low score by getting those descriptors wrong. This
point is directly related to the need for diagnostic skill profiles. As discussed in the case
analysis, student writing skill profiles can be drastically different despite similar
observed scores. This indicates that a single observed score could provide an inaccurate
estimate of a student's writing proficiency because it masks specific information about
that student's strengths and weaknesses:
Angelina: I thought the first few questions were probably the most important. Of course, grammar and spelling are important. Grammar is all about trying to make a persuasive argument. However, if the student made spelling errors and it didn't obscure what they were trying to say – I don't think that it is that important. I do think there is a gradient in terms of these descriptors. I think that definitely the organization, intro, body, conclusion, supporting ideas and examples are important. Spelling and punctuation are not as important. That is what I think. It is interesting because I felt that some students were getting the same number of yes and no answers. But, I felt it sort of unfair. Just because they can spell correctly….I thought, "Oh my." I think certain descriptors should be weighted more heavily than others to better distinguish the writer's overall writing confidence.
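Angelina's weighting suggestion can be illustrated with a minimal sketch. The descriptor names and weights below are hypothetical and invented purely for illustration; they are not taken from the actual EDD checklist:

```python
# Hypothetical sketch of differential descriptor weighting: two essays with
# the same number of "yes" descriptors diverge once weights are applied.
# Descriptor names and weights are invented, not part of the EDD checklist.

WEIGHTS = {
    "clear_thesis": 3.0,
    "supporting_examples": 3.0,
    "organization": 2.0,
    "transitions": 1.5,
    "spelling": 0.5,
    "punctuation": 0.5,
}

def weighted_score(ratings):
    """Sum the weights of descriptors marked 'yes' (True)."""
    return sum(WEIGHTS[d] for d, ok in ratings.items() if ok)

# Both essays satisfy exactly three descriptors...
essay_a = {"clear_thesis": True, "supporting_examples": True,
           "organization": True, "transitions": False,
           "spelling": False, "punctuation": False}
essay_b = {"clear_thesis": False, "supporting_examples": False,
           "organization": False, "transitions": True,
           "spelling": True, "punctuation": True}

print(sum(essay_a.values()), sum(essay_b.values()))  # raw counts: 3 3
# ...but the weighted totals diverge sharply:
print(weighted_score(essay_a))  # 8.0
print(weighted_score(essay_b))  # 2.5
```

A weighted total of this kind would reward the content and organization descriptors Angelina considered most important, rather than letting spelling and punctuation offset a weak argument.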
Another limitation of the EDD checklist was reported by Greg and Brad, who
commented that some aspects of essay quality could not be captured by the EDD
checklist. Greg called this an overall impression, noting that simply meeting all the
criteria in the checklist did not necessarily result in a good essay. This point echoes the
holistic claim that a score for the whole is not equal to the sum of separate scores for the
parts (Goulden, 1992). Greg also commented that the most effective feedback method is
to focus on a student's single biggest problem rather than on all of them at once:
Greg: I prefer this to the very general TOEFL rating scale, but I still feel the need for an overall rating score. As I said in our first meeting, they can have yes, yes, yes, yes. They do things very well, but it is still a very basic, simple, boring essay. So, on this paper, it looks like a strong essay, but if you say the overall score is 4 out of 6 and they have all yes, then it should be perfect. But, maybe not? There is still something that this very detailed analysis can't get. That is the overall impression. I really wanted to see one more line – the overall impression, or overall score. Students will then know that it is their biggest problem. Most of these students have so many problems. Of course, they are going to try to learn and fix everything. But, students ask me all the time how they could make their essay stronger. It is a matter of the 'biggest' problems being different.
Like Greg, Brad advocated the need to evaluate the "general feel" of an essay,
arguing that there must be evaluation criteria, no matter how arbitrary, through which
teachers can express their general impressions of an essay. He also suggested that the
EDD checklist should incorporate human input, with consideration given to how such
input would be operationalized and statistically justified. Greg's and Brad's comments
are congruent with White's (1985) holistic view that writing is a unified and central
human activity rather than a set of detached, segmented activities:
Brad: I think it also has to be a general feel. It's a little bit arbitrary maybe, but how a teacher feels the piece is on its own; as separate from this. Generally, a C or B or A. Then sort of see if the numbers correlate to the grade. But, I think that the human aspect needs to be taken in as well. How somebody is reading it. At the moment we are breaking it down to the nitty-gritty and that's kind of good. It does reflect a lot about the writing style, but I also think it's good to have that human input saying, "Generally this feels like a C, or generally this feels like a B".

Researcher: For example, how can you put a general humanistic factor in a score?

Brad: I don't know. I think a lot of it comes through just reading it. If you read a lot of sample essays put together by some sort of organization ranking them as 'A essays,' 'B essays'. So, when reading it, you can figure out the 'C Bracket,' 'D Bracket'. A lot of that comes from your sense from reading it. From my first impression this will be a C, but let me check the breakdown. Maybe they work in harmony. It's not statistically justified; a lot of it is just instinct.

Researcher: Yes, I agree. A few teachers also mentioned that it kind of misses first impressions of writing.

Brad: Yes, we can be too analytical and try to break down every point. But, at the same time I do think this can be valuable information for the students to have as well. But, it would be nice to have that "Generally how do you feel about it?" But, I have no idea how you put that into your scale.
How Do Teachers Evaluate the Diagnostic Function of the EDD Checklist?
When teachers' perceptions of the diagnostic function of the EDD checklist were
examined, Greg reported that the EDD descriptors were still too broad in granularity to
provide detailed feedback, and suggested that more fine-grained descriptors be
identified:
Greg: For example, with punctuation, there are so many kinds of mistakes that students would make. I read this and I like your checklist, but I always try to keep in mind that we are writing this to give students feedback. It is for the student. So, I always think if we can give more information, it will help them more. For example, if we just say, "Uses punctuation well? No", they will ask, "Why, what is the problem?" Many students use a comma instead of 'and,' tick that. You could tell them to be careful as sometimes they do it right, sometimes they don't. Sometimes people use quotation marks, but it's more advanced. Sometimes, in the essay, the punctuation is perfect, but they need more. Right? Because they missed some places.
Greg further cautioned that the amount of feedback provided to students must be
carefully determined; for example, not all feedback should necessarily be provided at
once, and different feedback treatment should be available for students at different
proficiency levels:
Greg: Different kinds of feedback are important for different students. So, if it's a very mediocre essay, it has so many problems. So, what do you do? Obviously, you want to be honest. In the class, they don't need to know everything. I tell them to focus on structure, commas, spelling. However, I don't teach tone explicitly. I give lots of examples as to what is good, but unless it's something that really stands out, I don't fix that in the beginning. Also, it takes a long time to teach vocabulary. If there is a very serious problem, I'll write a note. Information about vocabulary is too much. But, for the very sophisticated essays, then they will be strong with some of this, so let's fix your punctuation or vocabulary problems. If you feel the need to use idioms, then let's make sure you use it right. So, not all of this is needed at the same time. When I am talking about an essay, I usually pick the two biggest problems. I hand them back the essay, and ask them to make edits and give it back to me.

This is something for teachers to keep in mind and I think that most teachers will do this automatically. But, sometimes there are just too many no answers. This won't help the student, no. But, if you use this and you can see the level of the students and choose particular problems, then it will be very helpful.
Similarly, Kara cautioned that correcting all errors could frustrate and
demoralize students, even if such correction was ultimately necessary to improve their
writing skills:
Researcher: How would you use the information gathered from the EDD checklist?

Kara: … I think this is a good balance from the student's point of view. I'm sure they would be horrified to see all of that. Perhaps you give it to them one chunk at a time, but I think ideally it has all the points that they need to think about. Probably I don't give them enough of this kind of thing. Probably not. This just frustrates the students and demoralizes them. I don't correct everything and I figure there is a limit to what they can take in. I will try to focus on one or two of the biggest problems rather than trying to correct them all. Thinking of it this way is going to get you to where you want to go.
Greg and Kara's points are well supported by existing ESL writing research. For
example, Hughey, Wormuth, Hartfiel, and Jacobs (1983) argued that "since an attempt to
teach all of them [the intricacies of ESL writing], along with the other important
processes of writing would overwhelm and discourage writers, ESL teachers need to
emphasize the structures that most affect ESL writers' abilities to communicate
effectively in written English" (p. 121). However, the opposite position must also be
noted: Ferris (2003) and Hedgcock and Lefkowitz (1994) noted that learners of ESL
writing appreciate teacher feedback on all aspects of their writing, including content,
organization, grammar, and mechanics.
Ann provided some insightful comments suggesting that useful diagnostic
feedback should acknowledge a learner's effort, improvement, and progression. She
emphasized the importance of teacher comments and encouragement:
Researcher: Do you think the EDD checklist provides useful diagnostic information about the strengths and weaknesses of students' ESL academic writing?

Ann: Yes, but not totally – sometimes things are not as simple as a yes or no. Also, acknowledgement is needed for effort, improvement, progress, validation etc. Teacher comments are possibly more important. Encouragement is a huge motivator.
How Do Teachers Perceive the Diagnostic Function of the EDD Checklist for Classroom
Instruction and Assessment?
Both positive and negative reactions were reported by teachers with regard to the
usefulness of the checklist as a diagnostic classroom assessment tool. Tom thought that
having this sort of diagnostic assessment tool would be a great benefit for teachers at
both the class and individual levels. He even suggested that a class should be designed to
focus not only on the areas in which students experience difficulty, but also on the areas
in which they require individual attention. He also pointed out that the ability to provide
specific diagnostic information would be particularly useful for motivated students:
Researcher: Do you think the EDD checklist provides useful diagnostic information?

Tom: Yes, I would think so. Just thinking back to my writing classes, I found one of the things that were difficult to teach was specific exercises for individuals, where one person would need a lot of help on spelling, another would need help on punctuation. A German student would litter their essay with commas and it's difficult to break them of this habit due to the similarity of structure. I really had to work with each student individually on that. That took a lot of one-on-one style teaching for this. Teacher to student as opposed to teacher to students. I found that this sort of teaching was almost impossible when it came to these types of details.

But, if I had a list for every student like this, then I could somehow put the information (the data) in a computer and could see a bright green line down the page by each student's name. That way I could see that almost all of the students had difficulty with this particular topic and I could plan a study for 15 minutes in the class, "just on the use of commas." After that we could activate the exercise into something more free following a regular ESL lesson plan, right? I think this would definitely be useful because the information in here would be evidence that each student would benefit from that kind of a study lesson.

Researcher: That's a good point! Glad to hear that.

Tom: A specific study lesson to focus on before doing the general activity. Having some kind of reference to provide a common denominator would be wonderful.

Researcher: Absolutely!

Tom: For motivated students it's fantastic. People are always looking for feedback, particularly adults. This is what we are talking about – adults. They want yes or no; this is the way we work. Yes or no? If it's no, go work on it.
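Tom's idea of entering each student's checklist into a computer and spotting class-wide weak areas can be sketched as follows. The student names, descriptors, and ratings are hypothetical, chosen only to illustrate the aggregation he describes:

```python
# A minimal sketch of Tom's suggestion: collect each student's yes/no
# checklist and tally, per descriptor, how many students received a "no",
# so the teacher can plan a short lesson on the weakest common area.
# Students and descriptors below are invented for illustration.
from collections import Counter

checklists = {
    "Student A": {"comma_use": False, "verb_tense": True,  "thesis": True},
    "Student B": {"comma_use": False, "verb_tense": False, "thesis": True},
    "Student C": {"comma_use": False, "verb_tense": True,  "thesis": True},
}

def weakest_descriptors(checklists):
    """Return descriptors ordered by how many students were marked 'no'."""
    no_counts = Counter()
    for ratings in checklists.values():
        for descriptor, ok in ratings.items():
            if not ok:
                no_counts[descriptor] += 1
    return no_counts.most_common()

print(weakest_descriptors(checklists))
# [('comma_use', 3), ('verb_tense', 1)]
```

Here every student was marked "no" on comma use, so a brief whole-class lesson on commas would be the natural first target, exactly the kind of evidence-based planning Tom envisions.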
A slightly different view was provided by Greg, who cautioned that the vast
amount of diagnostic information would be ineffective for less proficient students, and
that the diagnostic information should be used with care because student performance
does not remain static:
Researcher: What do you think about using this checklist for classroom-based
instruction?

Greg: It would be okay, it would be okay. It's not my style, that's why I
hesitate. I much prefer to write comments. This is very good
feedback, it's very organized. It will help the students, it will. But,
for me, the comments here, "Work on it!" This is the best thing;
this is the good record.

Researcher: Do you think that providing this much information to every
student might be overwhelming?

Greg: I think so, especially for low-level students. On a daily or weekly
basis, it might be too much. But, sometimes on a Saturday, once a
month, my school has a practice TOEFL exam. In this case, if they
were really trying to concentrate on writing a proper essay, then I
would focus on giving them a lot of very detailed feedback.

Researcher: So, on a monthly basis?

Greg: I think so. I think so for all of it. But, on a more frequent basis, it's
up to the teacher. It's up to the teacher, the level, what you discuss
in class. If you are concentrating on form, maybe you want to give
this part (point out grammar descriptors).

Researcher: So, you wouldn't use this for a daily basis or weekly basis
instruction.

Greg: I don't think so because the same student can write a very good
essay one day and the next day a terrible essay. If I say they are
disorganized, weak thesis, bad example, it's just one time and they
probably know.
The need for longitudinal feedback was also echoed in Kara's interview.
She felt that students could misinterpret the diagnostic information derived from a single
assessment, and expressed her concern about the potential for misrepresentation of the
assessment results in a classroom context:
Researcher: Do you think the diagnostic information generated by the EDD
checklist would be useful for classroom instruction?

Kara: I think it may be too limited. I think it is in improvement often
when we mark we don't look at all of these points. It is very useful
in this perspective. I would be reluctant to give a student this not
wanting them to think that based on one thing they can do these
things. I would think that is an unfair assumption based on this
data.

Researcher: This is an assessment right?

Kara: You're right.

Researcher: You think this might be unfair in a classroom context.

Kara: Yes, in a standardized context, yes, then the framework is a bit
different. In a classroom context, which I'm most familiar with, I
prefer more of a gentle (not as bold) wording.

Researcher: This is kind of final.
Esther and Brad also commented that a diagnostic report should be carefully
designed to take into consideration the fact that students do not make certain mistakes
all the time:
Esther: Since this is meant to be a feedback form, it's hard to know
whether someone is doing it all the time.

Brad: I think it gives me a little bit of a safety net. That way I'm not
saying "I know this". It's a bit to cover myself. Sometimes I think
on a lot of these essays it's not as though they are making the same
mistake all the time, right? They maybe make the odd mistake and
then they will do it correctly. So, to suddenly tick no that spelling
is bad or they don't use their verb tenses correctly, a part of me
thinks that this will discourage the student a little bit, seeing that
written there.
How Do the Teachers' Perceptions of the EDD Checklist Change Over Time?
Teachers reported that they were more confident the second time they used the
checklist. In her second interview, Angelina commented that she fully understood that
each descriptor was independent of the others, which lessened the psychological pressure
she felt when providing a no response to students:
Researcher: How did you like the EDD checklist this time?

Angelina: I did still find it difficult, but less so maybe than the first time
around. I tried to step back a little bit. I think initially I was really
worried that "it's not fair" and that was weighing heavily on me,
but I think the second time around it was a little bit easier. I sort of
have an understanding of where the focus is for each one, so I
don't feel as guilty when I put no. I know that just because they
didn't satisfy this point, they can still satisfy this one, as they are
now independent from one another. There was so much overlap
before it wasn't fair because if I put yes in one spot and they
would get all yes, or if I put no, then it was no all the way down.
This time I felt less pressure because maybe it was a no, but the
next one is independent so I'll put a yes there.
Similarly, Sarah was more confident in her ratings in the second assessment
round as she became more familiar with the checklist:
Sarah: The first one I did I was unsure. I think I was a bit weaker in the
percentages. By the second one I got more convinced. I was more
familiar with the tool and was more confident that "yes, I'm seeing
that." I did feel that it made me focus on specific features of the
writing that I maybe didn't focus on exclusively before looking at
it from a holistic perspective.
Overall, teacher responses in the interview suggested that the EDD checklist had
both strengths and weaknesses with regard to assessing student writing performance.
While teachers approved of the comprehensiveness and specificity of the descriptors,
they also pointed out certain problems with subjectivity and fairness. The length of the
checklist and lack of human input were also seen as weaknesses; these could serve as
counter-arguments against the use of the EDD checklist. Teachers had mixed opinions
about the use of the binary choice system: some criticized the lack of a continuum on
which writing performance could be measured, while others acknowledged that
answering yes or no to specific, fine-grained descriptors helped them to be more
internally consistent and focused as raters. While teachers acknowledged that the use of
the checklist could have a positive impact on classroom instruction, they cautioned that
an appropriate amount of diagnostic feedback should be given to students when the
checklist is used because an overwhelming amount of negative feedback (as indicated by
an excessive number of no responses) could frustrate and demoralize them. Teachers also
highlighted the need for longitudinal feedback in order to accurately track the progress
of student writing performance over time.
Summary
This chapter has discussed three validity assumptions centered on the primary
evaluation of the EDD checklist. Of particular importance was the extent to which
writing skill profiles generated using the EDD checklist provided useful and sufficient
diagnostic information about students' strengths and weaknesses in ESL academic
writing. The study's overall findings suggested that the estimated diagnostic model is
stable and reliable; although approximately 34% of the descriptors exhibited poor
diagnostic power, the estimated diagnostic model had high discriminant function, with
students in the flat categories accounting for only a portion of the total. The moderate to
slightly high correlation between EDD scores and TOEFL scores also provided
convergent evidence for use of the EDD checklist; however, as discussed in Chapter 5,
this criterion-related validity claim should be interpreted carefully because the two rating
rubrics were developed for different test purposes. Overall teacher evaluation further
justified the validity claims for the use of the checklist. While teachers cautioned that an
appropriate amount of diagnostic feedback should be given to students, they also
acknowledged that the use of the checklist could have a positive impact on classroom
instruction. The next chapter synthesizes the research findings derived from a variety of
assumptions to create a validity narrative.
CHAPTER 7
SYNTHESIS
Introduction
This study sought to make multiple validity inferences in order to argue that
scores derived from the EDD checklist can be used to diagnose the domain of writing
skills required in an ESL academic context. Each inference in the interpretive argument
prompted a particular investigation of the checklist's development and evaluation
procedures. Underlying inferences were investigated by judging the following
assumptions addressing different aspects of validity claims and requiring different types
of evidence:
The empirically-derived diagnostic descriptors that make up the EDD checklist
are relevant to the construct of ESL academic writing.
The scores derived from the EDD checklist are generalizable across different
teachers and essay prompts.
Performance on the EDD checklist is related to performance on other measures
of ESL academic writing.
The EDD checklist provides a useful diagnostic skill profile for ESL academic
writing.
The EDD checklist helps teachers make appropriate diagnostic decisions and has
the potential to positively impact teaching and learning ESL academic writing.
This chapter synthesizes the inferences in order to create a validity narrative that
captures the evolving evaluations and interpretations of the checklist's use. The five
validity assumptions that formed the central research questions are revisited and critically
reevaluated in relation to the overarching validity argument leading to the potential
consequences. The empirical data and theoretical analyses that served as the backing for
inferences of the interpretive argument are also discussed in light of evidentiary
reasoning. Finally, the implications for future research on the checklist's applications in
ESL academic writing are discussed.
Validity Assumptions Revisited
Validity Assumption 1: The empirically-derived diagnostic descriptors that make up the
EDD checklist are relevant to the construct of ESL academic writing.
The central focus of skills diagnostic assessment is the extent to which the skills
being assessed reflect knowledge, processes, and strategies consistent with the test
construct in the target domain. It was thus critical to empirically identify assessment
criteria operationalizing the ESL writing skills required in an academic context.
Theoretical analysis was then used to justify and confirm these assessment criteria.
Considering that the construct of ESL writing is multi-faceted and complicated, it was
also important to identify fine-grained and separable assessment criteria so that it would
be possible to implement skills diagnosis modeling. If the ESL writing construct could be
reliably and validly deconstructed and operationalized, then valid inferences could be
made about students' ESL writing ability.
To this end, multiple empirical sources were sought from diverse perspectives.
Not only was the incidence of writing performance observed using real student writing
samples, but assessment criteria were elicited from teachers' think-aloud verbalization on
ESL essays. As discussed in Chapter 4, these verbal accounts provided rich descriptions
of ESL academic writing subskills and textual features, resulting in 39 descriptors. These
descriptors were empirically-derived, concrete, fine-grained, and consistent with
theoretical accounts of ESL academic writing, addressing all aspects of writing skills (i.e.,
content fulfillment, organizational effectiveness, grammatical knowledge, vocabulary use,
and mechanics). The substantive review and refinement process performed by the ESL
academic writing experts further confirmed the soundness of the descriptors, resulting in
the final 35 descriptors that make up the EDD checklist. That the greatest number of
descriptors was associated with grammatical knowledge was also reasonable, considering
that students greatly desire feedback on grammatical problems in their writing (Cohen &
Cavalcanti, 1990; Ferris, 1995; Hedgcock & Lefkowitz, 1994; Leki, 1991).
A series of exploratory and confirmatory statistical analyses were used to further
characterize the latent dimensional structure of ESL academic writing. Various facets of
writing ability were conceptualized and organized, suggesting that writing competence
does not lie on a single unitary continuum. These findings were consistent with
theoretical accounts of ESL writing, defining writing ability as a constellation of multiple
subskills. As Biber (1988) noted:
Linguistic variation in any language is too complex to be analyzed in terms of
any single dimension. The simple fact that such a large number of distinctions
have been proposed by researchers indicates that no single dimension is adequate
in itself. In addition to the distinctions… such as restricted versus elaborated and
formal versus informal, linguistic features vary across age, sex, social class,
occupation, social role, politeness, purpose, topic etc. From a theoretical point of
view, we thus have every expectation that the description of linguistic variation
in a given language will be multidimensional. (p. 22)
This multidimensional view of ESL writing was also consistent with discussions in ESL
writing literature. As noted in Chapter 2, despite different orientations, theoretical
accounts, discourse analysis, and rater perceptions and rating scales provided compelling
bases upon which to define and assess ESL writing abilities.
Of particular interest were the ways in which the EDD checklist differed from
other assessment methods. Although it is similar to most analytic rating scales, the
checklist was able to overcome a number of the limitations of those other scales. For
example, while the checklist conceptualizes ESL academic writing competence in the
same way as most other analytic rating scales (by focusing on such major assessment
criteria as content fulfillment, organizational effectiveness, grammatical knowledge,
vocabulary use, and mechanics), the fine-grained descriptors in the checklist assess
specific writing features that maximize the diagnostic feedback from which students can
benefit. In order to assess grammatical knowledge in an essay, for instance, the checklist
can provide a precise description of the global and local grammatical aspects associated
with syntactic structure, errors of agreement, tense, number, articles, pronouns,
prepositions, and so on.
The evidence gathered throughout the EDD checklist‟s development procedure
suggests that the checklist accurately represents the multidimensional construct of ESL
academic writing. The teachers' think-aloud verbal data were a valuable empirical source
that substantiated the construct being measured and provided concrete rationales and
evidence justifying the selected assessment criteria. The theoretical analysis further
confirmed that the checklist was grounded in, rather than free of, theory. This approach
was particularly well-aligned with the concepts of diagnostic assessment because it
enabled teachers to be active generators of assessment criteria and interpreters of
assessment outcomes rather than passive listeners. In a diagnostic assessment framework,
an ongoing dialogue with assessment users and developers can help to create a consensus
about the elements to be evaluated, and can help to keep diverse educational clientele
better informed about the assessment outcomes.
Validity Assumption 2: The scores derived from the EDD checklist are generalizable
across different teachers and essay prompts.
The second assumption examined the potential impact of the various sources of
variability associated with sampling conditions of observation. Teacher and essay prompt
facets were the primary sources of variability suspected to prevent accurate inferences
about student ESL academic writing ability. If the student writing scores obtained from a
sample of teachers on a sample of essay prompts cannot be generalized beyond that
specific set of teachers and essay prompts, it will undermine the interpretive argument.
Three approaches were used to explore this suspected variability: (a) teacher internal
consistency, (b) teacher agreement, and (c) descriptor-teacher/essay prompt interaction.
In a Many-faceted Rasch Model (MFRM) analysis, the teacher fit statistics
indicated that all of the teachers exhibited accurate rating patterns when using the EDD
checklist and none exhibited random, halo/central, or extreme rating effects. A bias
analysis further suggested that most teachers were neither positively nor negatively
biased toward any particular descriptors, nor were the essay prompts biased for or against
any descriptors. These results suggest that teachers are able to use the EDD checklist in
an internally consistent manner and that the EDD descriptors function consistently across
different teachers and essay prompts.
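The teacher fit statistics referred to here can be illustrated, for the simpler dichotomous Rasch case, with a minimal sketch of the standard infit and outfit mean-square computation; the ability and difficulty values in the test of this function are illustrative, not measures from the study:

```python
import math

def rasch_fit_stats(responses, abilities, difficulties):
    """Infit/outfit mean-squares for dichotomous Rasch responses.

    responses[i] is a 0/1 rating, abilities[i] the person measure and
    difficulties[i] the item measure (all in logits). Values near 1.0
    indicate ratings consistent with model expectations; values far
    above 1.0 suggest random or erratic rating behavior.
    """
    sq_resids, variances = [], []
    for x, theta, b in zip(responses, abilities, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))   # model-expected score
        sq_resids.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of standardized squared residuals
    outfit = sum(r / v for r, v in zip(sq_resids, variances)) / len(responses)
    # Infit: information-weighted mean-square
    infit = sum(sq_resids) / sum(variances)
    return infit, outfit
```

In a full MFRM analysis the same residual logic extends over the teacher, student, prompt, and descriptor facets simultaneously.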
However, a mixed result was found when agreement rates among teachers were
investigated. While the correlation between a single rater and the rest of the raters
(SR/ROR) indicated that each teacher rank-ordered students in a manner similar to that
of the other teachers, teacher agreement statistics showed that each teacher provided
exactly the same ratings as another teacher under identical circumstances only a low to
moderate percentage of the time. In addition, when teacher agreement rates were
examined at the descriptor level, teachers showed high agreement (> 85%) on descriptors
assessing discrete grammar knowledge, but low agreement (< 70%) on descriptors
assessing global content skills. These results indicate that it might be difficult to claim
that a particular teacher's assessment of student writing performance is generalizable
beyond that specific teacher. However, as discussed in Chapter 5, the reported reliability
indices must be interpreted carefully because the teachers were not well-trained certified
professional writing assessment raters, and dichotomous ratings (rather than polytomous
ratings) were used in the assessment. The research findings of Barkaoui (2008) and
Knoch (2007) also support the idea that the subjective nature of the task renders it
difficult to achieve high inter-rater agreement in ESL writing assessment.
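The exact-agreement statistic discussed above is simply the percentage of descriptors on which two teachers gave the same yes/no rating; a minimal sketch with hypothetical ratings:

```python
def exact_agreement(ratings_a, ratings_b):
    """Percentage of descriptors on which two teachers gave the same
    yes/no rating. Ratings are parallel lists of booleans."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

# Hypothetical ratings from two teachers on ten descriptors
teacher_1 = [True, True, False, True, False, True, True, False, True, True]
teacher_2 = [True, False, False, True, False, True, True, True, True, True]
print(exact_agreement(teacher_1, teacher_2))  # 80.0
```

Computed per descriptor across many essays rather than per essay, the same function yields the descriptor-level agreement rates (e.g., > 85% for discrete grammar descriptors, < 70% for global content descriptors) reported here.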
Overall, the findings of the second assumption present a somewhat fuzzy picture
of the random errors associated with teachers and essay prompts. Unlike traditional
fixed-response assessments (such as multiple-choice tests), the presence of raters and
tasks in performance assessment adds a new dimension of interaction, making it even
more crucial to monitor reliability and validity. A greater number of raters and tasks in an
assessment would be desirable in order to improve consistency from one performance
sample to another, but this is not always possible due to limited resources. The problem
becomes more serious when one considers that the EDD checklist was developed to be
used for diagnostic assessment purposes in a small-scale classroom, where relatively few
resources are allocated. One way of resolving this problem would be to standardize essay
prompts by providing clear specifications. Another way would be to train teachers on a
continuous basis, since effective training would help teachers to use the checklist
consistently and reliably. Care must be taken, however, because high inter-teacher
reliability could counter the contextual validity gained from using the EDD checklist.
The checklist was developed to be used in classroom assessment, which is typically
provided by just one teacher. High inter-teacher reliability would not be crucial in such
cases, and could even threaten the valid use of the checklist.
Validity Assumption 3: Performance on the EDD checklist is related to performance on
other measures of ESL academic writing.
The third assumption is related to concurrent or criterion-related validity and
examined the extent to which the scores awarded using the EDD checklist correlated
with other measures of ESL academic writing. This assumption did not necessarily seek
convergent evidence among different measures of ESL academic writing because a single
measure should not automatically be the norm against which others are compared. The
selected measure was the TOEFL independent rating scale, and the correlation between
the two measures was r = .77 in the pilot study and r = .66 in the main study. This
moderate to slightly strong association indicated that a student who received a high score
in EDD assessment would likely receive a high score in TOEFL assessment. It also
suggested that, to some extent, the EDD checklist measures the same ESL academic
writing construct that the TOEFL rating scale measures.
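The reported coefficients are standard Pearson product-moment correlations between the two sets of scores; a minimal sketch of the computation (the score lists passed to it would be hypothetical stand-ins for EDD and TOEFL scores):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two parallel score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of r = .66, as in the main study, means the two measures share roughly 44% of their variance (r squared), which is why the association is read as convergent but far from redundant.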
However, different interpretations are also possible. The fact that the magnitude
of the correlation was not very strong suggests that the two measures approached the
ESL academic writing construct in different ways. As White (1985) rightly argued, the
holistic and analytic rating systems upon which the TOEFL rating scale and EDD
checklist are based rely on fundamentally different philosophies. White considered the
act of writing to be a whole human activity that cannot be broken into separate segments.
Goulden (1992) defined the holistic assessment approach in a similar way; that is, a score
for the whole is not equal to the sum of separate scores for the parts. The moderate
association between the two measures could therefore reflect the fundamental differences
underlying the holistic and analytic rating systems. Indeed, as discussed in Chapter 6,
meeting all of the assessment criteria in the EDD checklist does not necessarily result in
a good essay because the checklist does not explicitly measure such elements as overall
impression.
The purposes for which the two measures were developed must also be taken
into account. The EDD checklist diagnoses students' ESL writing ability in an academic
context and guides and monitors their writing progress, while the TOEFL rating scale
places students into appropriate ESL writing proficiency levels in order to facilitate
school admission decisions. Therefore, divergent evidence might be more informative if
it was used to highlight these different assessment purposes. If this holds true, it is
reasonable that the two assessment purposes required different ESL writing abilities and
tapped into different aspects of the ESL writing construct.
Another point that demands attention is the different size of the correlation
coefficients. When the TOEFL scores were correlated with the writing proficiency
measures estimated by a MFRM analysis, a greater association was found than with the
observed scores. This might be because the two sets of scores (i.e., estimated and
observed) were derived from fundamentally different measurement theories, namely
classical test theory and item response theory. The MFRM analysis estimated latent
writing ability free from the severity of a particular teacher and free from the difficulty of
an essay prompt and a descriptor, so that estimated writing measures might have more
accurately reflected students‟ true writing ability than the observed scores might have.
On the other hand, the total observed score did not take such assessment conditions into
account, resulting in possibly biased scores. The absence of such score adjustment might
therefore have caused the lower correlation with the observed scores. The different number
of teachers involved in the assessment must also be considered; while estimated scores
were computed using ratings from two teachers, observed scores were derived using
ratings from a single teacher. This disparity might have affected the size of the correlation
coefficient.
Overall, the third assumption was somewhat difficult to judge due to its
methodological limitations. The most accurate association between different measures
will be found when the same teachers generate two sets of scores while participating in
both assessments. This study was not able to meet these criteria; teachers participated in
only one assessment in which they used only the EDD checklist to assess essays, and
their scores were compared with the original TOEFL scores awarded by ETS raters. This
methodological limitation might have caused confounding results that ultimately
threatened the score interpretations. If time and resources are available, an experimental
study is recommended in order to better understand the ways in which writing
performance assessments made using the EDD checklist relate to those made using other
measures. Comparing EDD scores with teachers‟ classroom assessments might also be
interesting.
Validity Assumption 4: The EDD checklist provides a useful diagnostic skill profile for
ESL academic writing.
The central principle of diagnostic assessment is that it formatively assesses fine-
grained knowledge processes and structures in a test domain, thus providing detailed
information about students‟ understanding of the test materials. The fourth assumption
addressed this thesis, examining the extent to which the writing skill profiles generated
using the EDD checklist provided useful and sufficient diagnostic information about
students' strengths and weaknesses in ESL academic writing. The Reduced
Reparameterized Unified Model ([Reduced RUM], Hartz, Roussos, & Stout, 2002) was
used to model students' writing performance, and its outcomes were evaluated in order to
justify the assumption.
Various skills diagnosis measures supported the stability and accuracy of the
estimated diagnostic model. When model parameters were estimated using a Markov
Chain Monte Carlo (MCMC) algorithm, the overall pattern of the Markov Chain plots
indicated that convergence had occurred for most of the parameter estimates. The
descriptor parameters also supported the robustness and informativeness of the estimated
model by having most of the π* values close to 1 and r* values smaller than 0.9; only
six descriptor parameters were eliminated from the initial Q-matrix entries. Overall
goodness of model fit was also satisfactory, with predicted score distributions
approximating observed score distributions.
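The descriptor-parameter screen described above can be sketched as follows. In the Reduced RUM, a π* value close to 1 means skill masters almost always satisfy the descriptor, while an r* value near 1 means the descriptor barely separates masters from non-masters; the parameter values below and the 0.8 floor for π* are illustrative assumptions (the study's stated criterion concerns r* values below 0.9):

```python
# Hypothetical Reduced RUM descriptor parameters (not values from the study).
descriptor_params = {
    "D01": {"pi_star": 0.95, "r_star": 0.40},
    "D02": {"pi_star": 0.91, "r_star": 0.95},   # weak discrimination
    "D34": {"pi_star": 0.30, "r_star": 0.55},   # masters often miss it
}

# Flag descriptors whose parameters suggest removal from the Q-matrix:
# low pi_star (masters fail it) or high r_star (poor discrimination).
flagged = [
    name for name, p in descriptor_params.items()
    if p["pi_star"] < 0.8 or p["r_star"] >= 0.9
]
print(flagged)
```

Screening of this kind is what led to the elimination of six parameters from the initial Q-matrix entries.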
The hierarchy of skill difficulty provided the most insightful findings, as the four
different analytic methods used all echoed the same result. The first approach was
associated with the proportion of skill masters (pk), suggesting that vocabulary use was
the most difficult skill, followed by content fulfillment, grammatical knowledge,
organizational effectiveness, and mechanics. This result was confirmed using skill
mastery classifications that showed that the greatest number of students mastered
mechanics, while the smallest number of students mastered vocabulary use. The most
common skill mastery pattern in each number of categories also provided additional
evidence; mechanics was the first skill that students mastered, as indicated by the
mastery pattern “00001”, while vocabulary use was the last skill that students mastered,
as indicated by “11101”. Finally, the skill mastery pattern across writing proficiency
levels showed that a substantial number of students in the advanced group had mastered
more difficult skills such as grammatical knowledge and vocabulary use, while the
majority of those in the beginner group mastered easier skills such as organizational
effectiveness and mechanics. These psychometric findings were consistent with ESL
writing research indicating that vocabulary use and content are the essential elements
characterizing high-level essays (Milanovic et al., 1996).
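The proportion-of-masters statistic (pk) and the modal mastery pattern can both be read directly off the binary skill profiles; a minimal sketch with hypothetical profiles, ordered by the checklist's five skill categories:

```python
from collections import Counter

SKILLS = ["content", "organization", "grammar", "vocabulary", "mechanics"]

# Hypothetical mastery profiles, one five-character string per student
# (1 = skill mastered), ordered as in SKILLS. Not data from the study.
profiles = ["00001", "00001", "10101", "11101", "11111", "00101"]

# Proportion of masters (pk) for each skill
p_k = {
    skill: sum(int(p[i]) for p in profiles) / len(profiles)
    for i, skill in enumerate(SKILLS)
}

# Most common overall mastery pattern
modal_pattern = Counter(profiles).most_common(1)[0][0]
print(p_k, modal_pattern)
```

With these toy profiles the ordering mirrors the study's finding: mechanics has the highest pk and vocabulary use the lowest, and "00001" (mechanics only) is the modal pattern among low scorers.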
The skills probability distribution also supported the diagnostic quality of the
student writing skill profiles. The estimated model generated various types of skills
mastery profiles; students who fell into the flat categories accounted for only some
portion of the total, indicating the model's high discriminant function. The consistency of
the skill classification was another valuable source that supported the quality of the
estimated model. Although mechanics showed slightly lower reliability indices compared
to other skills, overall consistency was high, further confirming that the estimated model
generated reliable diagnostic skill profiles.
Despite these encouraging results, evidence was also found that could undermine
the validity claim. Approximately 34% of the descriptors exhibited poor diagnostic
power, failing to effectively discriminate masters from non-masters. These descriptors
were relatively easy, with proportion-correct scores greater than the mean across all
descriptors. In particular, D34 (indentation) appeared the most
problematic. It had an extremely low discrimination value, suggesting that students who
had mastered mechanics were nevertheless unlikely to appropriately indent the first
sentence of each paragraph in their writing. This finding is
consistent with Polio's (2001) speculation that mechanics consists of heterogeneous
components (such as indentation, capitalization, spelling, and punctuation) and is
therefore difficult to treat as a unitary construct.
The instability of mechanics was also observed in further analysis. When the
skill mastery profiles were examined across essay prompts, a similar mastery proportion
was found for all skills but mechanics. Specifically, students in the intermediate group
exhibited a drastically different skill mastery pattern for mechanics across the two essay
prompts. These results were somewhat unexpected, since the bias analysis examining the
interaction between the descriptors and the essay prompts did not find any evidence that
the descriptors functioned differently across the two different essay prompts. One
possible explanation is that the bias was negligible at the descriptor level but became
substantial when the descriptors were aggregated to form a skill. More
psychometrically rigorous analyses could better address this interaction effect.
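The aggregation effect suggested here can be illustrated with a toy calculation; all of the bias values below are invented for illustration and do not come from the study's data.

```python
# Hypothetical illustration (values invented, not from the study): small
# descriptor-level prompt effects can accumulate into a noticeable
# skill-level effect when descriptors are summed to form a skill score.

# Difference in expected score (prompt A minus prompt B) for each of
# seven hypothetical descriptors that make up one skill.
descriptor_bias = [0.03, 0.04, 0.02, 0.05, 0.03, 0.04, 0.04]

skill_level_bias = sum(descriptor_bias)

# Each descriptor effect is at most 0.05 of a point (arguably negligible
# on its own), yet the aggregated skill-level effect is about 0.25.
```

This is only an arithmetic sketch of the mechanism; a formal treatment would require the kind of rigorous interaction analysis the text calls for.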
The results of the case analysis provided more questions than answers. The six
selected cases had drastically different skill profiles despite their similar observed scores.
If students were provided with both a total observed score and estimated skill profile,
they could be confused or form conflicting interpretations of their writing proficiency.
However, it is also possible that this discrepancy could be interpreted as highlighting the
need for diagnostic skill profiles. Unlike a single observed score, which masks precise
and detailed information, diagnostic skill profiles specifically point to the areas in which
students show strengths and weaknesses. Care should therefore be taken when a
diagnostic score report is created and given to students.
The diagnostic usefulness of ESL academic writing skill profiles was examined
based upon both positive and negative evidence needed to justify the interpretive
argument. With a few exceptions, the results of the psychometric analyses suggested that
the estimated diagnostic model was robust, providing useful and sufficient diagnostic
information about student ESL academic writing performance. However, quantitative
evidence alone is not sufficient to support the validity claim. The next validity
assumption touches upon qualitative findings, focusing on the potential impact and
consequences of using the checklist in ESL academic writing classes.
Validity Assumption 5: The EDD checklist helps teachers make appropriate diagnostic
decisions and has the potential to positively impact teaching and learning ESL academic
writing.
The fifth and final assumption concerned the extent to which the EDD checklist
helped teachers make appropriate and confident diagnostic decisions, and gave teachers a
positive perception of the checklist's diagnostic usefulness. If teachers reported that the
checklist helped them to make appropriate and confident diagnostic decisions and had the
potential to positively impact diagnosing ESL academic writing skills and improving
their instructional practices, it would support the validity claim. Teacher perceptions and
judgments about the use of the EDD checklist were explored primarily using
questionnaire and interview data.
Of the many comments collected about the checklist and its use, a few are worth
noting. Teachers generally considered the checklist to be an effective diagnostic
assessment tool. As discussed in Chapter 6, they appreciated the checklist's
comprehensiveness and acknowledged that it covered multiple aspects of ESL academic
writing. They also commented that the breakdown of writing skills helped them know
what to look for when assessing essays. Similarly positive evaluations were made in the
questionnaire responses, with most teachers reporting that the checklist was clear and
understandable and that they liked using it. Reported teacher confidence in using the
checklist was also high.
However, teachers also raised some potential issues, specifically whether the
checklist could help them to make appropriate diagnostic decisions. Of particular concern
was that student writing ability could not be fairly measured without taking essay length
into consideration. According to their observations, longer essays tended to be judged
more harshly because writers of longer essays had more opportunity to make mistakes.
Well-written essays were also sometimes scored worse than poorly-written ones because
the risk-taking strategies of more advanced writers resulted in additional mistakes and a
resultant loss of points. In other cases, shorter essays were harder to judge because they
did not provide enough evidence to meet the evaluation criteria. These problems all
relate to the characteristics of the analytic rating system: the act of writing may be
more than the sum of its parts, and an analytic approach may therefore fail to capture
students' genuine writing ability, resulting in biased scores. From a
different perspective, these problems are also directly related to the need for diagnostic
skill profiles. As discussed in the case analysis, student writing skill profiles can be
drastically different despite similar observed scores. This indicates that a single observed
score could provide an inaccurate estimate of a student‟s writing proficiency because it
masks specific information about that student's strengths and weaknesses.
The potential impact of EDD assessment drew pointed attention. Teachers
generally felt that using the checklist could have a positive impact on their classroom
instruction. One teacher noted that teachers could greatly benefit from this kind of
diagnostic assessment tool because it would help them to identify not only the areas in
which students are experiencing difficulty, but also the areas in which they require
individual attention. Teachers also felt that using the checklist could have some negative
impact; a few cautioned that too much diagnostic feedback (such as marking all grammar
errors) could demotivate and disempower students, and that the amount and nature of the
feedback offered should be carefully determined as a result. Teachers also suggested that
not all feedback should be provided at once, since students at different proficiency levels
would require different treatment. Indeed, some teachers claimed that specific diagnostic
information would be particularly useful for motivated students, but less effective for less
proficient students.
Despite the teachers' reluctance to provide detailed and thorough feedback to
students, research suggests that students want to receive substantial feedback from their
teachers. Surveys on student feedback preferences have found that students are inclined
to receive, attend to, and address feedback on all aspects of their writing (Cohen &
Cavalcanti, 1990; Ferris, 1995, 2003; Ferris & Roberts, 2001; Hedgcock & Lefkowitz,
1994; Hyland, 1998; Lee, 2004; Leki, 1991; Zhang, 1995). Similarly, research findings
on the effect of teacher feedback are somewhat contradictory; while some researchers
have urged teachers to provide one type of feedback at a time (e.g., Hughey et al., 1983),
others have noted that ESL writing students can deal with multiple types of feedback on
the same draft (e.g., Boiarsky, 1984; Fathman & Whalley, 1990). Although the findings
are inconclusive, a consensus appears to have been reached in the ESL writing literature:
students appreciate clear, concrete, and specific feedback (Ferris, 1995; Straub, 1997). If
students' needs for diagnostic feedback are taken seriously, the EDD checklist could
be used to provide such feedback.
The intended and unintended consequences of EDD assessment are another point
worth noting. Teachers cautioned that a student's writing ability should not be determined
by a single assessment, because students do not make the same mistakes every time and
their performance does not remain static. It was also pointed out that students might
misinterpret the specific diagnostic information derived from a single assessment. If the
diagnostic feedback provided to students is outdated and does not capture student writing
progress appropriately, it could unintentionally deliver the wrong message. However, if
teachers provide longitudinal feedback in a timely manner on multiple drafts that take the
theories of development in ESL writing into consideration, these unintended negative
consequences would be reduced.
Implications
Theoretical Implications
Usefulness of an Empirical Approach to Scale Development
The results of this study support the idea that an empirical approach is useful
when developing an assessment scheme (Brindley, 1998; Fulcher, 1987, 1993, 1996b,
1997; Upshur & Turner, 1995, 1999). As many researchers have pointed out, the most
serious problem with intuition-based or a priori rating scales is that it is not always clear
how the scale descriptors were created (or assembled) and calibrated (e.g., Brindley,
1998; Chalhoub-Deville, 1997; de Jong, 1988; Lantolf & Frawley, 1985; North, 1993;
Pienemann, Johnson, & Brindley, 1988; Upshur & Turner, 1995). In light of these
problems, this study aimed to demonstrate the benefits of an assessment scheme
developed using an empirical approach. Not only did teachers' think-aloud verbal
protocols provide rich verbal descriptions of the assessment criteria, but a series of
conditional covariance-based nonparametric dimensionality techniques was also utilized
to empirically identify the criteria's dimensional structure. A theoretical analysis further
confirmed the assessment criteria. These findings demonstrate the effectiveness of the
empirical approach to assessment scheme development, and underscore the importance
of its use for diagnostic purposes, as the identification of specific assessment elements is
the most important procedure in implementing diagnostic assessment.
Integration of Feedback Research in L2 Writing and the Diagnostic Approach in
Educational Assessment
This study also filled the gap between feedback research in L2 writing and the
diagnostic approach in educational assessment. Although they have the same overarching
goal, the focus of research in these two areas lies in different directions. Most feedback
research in L2 writing examines the effect of different types of feedback using a
qualitative method or case studies, while diagnostic educational assessment is focused
primarily on developing and implementing a psychometric diagnostic model using large-
scale test data. Recent technological advancements integrating diagnostic feedback to L2
writing also have certain limitations; automated feedback programs, such as the E-Rater
and Criterion® systems developed by the Educational Testing Service (ETS), are limited
in their assessment of writing constructs, since they focus on more narrowly defined
artifacts of ESL writing skills (Hyland & Hyland, 2006).
This study attempted to expand the scope of feedback research in L2 writing by
introducing a new measurement technique, cognitive diagnostic assessment (CDA). The
CDA technique used in this study, the Reduced RUM, provided a robust diagnostic
model that generated useful and sufficient diagnostic skill profiles for student ESL
academic writing performance. The findings from this study suggest that a psychometric
diagnostic model can be applied to feedback research in L2 writing in order to
formatively assess fine-grained ESL writing processes and structures in a test domain,
thereby opening a much-needed avenue for additional research in this area.
Classification of L2 Writing Scales
This study further reconceptualized current L2 writing scale classifications.
Despite an increasing need for diagnostic assessment, very few scales (e.g., Knoch's
[2007] diagnostic ESL academic writing scale) have been developed to offer such
assessment in L2 academic writing. In the L2 writing assessment literature, rating scales
are classified primarily as holistic, analytic, or primary trait scales (based on scoring
methods) or as user-oriented, assessor-oriented, or constructor-oriented (based on
assessment purpose), with little consideration given to their formative or summative nature. In
response, this study developed and validated a diagnostic ESL writing assessment
scheme, contributing to the current L2 writing scale literature.
Usefulness of Argument-based Approaches to Validity
By building and supporting arguments for the score-based interpretation and use
of the EDD checklist in ESL academic writing, this study demonstrated the usefulness of
an argument-based approach to validity. First proposed by Kane (1992), it suggests that
test-score interpretation is associated with a chain of interpretive arguments, and that the
plausibility of those arguments determines the validity of test-score interpretations. In
this study, the central research questions were formulated based upon the logical process
of the argument-based approach to validity, guiding a set of comprehensive procedures for
the development of the checklist and justifying its score-based interpretations and uses.
Although this study did not explicitly propose rebuttals (i.e., counter-arguments) for the
use of the EDD checklist, this evidentiary reasoning process made it possible to address
various aspects of validity inferences and to examine assumptions pertaining to different
types of evidence. It also demonstrated that a coherent and unified set of procedures
guides test developers and helps assessment users formulate and justify their
interpretations and assessment decisions (Bachman, 2005; Kane, 2001). This argument-
based approach to validation provides an overarching framework that could offer greater
insight into ESL research problems.
Practical Implications
Development of a Diagnostic Score Report Card
A well-developed diagnostic assessment scheme can make major contributions
to instructional practice and can have direct implications for student learning. It will be
useful not only for ESL teachers to identify the areas in which ESL students most need
improvement and track their progress, but also for ESL students themselves to monitor
and guide their learning processes. The diagnostic approach will also be of value to
curriculum developers, who are charged with designing effective ESL curricula in order
to maximize educational benefits.
One way of providing such a benefit is through the development of a diagnostic
score report card. As discussed earlier, effective diagnostic feedback is concrete,
descriptive, fairly direct, and addresses all aspects of the performance being assessed,
so that students can interpret the results and take appropriate
future action (Alderson, 2007; Black & Wiliam, 1998; Ferris, 1995, 2003; Shohamy,
1992; Spolsky, 1990; Straub, 1997). At the same time, the purpose of diagnostic
feedback is to inform diverse teaching and learning stake-holders (Nichols, 1994;
Nichols, Chipman, & Brennan, 1995; Leighton & Gierl, 2007; Pellegrino & Chudowsky,
2003). As Shohamy (1992) argued,
The main reason that tests can be useful is that they can provide administrators,
teachers, and students with valuable information about and insight into teaching
and learning. This information can then be utilized to improve learning and
teaching. For example, information obtained from tests can provide evidence of
students' ability over a whole range of skills and subskills, achievement and
proficiency, and on a continuous basis. (p. 514)
If student performance can be tracked over time, taking effort, improvement, and
progression into account, the positive impact on both teaching and learning would be
enormous. Along the same lines, if different types of diagnostic information can be
delivered to different types of stake-holders, diagnostic feedback would be maximized.
A hypothetical score report card was developed to provide a student named
Junko with diagnostic information about her ESL writing performance (see Figure 22).
She was assumed to have written an essay on one of the two prompts used in this study.
An adaptation of Jang's (2005, 2009a) DiagnOsis, this hypothetical version consisted of
four parts: (a) overall writing ability, (b) writing score, (c) writing skills profile, and (d)
writing skills that need to be improved, with each part written in simple enough language
for Junko to understand the report easily. The first part of the report card, Your Overall
Writing Ability, describes Junko‟s overall performance, pointing out the writing skills in
which she is most and least proficient. Notably, overall holistic writing proficiency levels
such as Level 1, Level 2, Level 3, or Level 4 are not reported, since the skills diagnosis
approach does not render a single composite score or level, which can mask a student's
strengths and weaknesses.
The second part of the report card, Your Writing Score, presents the number of
points earned by Junko across the 35 descriptors, classified into easy, medium, and
difficult categories based upon the difficulty measures estimated by the MFRM analysis.
The cut-off for each category was determined by visually inspecting the borderlines at
which the three descriptor clusters were distinctly divided. The easy category included 7
descriptors with difficulty measures ranging from -1.82 to -0.64 logits; the medium
category included 16 descriptors with difficulty measures ranging from 1.03 to 1.19
logits; and the difficult category included 12 descriptors with difficulty measures ranging
from 0.92 to 1.09 logits. The subscores that Junko earned across the three categories are
also reported. As with overall skill proficiency, however, the observed total score is not
reported because it could misrepresent Junko's true writing ability. As the case analysis
demonstrated earlier, Junko's writing competence could differ fundamentally from that
of someone with the same total score who earned points on descriptors with different
difficulty measures.
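As a sketch of how the category counts and subscores follow from the descriptor difficulty labels, the snippet below uses the 35 labels shown in Figure 22; the helper function is hypothetical, not part of the study's tooling.

```python
# A sketch of the subscore bookkeeping described above. The difficulty labels
# for the 35 descriptors are copied from the report card in Figure 22
# (D = difficult, M = medium, E = easy).

DIFFICULTY = ("M M D M D D D D E D M M M D M E M M D M "
              "E E M M E D D M D M D M E M E").split()

def category_counts(labels):
    """Count how many descriptors fall into each difficulty category."""
    return {cat: labels.count(cat) for cat in ("E", "M", "D")}

# Subscores are then reported per category (e.g., "6/7 easy descriptors")
# by counting, within each category, the descriptors for which the student
# earned a point.
```

Running the counts against the figure's labels recovers the category sizes stated in the text: 7 easy, 16 medium, and 12 difficult descriptors.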
The third part of the report card, Your Writing Skills Profile, provides a detailed
description of Junko‟s performance across five writing skills, with a bar graph
summarizing her mastery level for each skill. Instructions on how to read the graph are
provided next to it. The graph further classifies Junko's writing performance into mastery,
undetermined, and non-mastery states. A detailed description of her performance is then
provided for each skill on the next two pages. It is noteworthy that proficiency levels are
attached to each skill based upon Junko's posterior probability of mastery (ppm) for the
five writing skills. A carefully-designed standard-setting procedure might help to
accurately determine her skill proficiency levels for the five writing skills. In each skill
category, the skill definition is presented along with the characteristics of a competent
writer in that skill. Specific descriptors for which Junko earned points are also presented.
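The mastery, undetermined, and non-mastery states can be sketched as a simple classification rule over the posterior probability of mastery (ppm), using the 0.4 and 0.6 cut-offs given in the report card's graph instructions; the profile values below are hypothetical.

```python
# A minimal sketch of the mastery classification rule used in the report card:
# ppm below 0.4 is treated as non-mastery, above 0.6 as mastery, and anything
# in between as undetermined. The cut-offs come from the graph instructions
# in Figure 22; the ppm values in the example profile are invented.

def classify(ppm: float) -> str:
    """Classify a posterior probability of mastery into one of three states."""
    if ppm < 0.4:
        return "non-mastery"
    if ppm > 0.6:
        return "mastery"
    return "undetermined"

# Hypothetical probabilities for the five writing skills:
profile = {"CON": 0.35, "ORG": 0.20, "GRM": 0.85, "VOC": 0.90, "MCH": 0.55}
states = {skill: classify(p) for skill, p in profile.items()}
```

A carefully designed standard-setting procedure, as the text notes, would be needed to justify where such cut-offs are placed in practice.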
The fourth and final part of the report card describes Writing Skills that Need to
be Improved, and provides examples of ways in which they can be improved. If Junko's
ESL writing teacher then expands this part with detailed guidelines, Junko could take
appropriate future action.
Care must be taken in using the diagnostic assessment report card. First and
foremost, the report card should be used to inform Junko, but not other stake-holders nor
for other purposes. As Alderson (1991) noted, assessment purposes and audiences are the
critical factors that must be considered in any assessment context. A different type of
score report should thus be developed to provide other stake-holders with diagnostic
information about Junko's writing performance. As such, it should be noted that the EDD
checklist was developed for use in an academic context and should not be used
without considering the context in which an assessment takes place. Second, the ways in
which Junko's writing skill mastery is classified and interpreted need to be carefully
determined. The Reduced RUM provides a limited mastery standing, including only
mastery, non-mastery, and undetermined states. A more finely-classified mastery
standing might be more informative for use in describing student ESL academic writing
ability. Finally, technological advances could enable this report card to be incorporated
into computer-assisted assessment, allowing writing samples to be automatically scored
and students to receive immediate individualized diagnostic feedback on their writing.
Diagnostic ESL Writing Profile Student Name: Junko Sawaki
Your Overall Writing Ability
Your Writing Score
Descriptor (D): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Your Score: (each cell marked "Got a point", "Didn't get a point", or "Didn't apply")
Difficulty: M M D M D D D D E D M M M D M E M M D M E E M M E D D M D M D M E M E
Legend: D = Difficult, M = Medium, E = Easy
You got points for 7/12 difficult descriptors, 7/16 medium descriptors, and 6/7 easy descriptors.
Your Writing Skills Profile
Five writing skills were assessed based on the essay that you wrote. These are Content Fulfillment (CON), Organizational Effectiveness (ORG), Grammatical Knowledge (GRM), Vocabulary Use (VOC), and Mechanics (MCH).
Figure 22. An example of the diagnostic ESL writing profile
[Bar graph: probability of mastery (0.0 to 1.0) for each of the five writing skills: CON, ORG, GRM, VOC, and MCH]
How to interpret the graph
o The graph illustrates the degree to which you have mastered each of the five writing skills.
o If the bar does not reach 0.4 of the probability area, you might need to improve the skill.
o If the bar lies between 0.4 and 0.6 of the probability area, it is difficult to determine your level of mastery.
o If the bar stretches beyond 0.6 of the probability area, you may have mastered that particular skill.
You demonstrated excellent grammar and vocabulary knowledge in your essay. You were also able to apply a variety of English grammar rules effectively and use the appropriate words in the given context. However, your writing skills were relatively weak in terms of constructing good content, organizing the structure of your essay, and following English writing conventions. In particular, you were least successful at presenting a clear topic sentence, presenting a unified idea in each paragraph, and expanding your ideas well throughout each paragraph.
CONTENT FULFILLMENT ★★★☆*
Content fulfillment assesses the degree to which a writer satisfactorily addresses a given topic. A writer who shows strength in this area generally demonstrates an excellent understanding of the topic by presenting clear and substantial arguments supported by specific examples.
* The number of black stars indicates the level of skill proficiency (e.g., ★★★☆ = Level 3). ** The superscript number indicates the number of a descriptor.
ORGANIZATIONAL EFFECTIVENESS ★☆☆☆
Organizational effectiveness assesses the way in which a writer organizes and develops his or her ideas. A writer who is competent in this area generally demonstrates the ability to construct and develop a paragraph effectively and to connect textual elements well, both within and between paragraphs using appropriate cohesive and transitional devices.
GRAMMATICAL KNOWLEDGE ★★★★
Grammatical knowledge assesses the extent to which a writer demonstrates consistent ability to properly apply the rules of English grammar. A well-written essay adheres to English grammar rules with full flexibility and accuracy, and displays a variety of syntactic structures and few linguistic errors.
You might need more work in:
1** Understanding a given question and answering accordingly.
2 Writing a clear essay that can be read without causing any
comprehension problems for readers.
4 Presenting a clear thesis statement.
6 Providing enough supporting ideas and examples.
8 Providing specific and detailed supporting ideas and examples.
You might be able to:
9 Organize your ideas into paragraphs and include an introductory
paragraph, a body, and a concluding paragraph.
12 Connect each paragraph to the rest of the essay.
14 Use linking words effectively.
You might need more work in:
10 Presenting a clear topic sentence that ties to supporting sentences in
each body paragraph.
11 Presenting one distinct and unified idea in each paragraph.
13 Developing or expanding ideas well throughout each paragraph.
You might need more work in:
17 Making complete sentences.
18 Connecting independent clauses correctly.
19 Using grammatical or linguistic features correctly in order not to
impede comprehension.
20 Using verb tenses appropriately.
25 Making pronouns agree with their referents.
You might be able to:
3 Write concisely and present few redundant ideas or linguistic
expressions.
5 Make strong arguments.
7 Provide appropriate and logical supporting ideas and examples.
You might be able to:
15 Use a variety of sentence structures.
16 Demonstrate an understanding of English word order.
21 Demonstrate consistent subject-verb agreement.
22 Use singular and plural nouns appropriately.
23 Use prepositions appropriately.
24 Use articles appropriately.
VOCABULARY USE ★★★★
Vocabulary use assesses the extent to which a writer demonstrates great depth and breadth of vocabulary knowledge. A writer who is strong in this area generally uses a broad range of sophisticated words, knows how to combine words, and displays accurate knowledge of word form and usage.
MECHANICS ★★★☆
Mechanics assesses the extent to which a writer follows the conventions of English academic writing. A writer who is strong in this area generally demonstrates correct use of spelling, punctuation, capitalization, and indentation.
Writing Skills that Need to be Improved
Figure 22. An example of the diagnostic ESL writing profile (Continued)
You might need more work in:
26 Using sophisticated or advanced vocabulary.
You might be able to:
31 Spell words correctly.
33 Use capital letters appropriately.
34 Indent each paragraph appropriately.
35 Use appropriate tone and register throughout the essay.
You might need more work in:
32 Using punctuation marks appropriately.
Learn more about how to effectively organize an essay structure. Before writing an essay, you might need to think about what the thesis of your essay is and what the topic sentences for each paragraph will be. A good essay generally presents a clear thesis statement in the introduction and topic sentences at the beginning of the body paragraphs. Also, try to present one distinct idea in each paragraph. When more than one idea is presented in a single paragraph, it weakens your arguments and causes comprehension difficulties for readers. In addition, whenever you present your idea, try to expand it fully throughout each paragraph. If you develop your writing skills in these areas, you will be a more competent writer!
You might be able to:
27 Use a wide range of vocabulary.
28 Choose appropriate vocabulary to convey the intended meaning.
29 Combine and use words appropriately.
30 Use appropriate word forms (noun, verb, adjective, adverb, etc).
Suggestions for Future Research
This study has addressed issues from three intersecting areas of research: (a) ESL
academic writing, (b) diagnostic assessment, and (c)
empirical methods in scale development. Despite increasing interest in and need for a
diagnostic approach, few diagnostic assessment schemes have been developed that align
with the empirical and theoretical sources of ESL academic writing. In response, a new
diagnostic assessment scheme for ESL academic writing, called the Empirically-derived
Descriptor-based Diagnostic (EDD) checklist, was developed and its score-based
interpretations and uses were validated. The checklist's validation process opened
possibilities for future research in the areas discussed below.
Had data been collected from students, the EDD checklist could have
incorporated their cognitive processes. Although the checklist was constructed using
think-aloud verbal data from teachers, which focused on what they considered important,
it was unable to fully reflect the actual writing knowledge, processes, and strategies
exhibited by students in their writing. If these writing processes could have been
observed using students' introspective or retrospective verbal protocols, the writing
abilities to be assessed might have been better understood, and a more valid assessment
tool would have been created. Research incorporating students' perspectives is
scarce in the current scale development literature, and further work in this area is
warranted.
Another area for further investigation is the application of CDA models to
polytomous data. Although the teachers were fully aware that writing competence cannot
be dichotomized, they were asked to make binary choices while using the checklist. This
lack of a continuum on which writing performance can be measured increased their
psychological load and resulted in low inter-teacher reliability. Although teachers‟ rating
data could have been gathered using a scale and then artificially dichotomized (e.g.,
“strongly disagree” and “somewhat disagree” = no, “strongly agree” and “somewhat
agree” = yes), this method was not considered because it could distort teachers‟ decisions
and manipulate the data. The checklist lacked a scale primarily because most
current CDA models do not deal with polytomous data, and while a few (e.g., RUM
[Hartz et al., 2002], General Diagnostic Model [von Davier, 2005]) have begun to take
polytomous data into account, their robustness has not been intensively examined with
real item response data. The sample size needed to handle polytomous data was another
concern, since those models require a larger sample in order to estimate the greater
number of parameters. If these problems can be resolved, the EDD checklist could be
revised to include a scale that enables teachers to make multi-level decisions about
student writing performance.
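The artificial dichotomization rejected above can be made concrete. A minimal sketch, assuming a four-point agreement scale, shows how collapsing ratings to yes/no discards exactly the gradation a polytomous CDA model would preserve:

```python
# Sketch of artificially dichotomizing a four-point agreement scale,
# the approach the study rejected because it distorts teachers' decisions.
# Scale labels and the mapping are illustrative.

AGREE = {"strongly agree", "somewhat agree"}

def dichotomize(rating):
    """Collapse a four-point rating to yes/no, losing strength of agreement."""
    return "yes" if rating in AGREE else "no"

ratings = ["strongly agree", "somewhat agree",
           "somewhat disagree", "strongly disagree"]
print([dichotomize(r) for r in ratings])  # ['yes', 'yes', 'no', 'no']
```

Note that "strongly agree" and "somewhat agree" become indistinguishable after collapsing, which is the information loss the paragraph describes.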
Along the same lines, further consideration is needed as to whether theories of
ESL writing development truly correspond to the underlying assumptions of the CDA
model used in this study. The Reduced RUM uses a discrete representation of knowledge
structure, dichotomizing skill mastery probability into mastery and non-mastery.
However, it is doubtful that such dichotomization is conceptually possible in ESL writing
assessment. For example, if a student is judged to be a master of grammatical knowledge,
does it mean that he or she has absolutely mastered that skill, and does not need to work
on that area further? It would be interesting to explore the potential of latent trait CDA
models (e.g., MIRT-C [Reckase & McKinley, 1991], MIRT-NC [Sympson, 1977]), so
that student knowledge structures can be scaled according to a continuous ability
continuum. Similarly, interaction among writing skills must be further examined.
Although the Reduced RUM assumed conjunctive interaction among writing skills, this
assumption was not substantively examined. If students can receive credit for a descriptor without
executing all of the skills required for that descriptor, compensatory CDA models will be
equally suitable for estimating student writing ability. Future research could investigate
which CDA model best represents theories of ESL writing.
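The conjunctive versus compensatory distinction can be illustrated with two simplified response rules. This is a sketch under assumed parameter values, not the study's parameterization: a Reduced-RUM-style conjunctive rule grants a high success probability only when every skill required by a descriptor is mastered, while a compensatory rule lets each mastered skill contribute partial credit.

```python
# Simplified contrast of conjunctive vs. compensatory response rules.
# Parameter values (pi, slip, base) and skill names are illustrative.

def p_conjunctive(mastered, required, pi=0.9, slip=0.1):
    """Conjunctive: success is likely only if all required skills are mastered."""
    return pi if required <= mastered else slip

def p_compensatory(mastered, required, base=0.1):
    """Compensatory: each mastered required skill adds partial credit."""
    share = len(mastered & required) / len(required)
    return base + (1 - base) * share

required = {"grammar", "cohesion"}
print(p_conjunctive({"grammar"}, required))   # 0.1 (cohesion missing, no credit)
print(p_compensatory({"grammar"}, required))  # ~0.55 (partial credit)
```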
Of particular interest are potential applications for CDA in the area of integrative
assessment. Alderson (2005) speculated that a diagnostic approach is not encouraged in
direct writing assessment because it is tailored to assess discrete low-level language
abilities rather than integrative higher-order skills. Despite this assumption, this study used
a CDA model to diagnose student ESL writing competence. The EDD checklist broke
down global writing ability into specific measurable elements, with performance
assessed according to the extent to which these elements were mastered. However,
whether a discrete-point method can ever define and assess the construct of ESL writing
remains unclear. The extent to which the discrete-point method in a CDA model can
explain the act of human writing encountered in real-life situations is also uncertain. If
these problems continue to threaten the knowledge representation and the authenticity of
assessments, further research could explore alternative ways of diagnosing higher-order
global writing ability.
A new diagnostic ESL writing test, created in collaboration with teachers and test
developers, would also be a worthwhile direction. The current practice of retrofitting CDA models to
existing non-diagnostic tests is problematic (DiBello et al., 2007; Jang, 2009a; Lee &
Sawaki, 2009b); indeed, the ESL essays used to develop the EDD checklist were
originally a part of a non-diagnostic test, so it is not known whether the checklist would
take a different form if essays written for diagnostic purposes were used. A computer
technology-assisted assessment system could be a promising resource for effectively
delivering new diagnostic tests and feedback in this regard. In an ESL writing context, it
could mean that students would be asked to complete a writing task online, and would
receive immediate feedback tailored to their performance. The authenticity of such a test
would be greater if it could present the target language features found in real-life situations,
such as sending an email or posting comments on a web site.
The final recommendation concerns the EDD checklist's use in real
classroom teaching and learning settings. This study examined the checklist's
effectiveness with limited use and was not able to explore how the checklist would be
used in a real ESL academic writing class due to logistical problems. The next logical
step would thus be to observe how teachers and students (who did not participate in the
current study) might use the checklist in actual practice, in order to interpret assessment
outcomes from a longitudinal perspective. Teachers might want to use the checklist to
track students' writing performance over time, so that students receive both short- and
long-term feedback. This continued investigation would be particularly important,
considering that current research in ESL writing focuses on process-oriented writing
instruction in which students revise and resubmit multiple drafts of their work. As
Watanabe (2004) noted, claims about long-term washback are difficult to sustain without
examining an assessment's continued effects over time.
REFERENCES
Alderson, J. C. (1990a). Testing reading comprehension skills (Part one). Reading in a
Foreign Language, 6, 425-438.
Alderson, J. C. (1990b). Testing reading comprehension skills (Part two): Getting
students to talk about taking a reading test (A pilot study). Reading in a Foreign
Language, 7, 465-503.
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language
testing in the 1990s (pp. 71-86). London: Macmillan.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. London: Continuum.
Alderson, J. C. (2007). The challenge of (diagnostic) testing: Do we know what we are
measuring? In J. Fox, M. Wesche, D. Bayliss, L. Cheng, C. Turner, & C. Doe
(Eds.), Language testing reconsidered (pp. 21-39). Ottawa: University of Ottawa
Press.
Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as
embodied in test questions. Reading in a Foreign Language, 5, 253-270.
American Council on the Teaching of Foreign Languages (ACTFL). (2001). ACTFL
proficiency guidelines. Hastings-on-Hudson, NY: ACTFL.
Arnaud, P. J. L. (1992). Objective lexical and grammatical characteristics of L2 written
composition and the validity of separate-component tests. In P. J. L. Arnaud & H.
Béjoint (Eds.), Vocabulary and applied linguistics (pp. 133-145). London:
Macmillan.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F. (2003). Constructing an assessment use argument and supporting claims
about test taker-assessment task interactions in evidence-centered assessment
design. Measurement: Interdisciplinary Research and Perspectives, 1, 63-65.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment
Quarterly, 2, 1-34.
Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of
communicative proficiency. TESOL Quarterly, 16, 449-464.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford
University Press.
Bachman, L. F., & Savignon, S. J. (1986). The evaluation of communicative language
proficiency: A critique of the ACTFL Oral Interview. The Modern Language
Journal, 70, 380-390.
Bardovi-Harlig, K. (1992). A second look at T-unit analysis: Reconsidering the sentence.
TESOL Quarterly, 26, 390-395.
Bardovi-Harlig, K., & Bofman, T. (1989). Attainment of syntactic and morphological
accuracy by advanced language learners. Studies in Second Language
Acquisition, 11, 17-34.
Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating
processes and outcomes. Unpublished doctoral dissertation. University of
Toronto, Canada.
Beaman, K. (1984). Coordination and subordination revisited: Syntactic complexity in
spoken and written narrative discourse. In L. Hamp-Lyons (Ed.), Assessing ESL
writing in academic contexts (pp. 37-49). Norwood, NJ: Ablex.
Bereiter, C., & Scardamalia, M. (1987). The psychology of written composition. Hillsdale,
NJ: Lawrence Erlbaum Associates.
Bernhardt, E. B. (1984). Toward an information processing perspective in foreign
language reading. The Modern Language Journal, 68, 322-331.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University
Press.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in
Education, 5, 7-74.
Boiarsky, C. (1984). What the authorities tell us about teaching writing. Journal of
Teaching Writing, 3, 213-223.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F.
Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition
and language testing research (pp. 112-140). Cambridge: Cambridge University
Press.
Brown, J. D., & Bailey, K. (1984). A categorical instrument for scoring second language
writing skills. Language Learning, 34, 21-42.
Buck, G., & Tatsuoka, K. (1998). Application of rule-space methodology to listening test
data. Language Testing, 15, 118-142.
Canale, M. (1983). From communicative competence to communicative language
pedagogy. In J. C. Richards & R. W. Schmidt (Eds.), Language and
communication (pp. 2-27). London: Longman.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to
second language teaching and testing. Applied Linguistics, 1, 1-47.
Casanave, C. P. (1994). Language development in students' journals. Journal of Second
Language Writing, 3, 179-201.
Celce-Murcia, M., & Larsen-Freeman, D. (1999). The grammar book: An ESL/EFL
teachers' course. Boston: Heinle & Heinle.
Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test
construction. Language Testing, 14, 16-33.
Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical
overview. Research in the Teaching of English, 18, 65-81.
Cobb, T. (2006). Classic VP English version 3.0. Retrieved February 03, 2009, from
http://www.lextutor.ca/vp/.
Cohen, A. D., & Cavalcanti, M. (1990). Feedback on compositions: Teacher and student
verbal reports. In B. Kroll (Ed.), Second language writing: Research insights for
the classroom (pp. 155-177). Cambridge: Cambridge University Press.
Connor, U., & Carrell, P. (1993). The interpretation of tasks by writers and readers in
holistically rated direct assessments of writing. In J. Carson & I. Leki (Eds.),
Reading in the composition classroom (pp. 141-160). Boston: Heinle.
Cooper, C. R. (1977). Holistic evaluation of writing. In C. R. Cooper & L. Odell (Eds.),
Evaluating writing: Describing, measuring, judging (pp. 3-31). Urbana, IL:
National Council of Teachers of English.
Cooper, T. C. (1976). Measuring written syntactic patterns of second language learners of
German. The Journal of Educational Research, 69, 176-183.
Cooper, T. C. (1981). Sentence combining: An experiment in teaching writing. The
Modern Language Journal, 65, 158-165.
Council of Europe. (2001). The Common European Framework of Reference for
Languages: Learning, teaching and assessment. Cambridge: Cambridge
University Press.
Creswell, J.W. (2003). Research design: Qualitative, quantitative and mixed methods
approaches. Thousand Oaks, CA: Sage.
Crismore, A., Markkanen, R., & Steffensen, M. S. (1993). Metadiscourse in persuasive
writing: A study of texts written by American and Finnish university students.
Written Communication, 10, 39-71.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed., pp. 443-507). Washington, DC: American Council on
Education.
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San
Francisco: Jossey-Bass.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer (Ed.), Test
validity (pp. 3-17). Hillsdale, NJ: Erlbaum.
Crooks, T., Kane, M., & Cohen, A. (1996). Threats to the valid use of assessment.
Assessment in Education, 3, 265-285.
Crowhurst, M. (1987). Cohesion in argument and narration at three grade levels.
Research in the Teaching of English, 21, 185-201.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language
Testing, 7, 31-51.
Cumming, A. (1997). The testing of writing in a second language. In C. Clapham & D.
Corson (Eds.), Encyclopedia of language and education: Volume 7 Language
testing and assessment (pp. 51-63). Dordrecht, Netherlands: Kluwer.
Cumming, A. (1998). Theoretical perspectives on writing. Annual Review of Applied
Linguistics, 18, 61-78.
Cumming, A. (2001). The difficulty of standards, for example in L2 writing. In T. Silva
& P. Matsuda (Eds.), On second language writing (pp. 209-229). Mahwah, NJ:
Lawrence Erlbaum.
Cumming, A. (2002). Assessing L2 writing: Alternative constructs and ethical dilemmas.
Assessing Writing, 8, 73-83.
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL
2000 prototype writing tasks: An investigation into raters' decision making and
development of a preliminary analytic framework. TOEFL Monograph Series 22.
Princeton, New Jersey: Educational Testing Service.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating
ESL/EFL writing tasks: A descriptive framework. The Modern Language
Journal, 86, 67-96.
Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000
writing framework: A working paper. TOEFL Monograph Series, Report No. 18.
Princeton, NJ: Educational Testing Service.
Cumming, A., & Riazi, A. M. (2000). Building models of adult second language writing
instruction. Learning and Instruction 10, 55-71.
Dandonoli, P., & Henning, G. (1990). An investigation of the construct validity of the
ACTFL proficiency guidelines and oral interview procedure. Foreign Language
Annals, 23, 11-22.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999).
Dictionary of language testing. Cambridge: Cambridge University Press.
de Jong, J. (1988). Rating scales and listening comprehension. Australian Review of
Applied Linguistics, 11, 73-87.
DiBello, L. V., Roussos, L. A., & Stout, W. (2007). Review of cognitively diagnostic
assessment and a summary of psychometric models. In C. R. Rao & S. Sinharay
(Eds.), Handbook of statistics, Volume 26, Psychometrics (pp. 979-1030).
Amsterdam, The Netherlands: Elsevier.
DiBello, L. V., & Stout, W. (2007). Guest editor's introduction and overview: IRT-based
cognitive diagnostic models and related methods. Journal of Educational
Measurement, 44, 285-291.
DiBello, L. V., & Stout, W. (2008). Arpeggio documentation and analyst manual (Ver.
3.1.001) [Computer software]. St. Paul, MN: Assessment Systems Corporation.
DiBello, L. V., Stout, W., & Roussos, L. A. (1995). Unified cognitive/psychometric
diagnostic assessment likelihood-based classification techniques. In P. D. Nichols,
S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp.
361-389). Mahwah, NJ: Erlbaum.
Douglas, J., Kim, H-R., Roussos, L., Stout, W., & Zhang, J. (1999). LSAT dimensionality
analysis for December 1991, June 1992, and October 1992 administrations
(Law School Admission Council Statistical Report 95-05). Newton, PA: LSAT.
Dulay, H., Burt, M., & Krashen, S. (1982). Language Two. New York: Oxford
University Press.
Educational Testing Service (ETS). (2007). TOEFL iBT Tips: How to prepare for the
TOEFL iBT. Princeton, NJ: Educational Testing Service. Retrieved July 3, 2008,
from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Tips.pdf
Engber, C. (1995). The relationship of lexical proficiency to the quality of ESL
compositions. Journal of Second Language Writing, 4, 139-155.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data.
Cambridge, MA: MIT Press.
Evola, J., Mamer, E., & Lentz, B. (1980). Discrete point versus global scoring for
cohesive devices. In J. W. Oller & K. Perkins (Eds.), Research in language testing
(pp. 177-181). Rowley, MA: Newbury House.
Fathman, A. K., & Whalley, E. (1990). Teacher response to student writing: Focus on
form versus content. In B. Kroll (Ed.), Second language writing: Research
insights for the classroom (pp. 178-190). Cambridge, UK: Cambridge University
Press.
Ferris, D. (1995). Student reactions to teacher response in multiple-draft composition
classrooms. TESOL Quarterly, 29, 33-53.
Ferris, D. (2003). Responding to writing. In. B. Kroll (Ed.), Exploring the dynamics of
second language writing (pp. 119-140). New York: Cambridge University Press.
Ferris, D., & Roberts, B. (2001). Error feedback in L2 writing classes: How explicit does
it need to be? Journal of Second Language Writing, 10, 161-184.
Figueras, N., North, B., Takala, S., Verhelst, N., & Avermaet, P. (2005). Relating
examinations to the common European framework: A manual. Language Testing,
22, 261-279.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational
research. Acta Psychologia, 37, 359–374.
Fischer, R. A. (1984). Testing written communicative competence in French. The Modern
Language Journal, 68, 13-20.
Fitzgerald, J., & Spiegel, D. L. (1986). Textual cohesion and coherence in children's
writing. Research in the Teaching of English, 20, 263-280.
Flower, L., & Hayes, J. (1981). A cognitive process theory of writing. College
Composition and Communication, 32, 365-387.
Foster, P., & Skehan, P. (1996). The influence of planning and task type on second
language performance. Studies in Second Language Acquisition, 18, 299-323.
Fournier, P. (2003). Blueprints: A guide to correct writing. Saint-Laurent, Quebec:
Pearson Longman.
Friedlander, A. (1990). Composing in English: Effects of a first language on writing in
English as a second language. In B. Kroll (Ed.), Second language writing:
Research insight for the classroom (pp. 109-125). Cambridge: Cambridge
University Press.
Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. English
Language Teaching Journal, 41, 287-291.
Fulcher, G. (1993). The construction and validation of rating scales for oral tests in
English as a foreign language. Unpublished doctoral dissertation, University of
Lancaster, UK.
Fulcher, G. (1996a). Invalidating validity claims for the ACTFL oral rating scale. System,
24, 163-172.
Fulcher, G. (1996b). Does thick description lead to smart tests? A data-based approach to
rating scale construction. Language Testing, 13, 208-238.
Fulcher, G. (1997). The testing of L2 speaking. In C. Clapham & D. Corson (Eds.),
Encyclopedia of language and education: Volume 7 Language testing and
assessment (pp. 75-85). London: Kluwer.
Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced
resource book. London & New York: Routledge.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis.
London: Chapman and Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple
sequences. Statistical Science, 7, 457-511.
Glaser, B., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for
qualitative research. Chicago, IL: Aldine.
Goulden, N. R. (1992). Theory and vocabulary for communication assessments.
Communication Education, 41, 258-269.
Goulden, N. R. (1994). Relationship of analytic and holistic methods to raters' scores for
speeches. The Journal of Research and Development in Education, 27, 73-82.
Grabe, W. (2001). Notes toward a theory of second language writing. In T. Silva & P.
Matsuda (Eds.), On second language writing (pp. 39-57). Mahwah, NJ: Lawrence
Erlbaum.
Grabe, W., & Kaplan, R. (1996). Theory and practice of writing. New York: Longman.
Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework
for mixed-method evaluation design. Educational Evaluation and Policy
Analysis, 11, 255-274.
Grove, E., & Brown, A. (2001). Tasks and criteria in a test of oral communication skills
for first-year health science students. Melbourne Papers in Language Testing, 10,
37-47.
Haertel, E. H. (1989). Using restricted latent class models to map skill structure of
achievement items. Journal of Educational Measurement, 26, 301-321.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.),
Assessing second language writing in academic contexts (pp. 241- 276).
Norwood, NJ: Ablex.
Hamp-Lyons, L. (1995). Rating nonnative writing: The trouble with holistic scoring.
TESOL Quarterly, 29, 759-762.
Hamp-Lyons, L., & Henning, G. (1991). Communicative writing profiles: An
investigation of the transferability of a multiple-trait scoring instrument across
ESL writing assessment contexts. Language Learning, 41, 337-373.
Hamp-Lyons, L., & Kroll, B. (1997). TOEFL 2000 writing: Composition, community,
and assessment. TOEFL Monograph Series Report No. 5. Princeton, NJ:
Educational Testing Service.
Harley, B., & King, M. L. (1989). Verb lexis in the written composition of young L2
learners. Studies in Second Language Acquisition, 11, 415-440.
Hartz, S. M. (2002). A Bayesian framework for the Unified Model for assessing cognitive
abilities: Blending theory with practicality. Unpublished doctoral dissertation.
University of Illinois at Urbana Champaign.
Hartz, S. M., Roussos, L., & Stout, W. (2002). Skills diagnosis: Theory and practice.
Unpublished manuscript. University of Illinois at Urbana Champaign.
Hayes, J. R. (1996). A new framework for understanding cognition and affect in writing.
In C. M. Levy & S. Ransdell (Eds.), The science of writing (pp. 1-27). Mahwah,
NJ: Lawrence Erlbaum Associates.
Hedgcock, J., & Lefkowitz, N. (1994). Feedback on feedback: Assessing learner
receptivity to teacher response in L2 composing. Journal of Second Language
Writing, 3, 141-163.
Hendrickson, J. M. (1980). Error correction in foreign language teaching: Recent theory,
research, and practice. In K. Croft (Ed.), Readings on English as a second
language. Cambridge, Mass: Winthrop Publishers.
Hinkel, E. (2003). Simplicity without elegance: Features of sentences in L1 and L2
academic texts. TESOL Quarterly, 37, 275-300.
Hinkel, E. (2004). Teaching academic writing: Practical techniques in vocabulary and
grammar. Mahwah: Lawrence Erlbaum Associates.
Homburg, T. (1984). Holistic evaluation of ESL compositions: Can it be validated
objectively? TESOL Quarterly, 18, 87-107.
House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage.
Hughey, J. B., Wormuth, D. R., Hartfiel, V. F., & Jacobs, H. L. (1983). Teaching ESL
composition: Principles and techniques. Rowley, MA: Newbury House.
Hunt, K. W. (1970). Recent measures in syntactic development. In M. Lester (Ed.),
Readings in applied transformation grammar (pp. 179-192). New York: Holt,
Rinehart and Winston.
Huot, B. (1996). Toward a new theory of writing assessment. College Composition and
Communication, 47, 549-566.
Hyland, F. (1998). The impact of teacher written feedback on individual writers. Journal
of Second Language Writing, 7, 255-286.
Hyland, K., & Hyland, F. (2006). Feedback on second language students' writing.
Language Teaching, 3, 83-101.
Ingram, D. E. (1984). Introduction to the ASLPR. In Commonwealth of Australia,
Department of Immigration and Ethnic Affairs, Australian Second Language
Proficiency Ratings (pp. 1-29). Canberra: Australia Government Publishing
Service.
Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor
ESL essays. Journal of Second Language Writing, 4, 253-272.
Ishikawa, S. (1995). Objective measurement of low-proficiency EFL narrative writing.
Journal of Second Language Writing, 4, 51-70.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL
composition: A practical approach. Rowley, MA: Newbury House.
Jafarpur, A. (1991). Cohesiveness as a basis for evaluating compositions. System, 19,
459-465.
Jang, E. E. (2005). A validity narrative: The effects of cognitive reading skills diagnosis
on ESL adult learners' reading comprehension ability in the context of Next
Generation TOEFL. Unpublished doctoral dissertation. University of Illinois at
Urbana Champaign.
Jang, E. E. (2008). A framework for cognitive diagnostic assessment. In C. A. Chapelle,
Y.‐R. Chung, & J. Xu (Eds.), Towards adaptive CALL: Natural language
processing for diagnostic language assessment (pp. 117‐131). Ames, IA: Iowa
State University.
Jang, E. E. (2009a). Cognitive diagnostic assessment of L2 reading comprehension
ability: Validity arguments for Fusion Model application to LanguEdge
assessment. Language Testing, 26, 31-73.
Jang, E. E. (2009b). Demystifying a Q-Matrix for making diagnostic inferences about L2
reading skills. Language Assessment Quarterly, 6, 210-238.
Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions,
and connections with nonparametric item response theory. Applied
Psychological Measurement, 25, 258–272.
Kameen, P. (1979). Syntactic skill and ESL writing quality. In C. Yorio, K. Perkins, & J.
Schachter (Eds.), On TESOL '79: The learner in focus (pp. 343-364). Washington,
DC: TESOL.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112,
527-535.
Kane, M. (1994). Validating interpretive arguments for licensure and certification
examinations. Evaluation and the Health Professions, 17, 133-159.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.
Kane, M. (2002). Validating high-stakes testing programs. Educational Measurement:
Issues and Practice, 21, 31-41.
Kane, M. (2004). Certification testing as an illustration of argument-based validation.
Measurement: Interdisciplinary Research and Perspectives, 2, 135-170.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance.
Educational Measurement: Issues and Practice, 18, 5-17.
Kasai, M. (1997). Application of the rule space model to the reading comprehension
section of the test of English as a foreign language (TOEFL). Unpublished
doctoral dissertation. University of Illinois at Urbana Champaign.
Kellogg, R. (1996). A model of working memory in writing. In C. M. Levy & S. Ransdell
(Eds.), The science of writing (pp. 57-71). Mahwah, NJ: Lawrence Erlbaum
Associates.
Kepner, C. (1991). An experiment in the relationship of types of written feedback to the
development of second-language writing skills. The Modern Language Journal,
75, 305-313.
Knoch, U. (2007). Diagnostic writing assessment: The development and validation of a
rating scale. Unpublished doctoral dissertation, University of Auckland, New
Zealand.
Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European
language testing in a global context (pp. 27-48). Cambridge, UK: Cambridge
University Press.
Kunnan, A. J., & Jang, E. E. (2009). Diagnostic feedback in language assessment. In M.
Long & C. Doughty (Eds.), Handbook of second and foreign language teaching
(pp. 610-625). Malden, MA: Wiley-Blackwell.
Lantolf, J. P., & Frawley, W. (1985). Oral proficiency testing: A critical analysis. The
Modern Language Journal, 69, 337-345.
Larsen-Freeman, D. (1978). An ESL index of development. TESOL Quarterly, 12, 439-
448.
Larsen-Freeman, D. (1983). Assessing global second language proficiency. In H. W.
Seliger & M. Long (Eds.), Classroom-oriented research in second language
acquisition (pp. 287-304). Rowley, MA: Newbury House.
Larsen-Freeman, D., & Strom, V. (1977). The construction of a second language
acquisition index of development. Language Learning, 27, 123-134.
Laufer, B. (1991). The development of L2 lexis in the expression of the advanced learner.
The Modern Language Journal, 75, 440-448.
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written
production. Applied Linguistics, 16, 307-322.
Lautamatti, L. (1978). Observations on the development of the topic in simplified
discourse. In V. Kohonen & N. E. Enkvist (Eds.), Text linguistics, cognitive
learning, and language teaching (pp. 71-104). Turku, Finland: AFinLA.
Lautamatti, L. (1987). Observations on the development of the topic of simplified
discourse. In U. Connor & R. B. Kaplan (Eds.), Writing across languages:
Analysis of L2 text. Reading, MA: Addison-Wesley.
Lee, I. (2004). Error correction in L2 secondary writing classrooms: The case of Hong
Kong. Journal of Second Language Writing, 13, 285-312.
Lee, J., & Musumeci, D. (1988). On hierarchies of reading skills and text types. The
Modern Language Journal, 72, 173-187.
Lee, Y-W., & Sawaki, Y. (2009a). Application of three cognitive diagnosis models to
ESL reading and listening assessments. Language Assessment Quarterly, 6, 239-
263.
Lee, Y-W., & Sawaki, Y. (2009b). Cognitive diagnosis approaches to language
assessment: An overview. Language Assessment Quarterly, 6, 172-189.
Leighton, J. P., & Gierl, M. J. (Eds.). (2007). Cognitive diagnostic assessment for
education: Theory and practice. Cambridge: Cambridge University Press.
Leki, I. (1991). The preferences of ESL students for error correction in college level
writing classes. Foreign Language Annals, 24, 203-218.
Leki, I. (2006). “You cannot ignore”: L2 graduate students’ response to discipline-based
written feedback. In K. Hyland & F. Hyland (Eds.), Feedback in second
language writing: Contexts and issues (pp. 266-285). New York: Cambridge
University Press.
Leki, I., & Carson, J. G. (1994). Students’ perceptions of EAP writing instruction and
writing needs across the disciplines. TESOL Quarterly, 28, 81-101.
Leki, I., Cumming, A., & Silva, T. (2008). A synthesis of research on second language
writing in English. New York, NY: Routledge.
Leśniewska, J. (2006). Collocations and second language use. Studia Linguistica, 123,
95-105.
Linacre, J. M. (2009). A user’s guide to Facets: Rasch-model computer programs
(Version 3.66.0) [Computer software and manual]. Retrieved October 21, 2009,
from www.winsteps.com.
Linnarud, M. (1986). Lexis in composition: A performance analysis of Swedish learners’
written English. Malmö: CWK Gleerup.
Liskin-Gasparro, J. (1984). The ACTFL guidelines: A historical perspective. In T. V.
Higgs (Ed.), Teaching for proficiency: The organizing principle (pp. 11-42).
Lincolnwood, IL: National Textbook.
Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper & L. Odell (Eds.),
Evaluating writing (pp. 33-66). New York: National Council of Teachers of
English.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really
mean to the raters? Language Testing, 19, 246-276.
Lumley, T. (2005). Assessing second language writing: The raters’ perspective. Frankfurt:
Peter Lang.
Lunz, M. E., & Stahl, J. A. (1990). Judge severity and consistency across grading periods.
Evaluation and the Health Professions, 13, 425-444.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Lynch, B. K. (2001). Rethinking assessment from a critical perspective. Language
Testing, 18, 351-372.
Matthews, M. (1990). The measurement of productive skills: Doubts concerning the
assessment criteria of certain public examinations. ELT Journal, 44, 117-121.
McCarthy, M. (1990). Vocabulary. Oxford: Oxford University Press.
McClure, E. (1991). A comparison of lexical strategies in L1 and L2 written English
narratives. Pragmatics and Language Learning, 2, 141-154.
McCulley, G. A. (1985). Writing quality, coherence, and cohesion. Research in the
Teaching of English, 19, 269-282.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp.
13-103). New York: American Council on Education and Macmillan.
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision making
behavior of composition markers. In M. Milanovic & N. Saville (Eds.), Studies in
language testing 3: Performance testing, cognition and assessment (pp. 92-111).
Cambridge: Cambridge University Press.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-
based language assessment. Language Testing, 19, 477-496.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational
assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.
Monroe, J. H. (1975). Measuring and enhancing syntactic fluency in French. The French
Review, 48, 1023-1031.
Mullen, K. A. (1977). Using rater judgements in the evaluation of writing proficiency for
nonnative speakers of English. In H. D. Brown, C. A. Yorio, & R. H. Crymes
(Eds.), On TESOL 77: Teaching and learning English as a second language:
Trends in research and practice (pp. 309-320). Washington, D.C.: TESOL.
Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability within the test
of spoken English Assessment System (Research Report No. 00-06). Princeton,
NJ: Educational Testing Service, Center for Performance Assessment.
Myford, C. M., & Wolfe, E. W. (2004a). Detecting and measuring rater effects using
many-facet Rasch measurement: Part I. In E. V. Smith, Jr., & R. M. Smith (Eds.),
Introduction to Rasch measurement (pp. 460-517). Maple Grove, MN: JAM
Press.
Myford, C. M., & Wolfe, E. W. (2004b). Detecting and measuring rater effects using
many-facet Rasch measurement: Part II. In E. V. Smith, Jr., & R. M. Smith (Eds.),
Introduction to Rasch measurement (pp. 518-574). Maple Grove, MN: JAM
Press.
Nas, G. (1975). Determining the communicative value of written discourse produced by
L2 learners. Utrecht, The Netherlands: Institute of Applied Linguistics.
Neuner, J. L. (1987). Cohesive ties and chains in good and poor freshman essays.
Research in the Teaching of English, 21, 92-105.
Nichols, P. D. (1994). A framework for developing cognitively diagnostic assessments.
Review of Educational Research, 64, 575-603.
Nichols, P. D., Chipman, S. F., & Brennan, R. L. (Eds.). (1995). Cognitively diagnostic
assessment. Hillsdale, NJ: Lawrence Erlbaum.
North, B. (1993). The development of descriptors on scales of language proficiency.
Washington, DC: National Foreign Language Center.
North, B. (1994). Scales of language proficiency: A survey of some existing systems.
Strasbourg: Council of Europe.
North, B. (1995). The development of a common framework scale of descriptors of
language proficiency based on a theory of measurement. System, 23, 445-465.
North, B. (1996). The development of a common framework scale of descriptors of
language proficiency based on a theory of measurement. Unpublished doctoral
dissertation, Thames Valley University.
North, B. (2000). The development of a common framework scale of language
proficiency. Oxford: Peter Lang.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales.
Language Testing, 15, 217-263.
Omaggio Hadley, A. (1993). Teaching language in context (2nd ed.). Boston, MA:
Heinle & Heinle.
Pellegrino, J. W., & Chudowsky, N. (2003). The foundations of assessment.
Measurement: Interdisciplinary Research and Perspectives, 1, 103-148.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The
science and design of educational assessment. Washington, DC: National
Academy Press.
Perkins, K. (1980). Using objective methods of attained writing proficiency to
discriminate among holistic evaluations. TESOL Quarterly, 14, 61-69.
Perkins, K. (1983). On the use of composition scoring techniques, objective measures,
and objective tests to evaluate ESL writing ability. TESOL Quarterly, 17, 651-
671.
Péry-Woodley, M-P. (1991). Writing in L1 and L2: Analysing and evaluating learners’
texts. Language Teaching, 24, 69-83.
Pienemann, M., Johnston, M., & Brindley, G. (1988). Constructing an acquisition-based
procedure for second language assessment. Studies in Second Language
Acquisition, 10, 217-243.
Polio, C. (1997). Measures of linguistic accuracy in second language writing research.
Language Learning, 47, 101-143.
Polio, C. (2001). Research methodology in second language writing: The case of text-
based studies. In T. Silva & P. Matsuda (Eds.), On second language writing (pp.
91-116). Mahwah, NJ: Erlbaum.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic
& N. Saville (Eds.), Studies in language testing 3: Performance testing, cognition
and assessment (pp. 74-91). Cambridge: Cambridge University Press.
QSR. (2008). NVivo 8: Getting started. QSR International.
Raimes, A. (1985). What unskilled ESL students do as they write: A classroom study of
composing. TESOL Quarterly, 19, 229-258.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that
measure more than one dimension. Applied Psychological Measurement, 15,
361-373.
Reid, J. (1992). A computer text analysis of four cohesion devices in English discourse
by native and nonnative writers. Journal of Second Language Writing, 1, 79-107.
Roussos, L. A., Stout, W., & Marden, J. (1998). Using new proximity measures with
hierarchical cluster analysis to detect multidimensionality. Journal of
Educational Measurement, 35, 1-30.
Roussos, L. A., Templin, J. L., & Henson, R. A. (2007a). Skills diagnosis using IRT-
based latent class models. Journal of Educational Measurement, 44, 293-311.
Roussos, L. A., DiBello, L. V., Stout, W., Hartz, S. M., Henson, R. A., & Templin, J. L.
(2007b). The fusion model skills diagnosis system. In J. P. Leighton & M. J.
Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and practice
(pp. 275-318). Cambridge: Cambridge University Press.
Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How
raters evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in
language assessment: Selected papers from the 19th Language Testing Research
Colloquium, Orlando, Florida (pp. 129-152). Cambridge: Cambridge University
Press.
Sawaki, Y., Kim, H-J., & Gentile, C. (2009). Q-Matrix construction: Defining the link
between constructs and test items in large-scale reading and listening
comprehension assessments. Language Assessment Quarterly, 6, 190-209.
Scardamalia, M., & Bereiter, C. (1987). Knowledge telling and knowledge transforming
in written composition. In S. Rosenberg (Ed.), Advances in applied
psycholinguistics, Volume 2: Reading, writing, and language learning (pp. 142-
175). Cambridge: Cambridge University Press.
Schneider, M., & Connor, U. (1990). Analyzing topical structure in ESL essays. Studies
in Second Language Acquisition, 12, 411-427.
Shaw, P., & Liu, E. (1998). What develops in the development of second-language
writing? Applied Linguistics, 19, 225-254.
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of
research in education (pp. 405-450). Washington, DC: American Educational
Research Association.
Shohamy, E. (1992). Beyond proficiency testing: A diagnostic feedback testing model for
assessing foreign language learning. The Modern Language Journal, 76, 513-
521.
Silva, T. (1990). Second language composition instruction: Developments, issues, and
directions in ESL. In B. Kroll (Ed.), Second language writing (pp. 11-23).
Cambridge: Cambridge University Press.
Silva, T. (1992). L1 vs L2 writing: ESL graduate students’ perceptions. TESL Canada
Journal, 10, 27-47.
Smith, D. (2000). Rater judgments in the direct assessment of competency-based second
language writing ability. In G. Brindley (Ed.), Studies in immigrant English
language assessment (pp. 159-189). Sydney, Australia: National Centre for
English Language Teaching and Research, Macquarie University.
Smith, P. C., & Kendall, J. M. (1963). Retranslation of expectations: An approach to the
construction of unambiguous anchors for rating scales. Journal of Applied
Psychology, 47, 149-155.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for
educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd
ed., pp. 263-332). New York: Macmillan.
Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic
relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Sperber, D., & Wilson, D. (1986). Relevance. Cambridge, MA: Harvard University Press.
Sperling, M. (1996). Revisiting the writing-speaking connection: Challenges for research
on writing and writing instruction. Review of Educational Research, 66, 53-86.
Spolsky, B. (1990). Social aspects of individual assessment. In J. de Jong & D. K.
Stevenson (Eds.), Individualizing the assessment of language abilities (pp. 3-15).
Avon: Multilingual Matters.
Spolsky, B. (1992). The gentle art of diagnostic testing revisited. In E. Shohamy & A. R.
Walton (Eds.), Language assessment for feedback: Testing and other strategies
(pp. 29-41). Dubuque, IA: Kendall/Hunt.
Stout, W., Froelich, A., & Gao, F. (2001). Using resampling methods to produce an
improved DIMTEST procedure. In A. Boomsma, M. A. J. van Duijn, & T. A. B.
Snijders (Eds.), Essays on item response theory (pp. 357-376). New York:
Springer-Verlag.
Straub, R. (1997). Students’ reactions to teacher comments: An exploratory study.
Research in the Teaching of English, 31, 91-119.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge:
Cambridge University Press.
Sympson, J. B. (1977). A model for testing with multidimensional items. In D. J. Weiss
(Ed.), Proceedings of the 1977 computerized adaptive testing conference (pp. 82-
88). Minneapolis: University of Minnesota, Department of Psychology,
Psychometric Methods Program.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based
on item response theory. Journal of Educational Measurement, 20, 345-354.
Tatsuoka, K. K. (1990). Toward an integration of item-response theory and cognitive
error diagnosis. In N. Frederiksen, R. L. Glaser, A. M. Lesgold, & M. G. Shafto
(Eds.), Diagnostic monitoring of skills and knowledge acquisition (pp. 453-488).
Hillsdale, NJ: Erlbaum.
Tatsuoka, K. K. (1993). Item construction and psychometric models appropriate for
constructed responses. In R. E. Bennett & W. C. Ward (Eds.), Construction
versus choice in cognitive measurement (pp. 107-133). Hillsdale, NJ: Erlbaum.
Tatsuoka, K. K. (1995). Architecture of knowledge structures and cognitive diagnosis: A
statistical pattern recognition and classification approach. In P. D. Nichols, S. F.
Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 327-
359). Hillsdale, NJ: Erlbaum.
Taylor, J. (1993). Prepositions: Patterns of polysemization and strategies of
disambiguation. In C. Zelinsky-Wibbelt (Ed.), Natural language processing (pp.
151-175). The Hague: Mouton de Gruyter.
Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using
cognitive diagnosis models. Psychological Methods, 11, 287-305.
Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago
Press.
Tierney, R., & Mosenthal, J. (1983). Cohesion and textual coherence. Research in the
Teaching of English, 17, 215-229.
Toulmin, S. E. (2003). The uses of argument (Updated ed.). Cambridge: Cambridge
University Press.
Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying
salient features for second language performance assessment. The Canadian
Modern Language Review, 56, 555-584.
Turner, C. E., & Upshur, J. A. (1996). Developing rating scales for the assessment of
second language performance. In G. Wigglesworth & C. Elder (Eds.), The
language testing cycle: From inception to washback (pp. 55-79). Melbourne:
Australian Review of Applied Linguistics.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples:
Effects of the scale maker and the student sample on scale content and student
scores. TESOL Quarterly, 36, 49-70.
Underhill, N. (1987). Testing spoken language: A handbook of oral testing techniques.
Cambridge: Cambridge University Press.
University of Cambridge, British Council, & IELTS Australia. (2007). IELTS handbook
2007. Cambridge, UK: Authors.
University of Michigan. (2003). Michigan English Language Assessment Battery:
Technical manual 2003. Ann Arbor, MI: English Language Institute, Testing and
Certification Division, University of Michigan.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language
tests. ELT Journal, 49, 3-12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language
speaking ability: Test method and learner discourse. Language Testing, 16, 82-
111.
Vande Kopple, W. J. (1985). Some exploratory discourse on metadiscourse. College
Composition and Communication, 36, 82-93.
Vann, R. J. (1979). Oral and written syntactic relationships in second language learning.
In C. Yorio, K. Perkins, & J. Schachter (Eds.), On TESOL 79: The learner in
focus (pp. 322-329). Washington, DC: TESOL.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-
Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111-
125). Norwood, New Jersey: Ablex Publishing Corporation.
von Davier, M. (2005). A general diagnostic model applied to language testing data
(Research Report RR-05-16). Princeton, NJ: Educational Testing Service.
Waller, T. (1993). Characteristics of near-native proficiency in writing. In H. Ringbom
(Ed.), Near-native proficiency in English (pp. 183-293). Åbo: Åbo Akademi
University.
Watanabe, Y. (2004). Methodology in washback studies. In L. Cheng, Y. Watanabe, & A.
Curtis (Eds.), Washback in language testing: Research contexts and methods (pp.
19-36). Mahwah, NJ: Lawrence Erlbaum Associates.
Watson Todd, R. (1998). Topic-based analysis of classroom discourse. System, 26, 303-
318.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring the coherence
of writing using topic-based analysis. Assessing Writing, 9, 85-104.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-Bass.
Witte, S. (1983a). Topical structure analysis and revision: An exploratory study. College
Composition and Communication, 34, 313-341.
Witte, S. (1983b). Topical structure and writing quality: Some possible text-based
explanation of readers’ judgments of student writing. Visible Language, 17, 177-
205.
Witte, S., & Faigley, L. (1981). Cohesion, coherence and writing quality. College
Composition and Communication, 32, 189-204.
Wolfe, E. W., Chiu, C. W. T., & Myford, C. M. (1999). The manifestation of common
rater effects in multi-faceted Rasch analyses (Monograph Series No. 97-20).
Princeton, NJ: Educational Testing Service, Center for Performance Assessment.
Wolfe-Quintero, K., Inagaki, S., & Kim, H-Y. (1998). Second language development in
writing: Measures of fluency, accuracy and complexity (Technical Report No. 17).
Honolulu, HI: University of Hawai’i Press.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch
Measurement: Transactions of the Rasch Measurement SIG, 8, 370.
Yamamoto, K., & Gitomer, D. (1993). Application of a HYBRID model to a test of
cognitive skill representation. In N. Frederiksen, R. Mislevy, & I. Bejar (Eds.),
Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Yule, G. (1985). The study of language. Cambridge: Cambridge University Press.
Zhang, J., & Stout, W. (1999). The theoretical DETECT index of dimensionality and its
application to approximate simple structure. Psychometrika, 64, 213-249.
Zhang, S. (1995). Re-examining the affective advantages of peer feedback in the ESL
writing class. Journal of Second Language Writing, 4, 209-222.
APPENDIX A
DEFINITIONS OF KEY TERMS
Analytic scoring: a type of marking procedure in which raters award separate
subscores to diverse features of test performance.
Classical test theory (CTT): a measurement theory that treats an examinee’s
observed test score as the sum of a true score and random measurement error; the true
score would be obtained if no measurement error existed.
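The true-score decomposition underlying CTT can be illustrated with a short simulation. This is a minimal sketch with invented score distributions, not data from this study; the variable names and numbers are purely illustrative.

```python
import random

random.seed(42)

# CTT decomposition: observed score X = true score T + random error E.
# The distributions below are invented for illustration only.
true_scores = [random.gauss(50, 10) for _ in range(1000)]
errors = [random.gauss(0, 4) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

# Reliability: the proportion of observed-score variance attributable to
# true scores; in theory var(T) / (var(T) + var(E)) = 100 / 116, about 0.86 here.
reliability = variance(true_scores) / variance(observed)
```

Under this model, reducing measurement error (a smaller error standard deviation) pushes the reliability coefficient toward 1.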
Cognitive Diagnostic Assessment (CDA): a type of assessment that measures the
specific cognitive knowledge structures of examinees in order to provide detailed
diagnostic information about their strengths and weaknesses.
Coherence: an aspect of discourse competence associated with organizing ideas
in spoken or written text.
Cohesion: an aspect of discourse competence associated with explicit linguistic
cues of semantic relationships in spoken or written text.
Construct: the ability or trait to be measured by a test.
DIALANG: an online diagnostic language assessment system that assesses five
aspects of language knowledge (reading, listening, writing, grammar and vocabulary) in
14 European languages.
Dimensionality: a measurement property of item responses derived from a test.
Discourse analysis: a linguistic approach to analyzing spoken or written text.
Formative assessment: a type of assessment intended to provide students and
teachers with immediate feedback that can improve teaching and learning during the
period of instruction.
Holistic scoring: a type of marking procedure in which raters award a single
composite score to the overall quality of test performance.
Item response theory (IRT): a measurement theory that estimates examinees’
latent ability from their responses to individual test items.
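Under the one-parameter (Rasch) IRT model, for instance, the probability of a correct response depends only on the difference between an examinee’s ability and the item’s difficulty. The sketch below is illustrative; the function name and the example ability and difficulty values are assumptions, not values from this study.

```python
import math

def rasch_probability(theta, b):
    """P(correct response) for an examinee of ability theta on an item of
    difficulty b, under the one-parameter (Rasch) IRT model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals item difficulty, the probability is exactly 0.5;
# it rises toward 1 as ability exceeds difficulty.
p_match = rasch_probability(0.0, 0.0)   # 0.5
p_above = rasch_probability(1.5, 0.0)   # roughly 0.82
```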
Objective measures: a means of quantifying observable characteristics or
qualities of speaking or writing performance by tallying the frequencies or calculating the
ratios of certain linguistic features that occur in a spoken or written corpus.
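One widely used objective measure of lexical diversity, the type-token ratio, can be computed as below. The simple tokenizer and the sample sentence are illustrative assumptions, not the specific measures used in this study.

```python
import re

def type_token_ratio(text):
    """Distinct word forms (types) divided by total words (tokens):
    a simple objective measure of lexical diversity."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

# 12 tokens, 9 distinct types -> ratio of 0.75.
sample = "The writer repeats the same words because the writer is a beginner"
ttr = type_token_ratio(sample)
```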
Organizational knowledge: an ability to control the structure of spoken or written
text using grammatical and textual knowledge.
Primary trait scoring: a type of marking procedure in which raters focus on the
particular writing traits in a writing task that are considered important within a specific
context.
Profiling: a way of reporting test results in the form of comprehensive and
accessible descriptions.
Q-matrix: an incidence matrix that represents the relationship between skills and
items in a test.
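A Q-matrix can be represented as a simple binary matrix, with one row per item and one column per skill. The skills and item-skill assignments below are hypothetical, chosen only to show the structure.

```python
# Hypothetical Q-matrix: rows are items, columns are skills;
# a 1 means the item requires that skill.
skills = ["content", "organization", "language use"]
q_matrix = [
    [1, 0, 0],  # item 1 requires content only
    [1, 1, 0],  # item 2 requires content and organization
    [0, 1, 1],  # item 3 requires organization and language use
    [0, 0, 1],  # item 4 requires language use only
]

def skills_for_item(q_row):
    """List the skills an item is assumed to require."""
    return [skill for skill, needed in zip(skills, q_row) if needed]
```

For example, `skills_for_item(q_matrix[1])` returns `['content', 'organization']`, the skills posited for the second item.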
Skill: an aspect of underlying ability characterizing the construct to be measured.
Sociolinguistic knowledge: the ability to produce and interpret spoken and
written text that is appropriate in a particular language use context.
Summative assessment: a type of assessment intended to provide assessment
outcomes to internal and external stakeholders at the end of a period of instruction, for
accountability purposes.
Test of English as a Foreign Language (TOEFL): a standardized English
proficiency test intended to assess examinees’ ability to communicate effectively in
English in an academic context.
Think-aloud protocol: a research method involving participants thinking aloud
while they are performing a given task.
Washback: the positive or negative effect of testing on instruction.
APPENDIX B
ESL TEACHER PROFILE
| Name | Age | Gender | Postgrad studies | Prof. certificate | ESL teaching (yrs) | ESL writing teaching (yrs) | Familiar with ESL writing | Training in ESL writing assessment | Competent in assessing ESL writing | Assessment experience (yrs) | Study participation (phase) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Angelina | 30-39 | F | None | TESL | 7 | 6 | Extremely | No | Very | 5 | 2 (P, M)* |
| Ann | above 50 | F | MA | TESL | 25 | 25 | Extremely | Yes | Very | 9 | 1 & 2 (P, M) |
| Beth | 40-49 | F | None | TESL | 15 | 15 | Extremely | Yes | Very | 4 | 1 & 2 (P, M) |
| Brad | 30-39 | M | MA | None | 10 | 7 | Very | Yes | Very | 3 | 2 (P, M) |
| Erin | 30-39 | F | MA | TESL | 10 | 10 | Extremely | Yes | Extremely | 5 | 2 (M) |
| Esther | 40-49 | F | MA | TESL | 7 | 5 | Extremely | Yes | Very | 2 | 1 & 2 (P) |
| George | 40-49 | M | PhD | TESL | 14 | 14 | Very | No | Very | 10 | 1 |
| Greg | 30-39 | M | MA | TESL | 3 | 2 | Extremely | No | Extremely | 1 | 2 (M) |
| James | 30-39 | M | MA | None | 5 | 2 | Very | No | Very | 1 | 1 |
| Judy | 40-49 | F | MA | None | 10 | 8 | Extremely | Yes | Very | 8 | 1 |
| Kara | 40-49 | F | None | TESL | 11 | 3 | Extremely | Yes | Very | 3 | 2 (M) |
| Sarah | 40-49 | F | MA | TESL | 12 | 5 | Extremely | Yes | Very | 2 | 1 & 2 (M) |
| Shelley | above 50 | F | MA | TESL | 7 | 2 | Extremely | Yes | Extremely | 2 | 1 |
| Susan | 40-49 | F | MA | TESL | 12 | 10 | Extremely | Yes | Very | 6 | 2 (P, M) |
| Tim | above 50 | M | None | TESL | 6 | 6 | Extremely | Yes | Extremely | 4 | 1 |
| Tom | 30-39 | M | None | TESL | 9 | 9 | Extremely | No | Very | 2 | 2 (P, M) |
Note. “P” refers to the pilot study, while “M” refers to the main study.
APPENDIX C
GUIDELINES FOR A THINK-ALOUD SESSION
Warm-up
Thank you for your interest in this study. I am conducting a study that examines
an effective way to provide diagnostic feedback to ESL writers on a timed essay test. Due
to the complex nature of second language writing, ESL learners need to be well informed
about the strengths and weaknesses in their writing. Despite the interest in and need for a
diagnostic approach in ESL writing instruction and assessment, little is known about
what kinds of cognitive and linguistic skills or strategies must be diagnosed, and in what
ways. It is thus critical to have a good understanding of what ESL teachers think while
providing diagnostic feedback on students’ ESL writing. This information will ultimately
enable detailed diagnostic description to be tailored to and made available to individual
ESL writers.
In this session, I would like to gather information about the ways in which you
provide diagnostic feedback on ESL timed essays. In particular, I am interested in the
writing skills and strategies that you attend to while assessing and providing feedback on
ESL essays. I will explain in greater detail what I would like you to do during this
session. I will give you a package of 10 essays and a copy of the essay prompts. These
essays were written in 30 minutes each by adult ESL learners with a wide range of
English proficiency in a large-scale testing setting. The two essay prompts were:
(a) Do you agree or disagree with the following statement? It is more important to
choose to study subjects you are interested in than to choose subjects to prepare
for a job or career. Use specific reasons and examples to support your answer.
(b) Do you agree or disagree with the following statement? In today's world, the
ability to cooperate well with others is far more important than it was in the past.
Use specific reasons and examples to support your answer.
Five essays were written on each prompt. Once you have a general understanding
of the essays and prompts, I would like to ask you to say aloud what you are thinking
as you provide diagnostic feedback on each essay. To facilitate the thinking aloud,
you may want to write something down (e.g., comments or corrections). If you
do so, say aloud what you are writing. You may read the essays silently or aloud,
whichever best suits you. If you are reading silently, indicate which part of the
essay you are reading. I would also like to ask you to assign each essay a mark from 1 to
5, where 5 is the most proficient and 1 is the least proficient. Please say in as much detail
as possible what you are thinking while you provide feedback on the essays. Do you
understand what I want you to do?
Great! The most important thing in doing this task is to think aloud constantly
while you are reading and providing feedback on the essay. I don't want you to plan out
what you are going to say or explain to me what you are saying. Just act as if you are
alone in the room speaking to yourself. I would like to emphasize that it is important for
you to keep talking without a long interval of silence. If you are silent for any length of
time, I will remind you to keep thinking aloud. Is this clear to you?
Before we proceed with the main task, I will give you a practice exercise to help
you familiarize yourself with the think-aloud procedure. I would like you to multiply
these two numbers and tell me what you are thinking to produce an answer.
“What is the result of 12 x 21?”
Good! Now, I am going to give you this package. In it, you will find 10 essays
and a copy of the two essay prompts. (Give the material to the teacher.) Do you have any
questions about the procedure before you begin?
Intervention Prompts
What are some of the issues that come to your mind regarding this essay?
Okay, now tell me what you are thinking as you are reading and providing feedback
on the essay.
What other things are you thinking? (Without any intervention, let the teacher finish
thinking aloud.)
Keep talking.
While you were thinking aloud, you said XXX. Can you elaborate a bit more on your
thought process?
Follow-up Interview Questions
Did you have any problems thinking aloud?
What skills or strategies do you think are important in ESL writing?
How have you provided your students with feedback on their ESL writing?
What skills or strategies do you teach to improve students' ESL writing proficiency?
Background Questionnaire
Before closing this session, I would like to collect your background information.
Your answers to this questionnaire will help me better understand your teaching and
evaluation methods in ESL academic writing. All information will remain confidential,
and will be used for research purposes only. Do you have any questions? (Give the
questionnaire to the teacher.)
I. Personal Profile
1. Age: 20 – 29 30 – 39 40 – 49 above 50
2. Gender: Male Female
3. First language(s):
If your first language is not English, please specify the other language(s) you speak at
home and at the workplace:
(a) at home: (b) at the workplace:
4. Educational background: (Please specify subject areas)
B.A. in
M.A. in
Ph.D. in
Professional Certificate in
Other training related to assessment and ESL writing
5. Current professional position:
II. Professional Teaching Experience
6. How many years have you taught ESL to non-native English speakers?
7. In what type of language institute have you taught ESL?
Private language institute
College/University-bound language institute
College/University
Other, Specify:
8. Please specify course titles you have taught in the past or that you currently teach:
9. How many years have you taught ESL writing or ESL academic writing to non-native
English speakers?
(a) ESL writing: (b) ESL academic writing:
10. Do you have professional writing experience other than teaching ESL (academic)
writing courses?
If yes, please specify:
III. Evaluation of ESL Academic Writing
11. How familiar are you with the written English of non-native English speakers?
A little Quite Very Extremely
12. How competent are you in assessing the academic compositions of non-native
English speakers?
A little Quite Very Extremely
13. Have you ever been trained as an assessor of ESL academic writing?
Yes No
If yes, specify the year(s) that you received training and the number of training hours
completed (i.e., dates):
14. How many years and in what context have you assessed ESL academic writing?
15. If you have assessment experiences that might have influenced your assessment in
this study, please specify:
Closing Statement
Your think-aloud reports and answers to the interview/questionnaire have provided
valuable information about what writing skills or strategies need to be diagnosed in ESL
academic writing. Thank you so much for your interest and participation in this study.
APPENDIX D
TEACHER QUESTIONNAIRE
APPENDIX D-1
TEACHER QUESTIONNAIRE I (FOR THE PILOT STUDY)
Your answers to the following questions will help me better understand your evaluations
of the EDD checklist. All information will remain confidential, and will be used for
research purposes only.
I. Personal Profile
1. Age: 20 – 29 30 – 39 40 – 49 above 50
2. Gender: Male Female
3. First language(s):
If your first language is not English, please specify the other language(s) you speak at
home and at the workplace:
(a) at home: (b) at the workplace:
4. Educational background: (Please specify subject areas)
B.A. in
M.A. in
Ph.D. in
Professional Certificate in
Other training related to assessment and ESL writing
5. Current professional position:
II. Professional Teaching Experience
6. How many years have you taught ESL to non-native English speakers?
7. In what type of language institute have you taught ESL?
Private language institute
College/University-bound language institute
College/University
Other, Specify:
8. Please specify course titles you have taught in the past or that you currently teach:
9. How many years have you taught ESL writing or ESL academic writing to non-native
English speakers?
(a) ESL writing: (b) ESL academic writing:
10. Do you have professional writing experience other than teaching ESL (academic)
writing courses?
If yes, please specify:
III. Evaluation of ESL Academic Writing
11. How familiar are you with the written English of non-native English speakers?
A little Quite Very Extremely
12. How competent are you in assessing the academic compositions of non-native
English speakers?
A little Quite Very Extremely
13. Have you ever been trained as an assessor of ESL academic writing?
Yes No
If yes, specify the year(s) that you received training and the number of training hours
completed (i.e., dates):
14. How many years and in what context have you assessed ESL academic writing?
IV. Evaluation of the EDD Checklist
15. When you marked the given essays, how many times did you read them, on average?
Once Twice Three times More than three times
16. How much did you like the EDD checklist when marking the essays?
A little Quite Very Extremely
17. Were the EDD descriptors clearly understood?
Yes No
If no, specify descriptors that were ambiguous or not clearly understood:
18. Were the EDD descriptors redundant?
Yes No
If yes, specify descriptors that were redundant:
19. Were the EDD descriptors useful?
Yes No
If no, specify descriptors that were useless:
20. Were the EDD descriptors relevant to ESL academic writing?
Yes No
If no, specify descriptors that were irrelevant to ESL academic writing:
21. Do you think that the EDD checklist is comprehensive enough to capture all
instances of ESL academic writing?
Yes No
If no, specify areas that the EDD checklist does not describe:
22. Was the EDD checklist conducive to making a binary choice?
Yes No
If no, explain why it was not easy to make a binary choice with the EDD checklist:
23. Were there particular descriptors that you think are most or least important in
developing students' ESL academic writing?
24. What do you think are the EDD checklist's strengths?
25. What do you think are the EDD checklist's weaknesses?
26. Do you think that the EDD checklist provides useful diagnostic information about the
strengths and weaknesses of students' ESL academic writing?
Thank you for your time!
APPENDIX D-2
TEACHER QUESTIONNAIRE II (FOR THE MAIN STUDY)
Your answers to the following questions will help me better understand your evaluations
of the EDD checklist. All information will remain confidential, and will be used for
research purposes only.
I. Personal Profile
1. Age: 20 – 29 30 – 39 40 – 49 above 50
2. Gender: Male Female
3. First language(s):
If your first language is not English, please specify the other language(s) you speak at
home and at the workplace:
(a) at home: (b) at the workplace:
4. Educational background: (Please specify subject areas)
B.A. in
M.A. in
Ph.D. in
Professional Certificate in
Other training related to assessment and ESL writing
5. Current professional position:
II. Professional Teaching Experience
6. How many years have you taught ESL to non-native English speakers?
7. In what type of language institute have you taught ESL?
Private language institute
College/University-bound language institute
College/University
Other, Specify:
8. Please specify course titles you have taught in the past or that you currently teach:
9. How many years have you taught ESL writing or ESL academic writing to non-native
English speakers?
(a) ESL writing: (b) ESL academic writing:
10. Do you have professional writing experience other than teaching ESL (academic)
writing courses?
If yes, please specify:
III. Evaluation of ESL Academic Writing
11. How familiar are you with the written English of non-native English speakers?
A little Quite Very Extremely
12. How competent are you in assessing the academic compositions of non-native
English speakers?
A little Quite Very Extremely
13. Have you ever been trained as an assessor of ESL academic writing?
Yes No
If yes, specify the year(s) that you received training and the number of training hours
completed (i.e., dates):
14. How many years and in what context have you assessed ESL academic writing?
15. What tools do you usually use to evaluate ESL academic writing?
Anecdotal notes (use word descriptions)
Checklists
Rating scales
Marks, scores (use numbers)
Other, specify:
16. Have you ever used a rating scale to evaluate ESL academic writing in your
classroom evaluation?
Yes No
If yes, what kind of rating scale have you used?
Holistic rating scales
Analytic rating scales
Empirical rating scales
Other, specify:
If yes, how did you like the rating scale?
17. What evaluation methods do you usually use to diagnose your students' ESL
academic writing?
18. How often have you diagnosed your students' progress in ESL academic writing?
Once a week Once every two weeks
Once a month Once a term
19. Do you consider diagnostic results when you teach?
Yes No
If no, explain why you do not consider diagnostic results when you teach.
IV. Evaluation of the EDD Checklist
20. When you marked the given essays, how many times did you read them, on average?
Once Twice Three times More than three times
21. How much did you like the EDD checklist when marking the essays?
A little Quite Very Extremely
22. How clearly did you understand the EDD descriptors?
A little Quite Very Extremely
If there were descriptors that were ambiguous or not clearly understood, please
specify:
23. To what extent were the EDD descriptors redundant?
No A little Quite Very
If there were descriptors that were redundant, please specify:
24. How useful were the EDD descriptors?
A little Quite Very Extremely
If there were descriptors that were useless, please specify:
25. How relevant were the EDD descriptors to ESL academic writing?
A little Quite Very Extremely
If there were descriptors that were irrelevant to ESL academic writing, please specify:
26. How comprehensive is the EDD checklist in capturing instances of ESL
academic writing?
A little Quite Very Extremely
If there are areas that the EDD checklist does not describe, please specify:
27. How conducive was the EDD checklist to making a binary choice?
A little Quite Very Extremely
If it was difficult to make a binary choice with the EDD checklist, please explain why:
28. Were there particular descriptors that you think are most or least important in
developing students' ESL academic writing?
Thank you for your time!
APPENDIX E
GUIDING INTERVIEW QUESTIONS FOR TEACHERS
APPENDIX E-1
GUIDING INTERVIEW QUESTIONS FOR TEACHERS (FOR THE PILOT STUDY)
1. Were the EDD descriptors clearly understood?
2. Were the EDD descriptors redundant?
3. Were the EDD descriptors useful?
4. Were the EDD descriptors relevant to ESL academic writing?
5. Do you think that the EDD checklist is comprehensive enough to capture all instances
of ESL academic writing?
6. Was the EDD checklist conducive to making a binary choice?
7. Were there particular descriptors that you think are most or least important in
developing students' ESL academic writing?
8. What do you think are the EDD checklist's strengths?
9. What do you think are the EDD checklist's weaknesses?
10. Do you think that the EDD checklist provides useful diagnostic information about the
strengths and weaknesses of students' ESL academic writing?
11. Would you elaborate on the reasons why you liked (or did not like) the EDD
checklist when assessing essays?
APPENDIX E-2
GUIDING INTERVIEW QUESTIONS FOR TEACHERS (FOR THE MAIN STUDY)
1. Why did you find the EDD checklist useful (or useless)?
2. Why do you think that the EDD checklist provides (or does not provide) useful
diagnostic information about the strengths and weaknesses of students' ESL academic
writing?
3. In what ways do you think that the diagnostic information provided by the EDD
checklist will (or will not) be useful for classroom instruction and assessment?
4. In what ways do you think that the diagnostic information provided by the EDD
checklist will (or will not) improve the way you teach ESL academic writing?
5. If you have any positive or negative comments about the use of the EDD checklist,
please tell me.
APPENDIX F
TEXTUAL CHARACTERISTICS OF THE THREE ESSAY SETS
Table F-1
Characteristics of Essay Set 1
Essay number | ETS score | No. of paragraphs | No. of words | No. of word types | K1 words (%) | K2 words (%) | AWL words (%) | Lexical density
1015 3 10 344 115 91.57 2.62 2.33 0.39
1025 3 4 396 153 87.88 4.55 5.56 0.49
1034 2 4 156 94 93.59 1.92 1.92 0.39
1128 4 4 383 200 86.42 2.87 4.44 0.50
1176 5 6 448 218 87.05 4.02 6.25 0.52
2013 2 4 332 122 94.88 1.20 3.01 0.44
2045 3 4 322 129 89.13 2.48 5.59 0.50
2063 3 4 306 138 92.16 2.29 3.59 0.38
2124 5 5 586 232 83.62 3.07 8.53 0.52
2236 4 2 553 241 86.26 3.98 6.15 0.47
Mean 3.4 4.7 382.6 164.2 89.26 2.9 4.74 0.46
Table F-2
Characteristics of Essay Set 2
Essay number | ETS score | No. of paragraphs | No. of words | No. of word types | K1 words (%) | K2 words (%) | AWL words (%) | Lexical density
1030 3 5 299 142 94.31 2.34 1.34 0.43
1057 2 4 173 102 84.39 5.20 6.94 0.51
1112 4 6 518 229 84.17 3.86 7.92 0.46
1135 4 4 547 205 89.76 2.38 5.48 0.46
1160 5 3 421 230 85.51 3.56 7.60 0.48
2019 2 11 269 134 84.39 8.18 2.97 0.49
2074 4 8 358 199 73.46 3.07 13.69 0.55
2078 3 5 332 148 91.87 4.22 1.20 0.45
2122 4 4 363 163 87.33 3.86 7.71 0.52
2134 5 5 510 275 76.86 2.75 6.67 0.55
Mean 3.6 5.5 379 182.7 85.21 3.94 6.15 0.49
Table F-3
Characteristics of Essay Set 3
Essay number | ETS score | No. of paragraphs | No. of words | No. of word types | K1 words (%) | K2 words (%) | AWL words (%) | Lexical density
1001 1 4 65 44 87.69 3.08 4.62 0.45
1004 3 3 268 100 92.16 1.49 2.99 0.42
1152 4 1 298 155 88.59 2.35 6.38 0.49
1183 5 5 367 198 88.56 2.18 4.63 0.51
1205 2 5 212 92 87.74 0.47 7.08 0.47
2039 2 3 220 112 92.27 3.64 4.09 0.53
2042 3 4 357 163 89.92 5.04 4.20 0.47
2144 5 5 411 205 82.73 4.38 9.49 0.46
2163 4 4 322 147 86.34 1.86 9.32 0.54
2237 1 2 84 48 90.48 1.19 8.33 0.55
Mean 3.0 3.6 260.4 126.4 88.65 2.57 6.11 0.49
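The statistics reported in Tables F-1 through F-3 (token counts, word types, and lexical density as the proportion of content words) can be illustrated with a short sketch. This is a simplified stand-in, not the study's actual procedure: the tokenizer and the small content-word list below are hypothetical, and the K1/K2/AWL frequency-band percentages would require external word lists that are omitted here.

```python
import re

def text_stats(essay: str, content_words: set[str]) -> dict:
    """Compute token count, type count, and a simple lexical density
    (content words / total tokens) for one essay."""
    tokens = re.findall(r"[a-zA-Z']+", essay.lower())
    types = set(tokens)
    density = sum(1 for t in tokens if t in content_words) / len(tokens)
    return {
        "words": len(tokens),
        "word_types": len(types),
        "lexical_density": round(density, 2),
    }

# Tiny demonstration with a hypothetical content-word list.
CONTENT = {"study", "subjects", "important", "job", "career", "choose"}
sample = "It is important to choose subjects you study for a job or career."
stats = text_stats(sample, CONTENT)
```

Applied to full essays with a proper frequency-band analyzer, this yields figures comparable in form to those tabulated above.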
APPENDIX G
ORDER OF ESSAYS IN EACH SET
Sequence of essays | Essay Set 1 (Ann, Shelley, Sarah) | Essay Set 2 (James, Beth, George) | Essay Set 3 (Judy, Tim, Esther)
1 1025 2063 1015 1135 2078 1160 1205 2237 1152
2 1015 2236 1034 1030 2122 1135 1152 2039 1001
3 1128 2013 1025 1057 2134 1057 1001 2042 1004
4 1034 2124 1128 1112 2074 1112 1183 2163 1205
5 1176 2045 1176 1160 2019 1030 1004 2144 1183
6 2013 1034 2124 2074 1057 2019 2144 1183 2039
7 2124 1128 2045 2122 1160 2134 2042 1152 2144
8 2063 1015 2063 2134 1112 2078 2237 1004 2237
9 2045 1176 2236 2078 1030 2074 2039 1205 2042
10 2236 1025 2013 2019 1135 2122 2163 1001 2163
APPENDIX H
EXCERPTS FROM TEACHER THINK-ALOUD VERBAL TRANSCRIPTS
Teacher Name: George
[Essay # 1160]
So I'm just reading it now, and I've noticed a couple of spelling mistakes that just stick
out -- I'm looking at the organization of it which I normally do, and get the layout of the
ideas, because that's always what I look at first. This student is asking a question, I like
that because it hooks the reader, at the beginning of the introductory paragraph. Ok, I'm
looking at the introductory paragraph and see a couple questions which are good, because
I think they hook the reader. I always advise students to do that but the challenge is, you
know because it's a very short essay, it's only three paragraphs, I don't see any sort of
overriding thesis statement or no main, um, statement, outlining his or her argument as to
what he's going to say, so I see that as a bit of a weakness in this introductory paragraph.
There is a spelling mistake with dilemma… Okay, and I see a big run-on sentence in the
middle of the paragraph that is distracting me. This is one thing I find challenging with
writing is form versus… you know, if the reader gets distracted by your writing, then the
message is lost, that's one thing I would work on with this student, is trying to make the
sentences crisp and clear, often I prefer shorter sentences, I mean, obviously mix the
sentence structure but this is losing me so I would make a note of this middle sentence,
which I just find too long and verbose. This is some good writing. I mean good sentence
structure in here.
I am just reading it again. So, I had a look. So this first paragraph this student has some
good ideas. There's a nice, the writing is quite proficient, sort of clear vocabulary that's
used and a lot of concrete sentence structure and sophisticated phrasing, I love the second
sentence, on the one hand we all know that education serves a social purpose: we study
in order to acquire the skills and knowledge that will help us perform well in the future,
in a working environment-- what I probably recommend this student do is reorganize,
move some of the sentences in the second paragraph into the introductory paragraph to
keep it framed, because the framing of the paper is important, the visual layout of it, so
the introductory paragraph has the hook for the reader he's asking the questions but then
the writer needs to answer those questions quickly and make a statement of his or her
opinion. And I think the point in here is just a matter of reorganizing it, social agents,
third paragraph, sophisticated concept… Okay there's some good thoughts in here, I just
read the last paragraph, I'm thinking, there's some good thoughts, it skips around a little
bit and doesn't have a strong focus. So what I'd recommend because the supporting
sentence here about the French students, the last sentence in the second paragraph, large
number of French students, there is pluralization, too, and massive failures of the exam,
making the job much harder, for the drop-outs, so I think what I'd probably recommend
this writer do, the writing is generally quite good, quite proficient, there's not really
many problems, careers on the last page, a spelling error then, have to choose between,
so the to again -- the feedback I'd want to give this writer is the idea of focus, so to have
a stronger focus and to front-load the writing, the focus is in English academic, English
in general, we tend to front-load ideas, so put the most important idea up front, so I liked
the introductory paragraph as it's written but it needs some summary of what his or her
main point is under that, and I think the raw material is in the second paragraph, and the
third paragraph can be synthesized, so we need a strong thesis statement, synthesized in
the first paragraph, and then, um, then making sure that following -- the two remaining
paragraphs, one main body paragraph and a concluding paragraph, that the supporting
paragraph, the body really has enough support to prove the statement that student writes
in the introductory statement, and then the concluding statement, reinforces what his or
her argument is and leads the readers, yeah you know I agree with that, so I think this
paper what I'd really focus on is form and minor editing also, trying to get the writer to
review his or her work, to look for a series of spelling mistakes, a number of spelling
mistakes -- the grammar is great, and just some minor issues along those lines, I mean the
writing is very proficient and powerful, just needs reorganization of ideas.
Follow-up Interview Questions
Researcher: You said the grammar is great. Can you be a bit more specific?
George: Well it is, the sentences, using complex sentence structure, I shouldn't presume
it's a he but the student is using complex sentence structure, certain multiple, um, sort of
subordinate clauses that a lot of nice connecting phrases like in recent years, for example,
with this in mind, and that shows, this person has that cohesiveness in the writing, they
can connect the material quite easily and make it flow quite nicely, what I probably
suggest is that there was that bit of a run-on sentence in that second paragraph, and this is
where editing comes in to keep the focus clear, once the reader starts getting lost in the
writing in the form, then the message disappears. So what I'd recommend, the writer put
a colon here, anywhere, with semi-colons and colons I'll often see if there's a way to
make it into two sentences, make it more concise and efficient, and clear, but what I can
see, looking through it quickly, the grammar seemed quite sophisticated, and the
grammar accuracy seemed quite good.
[Essay # 1135]
I just read the first paragraph, I noticed that there's -- the writing is interesting because I
just finished reading the more sophisticated writing, and this is also quite good from what
I see the grammar doesn‟t have any major issues, but there are some sort of sentence
structure issues that I think will be important to address like this last question, but when
it comes to address I'd say it's the first one that's important, and that's where I
recommend dividing it into two periods, leaving out the but -- saying when it comes to
this question I think it's hard to say which one is important, period, people should
consider both these. Instead of two things because things is so general, state specifically
what you‟re talking of and use phrases from the prompt. Because it just makes it more
clear, using a word like thing is so vague and make their own choose, here we have sort
of a word form issue, beleif -- spelling mistake -- okay and the subject verb agreement
here, someone totally do'es' not pay, so the second paragraph -- I read the first and
second paragraphs and then I skimmed to see how the paper is organized and looking at
how the paragraphs begin, the first few phrases in the paragraphs, sometimes what I do to
see the organization and layout of the writing, I think the student has some good
techniques, sort of starting out with my dream, it interests the reader, interesting narrative,
personal anecdote, personal, storytelling approach, and then the framing of the second
paragraph to begin with is great in the second place, to sum up, so that shows the reader
the different steps in the argument, first second and the conclusion. So that is a good
sense of the importance of organization but there are some grammatical issues, subject
verb agreement, someone totally does not pay attention… if someone 'is' very interested,
so again someone being singular, seems a bit of a problem, the student has to know that
someone is singular and she/he decides so again subject verb agreement, so we have
grammatical accuracy issues, subject verb agreements, some typos, subjests, from this
case, people can see with interest, people may not do well from subjects one day, he or
she may not go back to his or her interesting things and just the last sentence in the
second paragraph, he or she may go, return, or um, yeah so there's some awkwardness
around that, there's not idiomatic writing -- let me read the second to last paragraph --
since that -- some issue there, although it may appear to be -- so
Okay, there‟s some problems in the second paragraph, some grammatical structural
problems, so what I want to say is I'm also interested in designing, gerund versus
infinitives, so it's not boring to me, now when I have time to kill, I will play soccer with
my friends and some phrases here, finding for designing -- so this I don't really follow, I
presume she or he -- she probably, the job search… so, issues of word forms, I think,
nouns versus verbs, gerunds versus infinitives, choosing a design in course, and the job
search, vocabulary usage, -- so those people can see the job factors may have an
important effect on your choice -- again another typo… another subject verb agreement,
everything 'has' their flaws and benefits just like a coin… interesting, okay so I think the
organization of… this paper is quite good, I think the form of the um, layout of it, is quite
promising, it's a nice introductory paragraph with a brief thesis statement, when it comes
to this question it's hard to say which is important, people should consider both these two
things and carefully make their own chioce. I think that's a nice clear statement, and the
student has some good strategies using these personal experiences to reinforce points,
makes it memorable for the reader, there is some, major, spelling mistakes, throughout…
and there are also some issues with subject verb agreement and some issues with
sentence structure… um, and um, verb forms, and word forms… but I mean they have a
sense of proper sentence structure. What I would recommend with this student is to focus
on these specific areas, like subject verb agreement, word forms, they have the flow of
the writing, it's the accuracy that has to be improved, but I think it's quite good, it just
needs some fine tuning and editing in those areas.
APPENDIX I
THE EDD CHECKLIST
Essay number:
1. This essay answers the question. Yes No
2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.
Yes No
3. This essay is concisely written and contains few redundant ideas or linguistic expressions.
Yes No
4. This essay contains a clear thesis statement. Yes No
5. The main arguments of this essay are strong. Yes No
6. There are enough supporting ideas and examples in this essay. Yes No
7. The supporting ideas and examples in this essay are appropriate and logical.
Yes No
8. The supporting ideas and examples in this essay are specific and detailed.
Yes No
9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.
Yes No
10. Each body paragraph has a clear topic sentence tied to supporting sentences.
Yes No
11. Each paragraph presents one distinct and unified idea. Yes No
12. Each paragraph is connected to the rest of the essay. Yes No
13. Ideas are developed or expanded well throughout each paragraph. Yes No
14. Transition devices are used effectively. Yes No
15. This essay demonstrates syntactic variety, including simple, compound, and complex sentence structures.
Yes No
16. This essay demonstrates an understanding of English word order. Yes No
17. This essay contains few sentence fragments. Yes No
18. This essay contains few run-on sentences or comma splices. Yes No
19. Grammatical or linguistic errors in this essay do not impede comprehension.
Yes No
20. Verb tenses are used appropriately. Yes No
21. There is consistent subject-verb agreement. Yes No
22. Singular and plural nouns are used appropriately. Yes No
23. Prepositions are used appropriately. Yes No
24. Articles are used appropriately. Yes No
25. Pronouns agree with referents. Yes No
26. Sophisticated or advanced vocabulary is used. Yes No
27. A wide range of vocabulary is used. Yes No
28. Vocabulary choices are appropriate for conveying the intended meaning.
Yes No
29. This essay demonstrates facility with appropriate collocations. Yes No
30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.
Yes No
31. Words are spelled correctly. Yes No
32. Punctuation marks are used appropriately. Yes No
33. Capital letters are used appropriately. Yes No
34. This essay contains appropriate indentation. Yes No
35. Appropriate tone and register are used throughout the essay. Yes No
APPENDIX J
THE EDD CHECKLIST WITH CONFIDENCE LEVEL
Essay number:
1. This essay answers the question. Yes No ( %)
2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.
Yes No ( %)
3. This essay is concisely written and contains few redundant ideas or linguistic expressions.
Yes No ( %)
4. This essay contains a clear thesis statement. Yes No ( %)
5. The main arguments of this essay are strong. Yes No ( %)
6. There are enough supporting ideas and examples in this essay.
Yes No ( %)
7. The supporting ideas and examples in this essay are appropriate and logical.
Yes No ( %)
8. The supporting ideas and examples in this essay are specific and detailed.
Yes No ( %)
9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.
Yes No ( %)
10. Each body paragraph has a clear topic sentence tied to supporting sentences.
Yes No ( %)
11. Each paragraph presents one distinct and unified idea. Yes No ( %)
12. Each paragraph is connected to the rest of the essay. Yes No ( %)
13. Ideas are developed or expanded well throughout each paragraph.
Yes No ( %)
14. Transition devices are used effectively. Yes No ( %)
15. This essay demonstrates syntactic variety, including simple, compound, and complex sentence structures.
Yes No ( %)
16. This essay demonstrates an understanding of English word order.
Yes No ( %)
17. This essay contains few sentence fragments. Yes No ( %)
18. This essay contains few run-on sentences or comma splices. Yes No ( %)
19. Grammatical or linguistic errors in this essay do not impede comprehension.
Yes No ( %)
20. Verb tenses are used appropriately. Yes No ( %)
21. There is consistent subject-verb agreement. Yes No ( %)
22. Singular and plural nouns are used appropriately. Yes No ( %)
23. Prepositions are used appropriately. Yes No ( %)
24. Articles are used appropriately. Yes No ( %)
25. Pronouns agree with referents. Yes No ( %)
26. Sophisticated or advanced vocabulary is used. Yes No ( %)
27. A wide range of vocabulary is used. Yes No ( %)
28. Vocabulary choices are appropriate for conveying the intended meaning.
Yes No ( %)
29. This essay demonstrates facility with appropriate collocations.
Yes No ( %)
30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.
Yes No ( %)
31. Words are spelled correctly. Yes No ( %)
32. Punctuation marks are used appropriately. Yes No ( %)
33. Capital letters are used appropriately. Yes No ( %)
34. This essay contains appropriate indentation. Yes No ( %)
35. Appropriate tone and register are used throughout the essay. Yes No ( %)
APPENDIX K
ASSESSMENT GUIDELINES I (FOR THE PILOT STUDY)
Dear Teachers,
Thank you so much for your interest in my doctoral dissertation study. I am
conducting a study that examines an effective way to provide diagnostic feedback to ESL
writers on a timed essay test. Due to the complex nature of second language writing, ESL
learners need to be well-informed about the strengths and weaknesses in their writing.
Despite the interest in and need for a diagnostic approach in ESL writing instruction and
assessment, however, little is known about what kind of linguistic skills or strategies
must be diagnosed, and in what ways. It is thus critical to have the opinions of ESL
writing teachers as a source of accurate diagnostic feedback to ESL learners. The
information they provide will ultimately result in a detailed diagnostic description that
can be tailored to and made available to individual ESL writers.
Over the past few months, I have worked with nine ESL writing teachers to
develop a diagnostic assessment scheme. The teachers were invited to a think-aloud
session in which they verbally reported their thinking processes while providing
diagnostic feedback on 10 ESL timed essays. The essays were written by adult ESL
learners with a wide range of English proficiency levels in a large-scale testing setting
within 30 minutes. The verbal data that the teachers provided were analyzed, and
emerging themes were coded. Thirty-nine separate themes were identified, each
consisting of one descriptor of ESL academic writing. These 39 descriptors were then
reviewed by four PhD students specializing in ESL writing. These experts' review
resulted in the deletion of four descriptors. Using the remaining 35 descriptors, I
have created a diagnostic assessment scheme, called the “Empirically-derived Descriptor-
based Diagnostic (EDD) checklist.”
Now, I would like you to mark the enclosed essays using the EDD checklist.
Before marking the essays, please read the EDD checklist carefully and internalize it.
You will be asked to answer yes or no to each descriptor in relation to each essay. I
understand that it is not easy to determine the cut-off of yes or no. If you think a writer
generally meets the criteria of the descriptor, it should be considered a yes. Otherwise, it
is considered a no. Here, generally means that a student's mistakes on the skill being
assessed neither distract you nor compromise your comprehension. When you make this
decision for each descriptor, please specify your
confidence level in the blank box next to Yes No (please specify your confidence
level on 10 essays [i.e., 5 essays × 2 prompts]). If you are extremely confident in using
the descriptor, your confidence level will be 100%. On the other hand, if you are not
confident at all in answering yes or no to a descriptor, then your confidence level will be
0%. You can specify your confidence level anywhere along the continuum between 0%
and 100% (e.g., 30%, 50%, 70%, etc).
Below, I explain the meaning of some descriptors that might cause confusion. I have
selected only a few descriptors to aid your understanding; however,
if there is anything that you are not sure of, please do not hesitate to let me know.
1. This essay answers the question.
: If a writer addresses a topic that is not relevant to the given question, or does
not respond to the specific instructions in the prompt, he or she would not
satisfy this descriptor.
6. There are enough supporting ideas and examples in this essay.
: As long as a writer presents a minimum of two supporting ideas and examples
in his or her essay, he or she would satisfy this descriptor.
9. The ideas are organized into paragraphs and include an introduction, a body,
and a conclusion.
: If a writer does not include an introduction, a body, and (not or) a conclusion
in his or her essay, he or she would not satisfy this descriptor.
15. This essay demonstrates syntactic variety.
: If a writer demonstrates the ability to use a variety of syntactic structures
including simple, compound, and complex sentences, he or she would satisfy
this descriptor.
27. A wide range of vocabulary is used.
: If a writer uses a broad range of vocabulary and varied synonyms, he or she
would satisfy this descriptor. If, on the other hand, a writer uses the same
words repeatedly, he or she would not satisfy this descriptor.
28. Vocabulary choices are appropriate for conveying the intended meaning.
: If a writer employs inappropriate word choices without knowing the accurate
meaning of the words, he or she would not satisfy this descriptor. For example,
if an essay reads “study extravagant subjects,” the writer obviously does not
know the accurate meaning of 'extravagant.'
29. This essay demonstrates facility with appropriate collocations.
: If a writer uses collocations inappropriately, he or she would not satisfy this
descriptor. For example, if an essay reads “a person does a decision” instead
of “a person makes a decision,” the writer would not satisfy this descriptor. In
addition, if an essay shows awkward word-for-word translations, the writer
would not satisfy this descriptor.
![Page 285: An Argument-Based Validity Inquiry into the Empirically ... · This study built and supported arguments for the use of diagnostic assessment in English as a second language (ESL)](https://reader031.vdocuments.net/reader031/viewer/2022011920/60241a5076274865387c78e5/html5/thumbnails/285.jpg)
273
30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.
: If an essay reads “Canada is safety” instead of “Canada is safe,” the writer
would not satisfy this descriptor.
32. Punctuation marks are used appropriately.
: If a writer does not use punctuation marks (e.g., commas, full stops, colons,
question marks, quotation marks, etc.) appropriately, he or she would not
satisfy this descriptor. For example, if a writer uses a comma in the wrong
place or does not know how to use colons correctly, he or she would not
satisfy this descriptor.
35. Appropriate tone and register are used throughout the essay.
: If a writer does not employ appropriate academic tone and register, he or she
would not satisfy this descriptor. For example, “in a nutshell” is too colloquial
to be used in academic writing.
When you mark the essays using the EDD checklist, please do not forget to write
down the number of the essay that you are marking on the EDD checklist.
If you have any questions about the EDD checklist or any other concerns, please
do not hesitate to contact me at [email protected]. Thank you again for your
support for the study.
Sincerely,
Youn-Hee Kim
Ph.D. candidate, Second Language Education
Department of Curriculum, Teaching and Learning
Ontario Institute for Studies in Education, University of Toronto
Email: [email protected]
APPENDIX L
ASSESSMENT GUIDELINES II (FOR THE MAIN STUDY)
Dear Teachers,
Thank you so much for your interest in my doctoral dissertation study. I am
conducting a study that examines an effective way to provide diagnostic feedback to ESL
writers on a timed essay test. Due to the complex nature of second language writing, ESL
learners need to be well-informed about the strengths and weaknesses in their writing.
Despite the interest in and need for a diagnostic approach in ESL writing instruction and
assessment, however, little is known about what kind of linguistic skills or strategies
must be diagnosed, and in what ways. It is thus critical to have the opinions of ESL
writing teachers as a source of accurate diagnostic feedback to ESL learners. The
information they provide will ultimately result in a detailed diagnostic description that
can be tailored to and made available to individual ESL writers.
Over the past few months, I have worked with nine ESL writing teachers to
develop a diagnostic assessment scheme. The teachers were invited to a think-aloud
session in which they verbally reported their thinking processes while providing
diagnostic feedback on 10 ESL timed essays. The essays were written by adult ESL
learners with a wide range of English proficiency levels in a large-scale testing setting
within 30 minutes. The verbal data that the teachers provided were analyzed, and
emerging themes were coded. Thirty-nine separate themes were identified, each
consisting of one descriptor of ESL academic writing. These 39 descriptors were then
reviewed by four PhD students specializing in ESL writing. These experts' review
resulted in the deletion of four descriptors. Using the remaining 35 descriptors, I
have created a diagnostic assessment scheme, called the “Empirically-derived Descriptor-
based Diagnostic (EDD) checklist.”
Now, I would like you to mark the enclosed essays using the EDD checklist.
Before marking the essays, please read the EDD checklist carefully and internalize it.
You will be asked to answer yes or no to each descriptor in relation to each essay. I
understand that it is not easy to determine the cut-off of yes or no. If you think a writer
generally meets the criteria of the descriptor, it should be considered a yes. Otherwise, it
is considered a no. Here, generally means that a student's mistakes on the skill being
assessed neither distract you nor compromise your comprehension. When you make this
decision for each descriptor, please specify your
confidence level in the blank box next to Yes No (please specify your confidence
level on 10 essays [i.e., 5 essays × 2 prompts]). If you are extremely confident in using
the descriptor, your confidence level will be 100%. On the other hand, if you are not
confident at all in answering yes or no to a descriptor, then your confidence level will be
0%. You can specify your confidence level anywhere along the continuum between 0%
and 100% (e.g., 30%, 50%, 70%, etc).
Below, I explain the meaning of some descriptors that might cause confusion. I have
selected only a few descriptors to aid your understanding; however,
if there is anything that you are not sure of, please do not hesitate to let me know.
1. This essay answers the question.
: If a writer addresses a topic that is not relevant to the given question, or does
not respond to the specific instructions in the prompt, he or she would not
satisfy this descriptor.
6. There are enough supporting ideas and examples in this essay.
: As long as a writer presents a minimum of two supporting ideas and examples
in his or her essay, he or she would satisfy this descriptor.
9. The ideas are organized into paragraphs and include an introduction, a body,
and a conclusion.
: If a writer does not include an introduction, a body, and (not or) a conclusion
in his or her essay, he or she would not satisfy this descriptor.
15. This essay demonstrates syntactic variety.
: If a writer demonstrates the ability to use a variety of syntactic structures
including simple, compound, and complex sentences, he or she would satisfy
this descriptor.
27. A wide range of vocabulary is used.
: If a writer uses a broad range of vocabulary and varied synonyms, he or she
would satisfy this descriptor. If, on the other hand, a writer uses the same
words repeatedly, he or she would not satisfy this descriptor.
28. Vocabulary choices are appropriate for conveying the intended meaning.
: If a writer employs inappropriate word choices without knowing the accurate
meaning of the words, he or she would not satisfy this descriptor. For example,
if an essay reads “study extravagant subjects,” the writer obviously does not
know the accurate meaning of 'extravagant.'
29. This essay demonstrates facility with appropriate collocations.
: If a writer uses collocations inappropriately, he or she would not satisfy this
descriptor. For example, if an essay reads “a person does a decision” instead
of “a person makes a decision,” the writer would not satisfy this descriptor. In
addition, if an essay shows awkward word-for-word translations, the writer
would not satisfy this descriptor.
30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.
: If an essay reads “Canada is safety” instead of “Canada is safe,” the writer
would not satisfy this descriptor.
32. Punctuation marks are used appropriately.
: If a writer does not use punctuation marks (e.g., commas, full stops, colons,
question marks, quotation marks, etc.) appropriately, he or she would not
satisfy this descriptor. For example, if a writer uses a comma in the wrong
place or does not know how to use colons correctly, he or she would not
satisfy this descriptor.
34. This essay contains appropriate indentation.
: If a writer does not use approximately five to seven spaces to indent the first
sentence of each paragraph, he or she would not satisfy this descriptor.
35. Appropriate tone and register are used throughout the essay.
: If a writer does not employ appropriate academic tone and register, he or she
would not satisfy this descriptor. For example, “in a nutshell” is too colloquial
to be used in academic writing.
Please take note of the following points:
(1) When you determine what constitutes 'few' on descriptors 3, 17, and 18,
consider how noticeable the linguistic errors are. For example, if you find
that fragmentary sentences draw your attention, the essay would not satisfy
descriptor 17.
3. This essay is concisely written and contains few redundant ideas or
linguistic expressions.
17. This essay contains few sentence fragments.
18. This essay contains few run-on sentences or comma splices.
(2) There are fundamental differences between descriptors 6, 7, and 8.
6. There are enough supporting ideas and examples in this essay.
7. The supporting ideas and examples in this essay are appropriate and
logical.
8. The supporting ideas and examples in this essay are specific and detailed.
(3) There is also a difference between descriptors 2 and 19:
2. This essay is written clearly enough to be read without having to guess
what the writer is trying to say.
19. Grammatical or linguistic errors in this essay do not impede
comprehension.
: While descriptor 2 indicates that an essay might not be read easily for
many reasons (e.g., poor organization, poor content, or linguistic errors),
descriptor 19 focuses primarily on grammatical or linguistic errors that
impede comprehension.
(4) When determining the degree of 'vocabulary sophistication' and 'vocabulary
breadth' on descriptors 26 and 27, consider the context in which the essays
were written. These essays were written by adult ESL students who wish to
be admitted to a college/university or a graduate school in English-speaking
countries.
26. Sophisticated or advanced vocabulary is used.
27. A wide range of vocabulary is used.
(5) Also, please pay attention to the slight difference between descriptors 9 and
34.
9. The ideas are organized into paragraphs and include an introduction, a
body, and a conclusion.
34. This essay contains appropriate indentation.
: While descriptor 9 focuses on whether a writer is able to organize his
or her ideas into paragraphs using an appropriate essay structure (i.e.,
introduction, body, and conclusion), descriptor 34 asks whether a writer
has indented the first sentence of each paragraph to make a visual
distinction between the paragraphs.
(6) If a writer does not employ the relevant linguistic features, please do not
mark the yes or no box.
14. Transition devices are used effectively.
: If a writer does not employ transition devices at all, please do not
mark the yes or no box.
29. This essay demonstrates facility with appropriate collocations.
: If a writer does not employ collocations at all, please do not mark the
yes or no box.
When you mark the essays using the EDD checklist, please do not forget to write
down the number of the essay that you are marking on the EDD checklist.
If you have any questions about the EDD checklist or any other concerns, please
do not hesitate to contact me at [email protected]. Thank you again for your
support for the study.
Sincerely,
Youn-Hee Kim
Ph.D. candidate, Second Language Education
Department of Curriculum, Teaching and Learning
Ontario Institute for Studies in Education, University of Toronto
Email: [email protected]
APPENDIX M
CORRELATIONS BETWEEN ETS SCORES AND TEACHER SCORES
Table M-1
Correlation Matrix of Essay Set 1
         ETS     Ann     Shelley  Sarah
ETS      1.00
Ann      .82**   1.00
Shelley  .84**   .80**   1.00
Sarah    .89     .90**   .79**    1.00
** indicates p < .01
Table M-2
Correlation Matrix of Essay Set 2
        ETS     James   Beth    George
ETS     1.00
James   .88**   1.00
Beth    .86**   .76*    1.00
George  .77**   .75*    .76*    1.00
** indicates p < .01, * indicates p < .05
Table M-3
Correlation Matrix of Essay Set 3
        ETS     Judy    Tim     Esther
ETS     1.00
Judy    .98**   1.00
Tim     .98**   .96**   1.00
Esther  .95**   .92**   .93**   1.00
** indicates p < .01
Note. The correlation coefficients in Essay Set 3, which contained shorter essays, were greater than
those in Essay Sets 1 and 2. Further research is recommended on the relationship between essay length
and the magnitude of correlation coefficients.
APPENDIX N
DESCRIPTOR MEASURE STATISTICS
Descriptor   Obsvd Average   Fair-M Average   Measure (logits)   Model S.E.   Infit MnSq   Outfit MnSq   Corr. PtBis
D01 0.6 0.59 0.10 0.15 1.17 1.28 0.11
D02 0.6 0.58 0.14 0.15 0.86 0.84 0.36
D03 0.5 0.46 0.64 0.15 0.90 0.90 0.33
D04 0.6 0.60 0.05 0.15 0.99 1.06 0.27
D05 0.4 0.35 1.06 0.15 0.92 0.88 0.29
D06 0.4 0.41 0.84 0.15 0.94 0.91 0.29
D07 0.5 0.46 0.64 0.15 0.93 0.87 0.31
D08 0.4 0.42 0.79 0.15 0.92 0.89 0.31
D09 0.7 0.75 -0.64 0.17 1.12 1.19 0.16
D10 0.4 0.41 0.84 0.15 0.93 0.93 0.30
D11 0.6 0.59 0.10 0.15 0.95 0.92 0.31
D12 0.7 0.69 -0.33 0.16 0.97 0.93 0.28
D13 0.6 0.61 0.02 0.15 0.99 0.97 0.27
D14 0.4 0.43 0.73 0.15 1.06 1.10 0.21
D15 0.7 0.67 -0.25 0.16 0.85 0.79 0.39
D16 0.9 0.88 -1.55 0.21 0.96 0.89 0.22
D17 0.6 0.64 -0.09 0.15 0.93 0.91 0.32
D18 0.6 0.55 0.25 0.15 1.19 1.28 0.10
D19 0.5 0.47 0.59 0.15 0.89 0.86 0.32
D20 0.6 0.67 -0.23 0.16 1.08 1.09 0.18
D21 0.8 0.82 -1.07 0.18 1.03 0.99 0.21
D22 0.8 0.84 -1.19 0.19 1.12 1.27 0.12
D23 0.7 0.69 -0.35 0.16 1.02 1.02 0.21
D24 0.6 0.66 -0.18 0.16 1.16 1.23 0.12
D25 0.8 0.83 -1.12 0.18 0.94 0.99 0.25
D26 0.3 0.28 1.41 0.16 0.92 0.85 0.30
D27 0.5 0.46 0.64 0.15 0.87 0.83 0.36
D28 0.7 0.71 -0.43 0.16 1.03 1.06 0.20
D29 0.4 0.38 0.93 0.15 0.85 0.79 0.36
D30 0.6 0.58 0.13 0.15 1.04 1.04 0.20
D31 0.5 0.50 0.46 0.15 1.09 1.12 0.17
D32 0.6 0.62 -0.04 0.15 0.98 0.99 0.26
D33 0.8 0.85 -1.23 0.19 1.05 1.21 0.17
D34 0.6 0.57 0.17 0.15 1.27 1.35 0.06
D35 0.9 0.91 -1.82 0.22 1.04 1.29 0.10
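The Fair-M averages and infit/outfit mean-squares above come from a many-facet Rasch analysis, but the final column, the point-biserial correlation, can be computed directly from the dichotomous descriptor marks. A minimal sketch with made-up yes/no marks and checklist totals (not the study's data):

```python
from statistics import mean, pstdev

def point_biserial(item, total):
    """Point-biserial correlation between a dichotomous item
    (1 = yes, 0 = no) and a continuous total score, using the
    population standard deviation."""
    p = mean(item)                  # proportion of "yes" marks
    q = 1 - p
    m1 = mean([t for t, i in zip(total, item) if i == 1])  # mean total, "yes" group
    m0 = mean([t for t, i in zip(total, item) if i == 0])  # mean total, "no" group
    return (m1 - m0) / pstdev(total) * (p * q) ** 0.5

# Hypothetical marks on one descriptor across eight essays, paired with
# each essay's total number of "yes" marks on the 35-item checklist.
yes_no = [1, 0, 1, 1, 0, 1, 0, 1]
totals = [28, 12, 25, 30, 15, 27, 10, 33]
print(round(point_biserial(yes_no, totals), 2))
```

A descriptor whose "yes" marks go to essays with high checklist totals yields a high point-biserial; values near zero, as for D34 (0.06), suggest the descriptor discriminates poorly.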
APPENDIX O
THE INITIAL Q-MATRIX
Descriptor CON ORG GRM VOC MCH
D01 1 0 0 0 0
D02 1 1 0 0 0
D03 1 1 0 1 0
D04 1 1 0 0 0
D05 1 1 0 0 0
D06 1 0 0 0 0
D07 1 1 0 0 0
D08 1 0 0 0 0
D09 0 1 0 0 0
D10 0 1 0 0 0
D11 1 1 0 0 0
D12 0 1 0 0 0
D13 1 1 0 0 0
D14 0 1 1 1 0
D15 0 0 1 0 0
D16 0 0 1 0 0
D17 0 0 1 0 1
D18 0 0 1 0 1
D19 0 0 1 0 0
D20 0 0 1 0 0
D21 0 0 1 0 0
D22 0 0 1 0 0
D23 0 0 1 0 0
D24 0 0 1 0 0
D25 0 0 1 0 0
D26 0 0 0 1 0
D27 0 0 0 1 0
D28 0 0 0 1 0
D29 0 0 1 1 0
D30 0 0 1 1 0
D31 0 0 1 0 1
D32 0 0 0 0 1
D33 0 0 1 0 1
D34 0 1 0 0 1
D35 1 1 1 1 1
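The Q-matrix encodes which skills each descriptor taps: a 1 in a column means the descriptor measures that skill. A minimal sketch reproducing a few rows from the table above; reading the column abbreviations as Content, Organization, Grammar, Vocabulary, and Mechanics is my interpretation, not stated in the table itself.

```python
# Column abbreviations, presumed to stand for Content, Organization,
# Grammar, Vocabulary, and Mechanics.
SKILLS = ["CON", "ORG", "GRM", "VOC", "MCH"]

# A few rows copied from the initial Q-matrix in Appendix O.
Q = {
    "D01": [1, 0, 0, 0, 0],
    "D03": [1, 1, 0, 1, 0],
    "D17": [0, 0, 1, 0, 1],
    "D35": [1, 1, 1, 1, 1],
}

def skills_tapped(descriptor):
    """Return the skills a descriptor measures, per the Q-matrix row."""
    return [s for s, flag in zip(SKILLS, Q[descriptor]) if flag]

print(skills_tapped("D03"))  # ['CON', 'ORG', 'VOC']
```

In a cognitive diagnostic model, this mapping determines which skill parameters each descriptor's yes/no responses inform.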