
An Argument-Based Validity Inquiry into the Empirically-Derived Descriptor-

Based Diagnostic (EDD) Assessment in ESL Academic Writing

by

Youn-Hee Kim

A thesis submitted in conformity with the requirements

for the degree of Doctor of Philosophy

Department of Curriculum, Teaching and Learning

Ontario Institute for Studies in Education

University of Toronto

© Copyright by Youn-Hee Kim (2010)


An Argument-Based Validity Inquiry into the Empirically-Derived Descriptor-

Based Diagnostic (EDD) Assessment in ESL Academic Writing

Doctor of Philosophy (2010)

Youn-Hee Kim

Department of Curriculum, Teaching and Learning

University of Toronto

Abstract

This study built and supported arguments for the use of diagnostic assessment in

English as a second language (ESL) academic writing. In the two-phase study, a new

diagnostic assessment scheme, called the Empirically-derived Descriptor-based

Diagnostic (EDD) checklist, was developed and validated for use in small-scale

classroom assessment. The checklist assesses ESL academic writing ability using

empirically-derived evaluation criteria and estimates skill parameters in a way that

overcomes the problems associated with the limited number of items available to diagnostic models.

Interpretations of and uses for the EDD checklist were validated using five assumptions:

(a) that the empirically-derived diagnostic descriptors that make up the EDD checklist

are relevant to the construct of ESL academic writing; (b) that the scores derived from

the EDD checklist are generalizable across different teachers and essay prompts; (c) that

performance on the EDD checklist is related to performance on other measures of ESL

academic writing; (d) that the EDD checklist provides a useful diagnostic skill profile for

ESL academic writing; and (e) that the EDD checklist helps teachers make appropriate

diagnostic decisions and has the potential to positively impact teaching and learning ESL

academic writing.

Using a mixed-methods research design, four ESL writing experts created the

EDD checklist from 35 descriptors of ESL academic writing. These descriptors had been


elicited from nine ESL teachers' think-aloud verbal protocols, in which they provided

diagnostic feedback on ESL essays. Ten ESL teachers utilized the checklist to assess 480

ESL essays and were interviewed about its usefulness. Content reviews from ESL writing

experts and statistical dimensionality analyses determined that the underlying structure of

the EDD checklist consists of five distinct writing skills: content fulfillment,

organizational effectiveness, grammatical knowledge, vocabulary use, and mechanics.

The Reduced Reparameterized Unified Model (Hartz, Roussos, & Stout, 2002) then

demonstrated the diagnostic quality of the checklist and produced fine-grained writing

skill profiles for individual students. Overall teacher evaluation further justified the

validity claims for the use of the checklist. The pedagogical implications of the use of

diagnostic assessment in ESL academic writing were discussed, as were the contributions

that it would make to the theory and practice of second language writing instruction and

assessment.


Acknowledgements

Looking back on my life as a graduate student at OISE/UT, I feel that I was most

fortunate to have had the opportunity to grow and develop as an academic. My PhD

program not only provided me with intellectual knowledge regarding second language

education and educational measurement, but also transformed, shifted, and nurtured my

beliefs, thoughts, and values. This invaluable learning experience was made possible by

the constant guidance of many people.

First of all, I would like to express my deep gratitude to my dissertation

supervisor, Dr. Eunice Jang, for her enormous support and encouragement. Dr. Eunice

Jang was a wonderful academic advisor and the strongest supporter of my research. She

was always available when I needed her and would sit for hours, patiently listening to me,

inspiring me to think deeper, and guiding me in the most advantageous direction. It was

also a privilege to work with her on numerous language assessment projects. Without her,

my four-year PhD journey would not have been quite so fulfilling and rewarding.

My appreciation also goes to Dr. Ruth Childs. Her expertise in educational

measurement was an invaluable resource, and her advice, insight, and enthusiasm

inspired me to complete this research project. Although I took numerous statistics and

educational measurement courses with her, I already miss her Test Theory course. She

demonstrated that statistical concepts are not necessarily complex and can be easily

applied to other research inquiries. I would also like to thank her for inviting me to her

Datahost laboratory meetings, where I was able to meet outstanding psychometrician

colleagues.

Words cannot sufficiently express my gratitude to Dr. Sharon Lapkin. She was

always there when I needed her desperately, giving me unselfish support and

encouragement. Her commitment and enthusiasm for second language acquisition and

learning research were also inspirational and something that I wish to forever emulate. I

was most fortunate to be around her, as she was an excellent role model to many graduate

students. I will never forget her unwavering support and encouragement.

This research would not have been possible without the generous and kind

assistance of many people. I am especially thankful to the ESL teachers who participated


in the study. They spent many hours marking essays and proposing ways to develop a

more effective assessment scheme. I would also like to thank Mohammed Al-Alawi,

Seung Won Jun, Robert Kohls, and Jennifer Wilson for their time and insightful

suggestions for my study. Thanks are also due to my friends and colleagues in the Modern

Language Centre at OISE/UT for their friendship. I am sincerely thankful to Khaled

Barkaoui, Seung Won Jun, Eun-Yong Kim, Robert Kohls, Geoff Lawrence, Hyunjung

Shin, Wataru Suzuki, Yasuyo Tomita, and Jennifer Wilson.

I would also like to acknowledge that this research project was fully supported

by The International Research Foundation for English Language Education (TIRF)

Doctoral Dissertation Grant and TOEFL Small Grants for Doctoral Research in Second

or Foreign Language Assessment. I am also deeply grateful for the financial support from

OISE/UT, which enabled me to continue my PhD program in Toronto for several years.

My appreciation also goes to Mr. Jaewoon Choi, the former principal of Daegu

Foreign Language High School in Korea. My memory of him dates back ten years when

I worked as an English teacher at the school. One day, he took me out for dinner and

asked me what it was like to be an English teacher. The conversation that we had

reestablished my vision as an educator and motivated me to pursue a higher education.

Without that thought-provoking moment, I would not have ever dreamed of pursuing a

graduate degree. I miss his intellect and insights and hope that our paths will cross

someday.

I am also deeply indebted to Daegu Foreign Language High School and Daegu

Metropolitan Office of Education for their unwavering support during my leave of

absence. I would like to sincerely thank Principal Sung-Whan Choi, Vice Principal

Sang-Ho Soh, former Vice Principal Young-Ok Noh, Mr. Jaehan Bae, and many other

English teachers at the school. I am also grateful to Mr. Young-Mok Nam at Daegu

Metropolitan Office of Education.

My warmest thanks go out to my family in Korea. It would not have been

possible for me to complete my PhD program without their love, patience, and

understanding. My uncle and aunt in Chicago also deserve my deepest gratitude. I cannot

forget the summer of 2006 when they crossed the border with their car packed with my

many belongings. Thanks to their help, I was able to settle in Toronto without difficulty. I


am also sincerely thankful to my mother, who made long distance calls every morning to

remind me that she stood by me and loved me. Our daily dialogues meant much more to

me than mere words and they remain a happy memory. I dedicate this dissertation to her.


TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION…………………………………………………….........1

Overview of the Research Problem………………………………………………1

Argument-Based Approaches to Validity………………………………………...5

Overarching Research Framework……………………………………………10

Research Questions……………………………………………………………..12

Significance of the Study……………………………………………………….13

Chapter Overview………………………………………………………………14

CHAPTER 2 REVIEW OF LITERATURE……………………………………………..15

Approaches to L2 Writing Assessment…………………………………………15

Approaches to Diagnostic Assessment………………………………………….47

CHAPTER 3 METHODOLOGY………………………………………………………..57

Research Questions……………………………………………………………..57

Research Design Overview……………………………………………………..57

Participants……………………………………………………………………...62

Instruments……………………………………………………………………...64

Data Collection and Analysis Procedures………………………………………67

Summary………………………………………………………………………..85

CHAPTER 4 DEVELOPMENT OF THE EDD CHECKLIST…………………………87

Introduction……………………………………………………………………..87

Identification of EDD Descriptors……………………………………………...87

Characteristics of EDD Descriptors……………………………………….......122

Refinement of EDD Descriptors………………………………………………125

Summary………………………………………………………………………129

CHAPTER 5 PRELIMINARY EVALUATION OF THE EDD CHECKLIST………...130

Introduction……………………………………………………………………130


Teacher and Essay Prompt Effects…………………………………………….130

Correlation between EDD and TOEFL Scores………………………………..141

Teacher Perceptions and Evaluations………………………………………….141

Summary………………………………………………………………………148

CHAPTER 6 PRIMARY EVALUATION OF THE EDD CHECKLIST………………149

Introduction……………………………………………………………………149

Characteristics of the Diagnostic ESL Academic Writing Skill Profiles……149

Correlation between EDD and TOEFL Scores………………………………..176

Teacher Perceptions and Evaluations………………………………………….177

Summary………………………………………………………………………195

CHAPTER 7 SYNTHESIS…………………………………………………………….196

Introduction……………………………………………………………………196

Validity Assumptions Revisited……………………………………………….196

Implications……………………………………………………………………208

Suggestions for Future Research………………………………………………216

REFERENCES…………………………………………………………………………219

APPENDIX A Definitions of Key Terms………………………………………………239

APPENDIX B ESL Teacher Profile……………………………………………………241

APPENDIX C Guidelines for a Think-aloud Session………………………….............242

APPENDIX D Teacher Questionnaire…………………………………………………247

APPENDIX E Guiding Interview Questions for Teachers…………………..................258

APPENDIX F Textual Characteristics of the Three Essay Sets………..........................260

APPENDIX G Order of Essays in Each Set……………………………………………263

APPENDIX H Excerpts from Teacher Think-aloud Transcripts………………………264

APPENDIX I The EDD Checklist……………………………………………………267

APPENDIX J The EDD Checklist With Confidence Level……………………............269

APPENDIX K Assessment Guidelines I………………………………………………271

APPENDIX L Assessment Guidelines II………………………………………………274


APPENDIX M Correlations Between ETS Scores and Teacher Scores………………279

APPENDIX N Descriptor Measure Statistics…………………………………………280

APPENDIX O The Initial Q-Matrix……………………………………………………281


LIST OF TABLES

Table Page

1 Synthesis of Writing Construct Elements……………………………………….28

2 Research Design Summary……………………………………………………59

3 The Four Largest First Language Groups………………………………………62

4 Distribution of Test-Takers by Language Groups………………………………63

5 Profile of ESL Writing Experts…………………………………………………64

6 Score Distribution of the TOEFL iBT Independent Essays……………………65

7 Score Distribution of the Three Essay Sets……………………………………67

8 Volume of the Teachers' Think-aloud Transcripts………………………………70

9 Distribution of Essay Batches in the Pilot Study………………………………74

10 Distribution of Essay Batches in the Main Study………………………………78

11 39 Descriptors of ESL Academic Writing Skills………………………………89

12 Inter-Coder Reliability for the 39 Descriptors…………………………………..91

13 Frequency of Descriptors by Teachers and Essay Sets…………………………93

14 Refined 35 EDD Descriptors…………………………………………………127

15 FACETS Data Summary………………………………………………………131

16 Distribution of Unexpected Responses across Teachers………………………131

17 Teacher Measure Statistics……………………………………………………135

18 Teacher Effect…………………………………………………………………137

19 Teacher Agreement on Descriptors……………………………………………139

20 Interactions between Teachers and Descriptors………………………………140

21 Teacher Confidence (%) on the Subject Prompt………………………………143

22 Teacher Confidence (%) on the Cooperation Prompt…………………………144

23 Experts' Descriptor Classification……………………………………………152

24 Descriptor Clusters Identified by DETECT……………………………………154

25 Confirmatory DIMTEST Results………………………………………………156

26 Initial Descriptor Parameter Estimates…………………………………………158

27 The Final Descriptor Parameter Estimates……………………………………160

28 Descriptors with Poor Diagnostic Power………………………………………164

29 Consistency Indices of Skill Classification…………………………………….168

30 Proportion of Incorrect Patterns Classified by the Number of Skills…………168

31 Case Profiles……………………………………………………………………175


LIST OF FIGURES

Figure Page

1 A general procedure for EBB scale development……………………………41

2 FACET variable map………………………………………………………134

3 The scatter plot for teacher agreement and confidence………………………145

4 CCPROX/HCA results………………………………………………………155

5 Density, time-series, and autocorrelation plots for pMCH………………………..157

6 Density, time-series, and autocorrelation plots for r ………………………..157

7 Proportion of skill masters (pk)………………………………………………161

8 Observed and predicted score distributions…………………………………162

9 The relationship between the number of mastered skills and the total

scores…………………………………………………………………………163

10 Performance difference between descriptor masters and non-masters………163

11 Classification of skill mastery………………………………………………165

12 Distribution of the number of mastered skills………………………………166

13 The most common skill mastery pattern in each number of skill mastery

categories……………………………………………………………………….167

14 Proportion of masters for the subject and cooperation prompts……………169

15 The most common skill mastery patterns for the subject prompt……………170

16 The most common skill mastery patterns for the cooperation prompt………170

17 Number of mastered skills for the subject and cooperation prompts………171

18 Proportion of masters across different proficiency levels……………………172

19 Proportion of masters across different proficiency levels for the subject

prompt……………………………………………………………………..……173

20 Proportion of masters across different proficiency levels for the cooperation

prompt…………………………………………………………………………174

21 Number of mastered skills across different proficiency levels………………175

22 An example of the diagnostic ESL writing profile…………………………213


To My Mother


CHAPTER 1

INTRODUCTION

Overview of the Research Problem

Responding to students' writing is an important aspect of second language (L2)

writing programs that is fundamentally concerned with the successful development of

their L2 writing skills. Teachers spend a substantial amount of time providing appropriate

feedback about students' strengths and weaknesses, which students incorporate into their

studies going forward. The significance of feedback has been emphasized in process-

oriented writing instruction, where students have the freedom to revise and resubmit

multiple drafts of their work (Ferris, 2003). As interest in the effect of teacher feedback

on L2 writing has increased, a great deal of recent research has been devoted to exploring

this aspect of second language education. Particular focus has been placed upon whether

feedback makes a difference in students' writing; what role it plays in enhancing students'

writing; effective ways of delivering feedback; and how students react to it (Hyland &

Hyland, 2006). A number of studies have also investigated the nature and effect of

different types of feedback: written, oral, content, form-focused, teacher, peer, computer-

mediated, one-to-one teacher-student conferences, and so on. This proliferation of

research demonstrates the increasing importance of feedback in all writing programs, and

illustrates how teachers and students alike have striven for much finer-grained diagnostic

information about specific writing skills in an L2 context.

Along the same lines, researchers in educational assessment and measurement

have recently shown increased interest in diagnostic approaches that assess and monitor

students' progress in particular academic domains. According to Kunnan and Jang (2009),

The main vision in using diagnostic assessment in large-scale and classroom

assessment contexts is to help assess students' abilities and understanding with

feedback not only about what students know, but about how they think and learn

in content domains, to help teachers have resources of a variety of research-

based classroom assessment tools, to help recognize and support students'

strengths and create more optimal learning environments, and to help students

become critical evaluators of their own learning (Pellegrino, Chudowsky, &

Glaser, 2001).

Partly in response to the limitations of outcome-based assessments from proficiency or

achievement tests, researchers turned to diagnostic assessments that can maximize


pedagogical gains by integrating assessments with instruction and curriculum (Nichols,

1994; Nichols, Chipman, & Brennan, 1995; Pellegrino & Chudowsky, 2003). Current

trends towards higher-quality education and greater accountability have also resulted in

an increasing demand for diagnostic information about individual students' strengths and

weaknesses in classroom-based and large-scale assessments (Leighton & Gierl, 2007).

The current No Child Left Behind (NCLB) legislation in the United States tracks

students' academic achievement, providing useful information to students, parents,

teachers, principals, and school district administrators (DiBello & Stout, 2007).

Standardized large-scale assessment is also moving toward diagnostic assessment; the

College Board's Score Report Plus™ provides detailed information about the test

performance of students who have taken the Preliminary Scholastic Achievement Test

(PSAT) and the National Merit Scholarship Qualifying Test (NMSQT) (DiBello & Stout,

2007).

In L2 assessment and testing, the pressing need for diagnostic assessment is

illustrated by the advent of DIALANG. DIALANG is a European-funded project that

develops computer-based diagnostic tests assessing five aspects of language knowledge

(reading, listening, writing, grammar and vocabulary) in 14 European languages

(Alderson, 2005). It provides detailed information about test results to learners based on

the guiding principles of the Common European Framework of Reference for Languages

(CEFR). Diagnostic scores are reported separately on each subskill in each aspect of

language knowledge, and students can review their assessment profiles and determine

which subskills they need to improve.

The importance of diagnostic information has also been emphasized in

constructing a rating scale. Acknowledging the limitations of behavior-based rating

scales, Brindley (1998) called for research into diagnosis-oriented rating scales:

Rather than continuing to proliferate scales which use generalized and

empirically unsubstantiated descriptors, therefore, it would perhaps be more

profitable to draw on SLA [Second Language Acquisition] and LT [Language

Testing] research to develop more specific empirically derived and

diagnostically oriented scales [italics added] of task performance which are

relevant to particular purposes of language use in particular contexts and to

investigate the extent to which performance on these tasks taps common

components of competence. (p. 134)


Recognizing the problems associated with intuitive or a priori methods in most rating

scales, he placed particular emphasis on empirical sources, as well as the diagnostic

functions that rating scales must have. In a similar vein, Pollitt and Murray (1996)

suggested diagnosis-oriented rating scales, pointing out the limited view of Alderson's

(1991) trichotomous classification of rating scales (i.e., user-oriented, constructor-

oriented, and assessor-oriented scales). The area of English for Specific Purposes (ESP)

was no exception; Grove and Brown (2001) proposed a diagnostic assessment scheme

that can help to assess medical students' oral communicative skills. Although she did not

empirically develop or validate it, Luoma (2004) also put forward the idea of a diagnostic

rating checklist for assessing L2 oral proficiency.

Despite the increasing interest in and need for a diagnostic approach to

educational assessment and L2 assessment and testing, very little research has been

devoted to it. The literature in this area is scant and the concepts are confusing, with no

theoretical foundation (Alderson, 2007). In addition, few principles exist that guide the

development of a diagnostic test, as we know little about what underlying constructs

should be identified, operationalized, and measured (Alderson, 2007). The technical

knowledge that frames diagnostic assessment is also in its early stages (Jang, 2008). In

particular, psychometric measurement models that operationalize diagnostic assessment

are relatively new and therefore little-explored methods in L2 assessment and testing.

Apart from Buck and Tatsuoka's (1998) pioneering introduction of the Rule-Space

procedure to L2 assessment and testing, only a handful of studies have attempted to

empirically explore the potential applications of psychometric diagnostic models to L2

assessment and testing. Jang (2005, 2009a) applied the Reduced Reparameterized

Unified Model ([Reduced RUM], Hartz, Roussos, & Stout, 2002) to two forms of the

reading subtest in the LanguEdge English Language Learning Assessment in order to

evaluate the effect of skills diagnosis on teaching and learning, while Lee and Sawaki

(2009a) investigated the comparability of different diagnostic assessment models

(General Diagnostic Model [von Davier, 2005], Fusion Model [Hartz et al., 2002], and

Latent Class Analysis [Yamamoto & Gitomer, 1993]) on the operational TOEFL iBT

reading and listening subtests. The lack of other significant research warrants further

investigation into how diagnostic models might be incorporated into L2 assessment and


testing.

Considering that students want substantial diagnostic feedback from their

teachers in L2 writing programs (Cohen & Cavalcanti, 1990; Ferris, 1995, 2003; Ferris &

Roberts, 2001; Hedgcock & Lefkowitz, 1994; Hyland, 1998; Lee, 2004; Leki, 1991,

2006; Zhang, 1995), it is conceivable that a diagnostic approach might be an appropriate

means of helping to guide and monitor students' L2 writing progress. While researchers

agree on the potential of such a method, a diagnostic approach has not been encouraged

in a direct L2 writing assessment context. For example, Alderson (2005) argued that “…

in the case of diagnostic tests, which seek to identify relevant components of writing

ability, and assess writers' strengths and weaknesses in terms of such components, the

justification for indirect tests of writing is more compelling” (pp. 155-156). He further

expressed reservations about the use of diagnostic tests in assessing higher-order

integrated language skills:

Indeed, diagnosis is more likely to use relatively discrete-point methods than

integrative ones, for the simple reason that it is harder to interpret performance

on integrated or more global tasks. Thus, for the purpose of identification of

weaknesses, diagnostic tests are more likely to focus on specific elements than

on global abilities, on language rather than on language use skills, and on 'low-

level' language abilities (for example, phoneme discrimination in listening tests)

than 'higher-order' integrated skills. (p. 257)

Alderson's arguments suggest that L2 writing is a multi-faceted and complicated mental

process, and that it is difficult to deconstruct L2 writing into separate elements that

contain tangible diagnostic information; however, his claims have yet to be empirically

investigated and warrant further supporting evidence.

One significant challenge to providing appropriate diagnostic feedback on direct

L2 writing skill assessments is the limited number of writing items that students

complete in testing situations. Students in most large-scale foreign language assessment

programs are required to complete just a single writing item in a given amount of time,

and their L2 writing ability is judged holistically rather than analytically. An aggregate

single score is awarded for the overall quality of writing, but little information is

provided to students about their strengths or weaknesses with regard to specific L2

writing skills. Even when an analytic rating scheme is used, it is difficult to gain a

detailed descriptive evaluation beyond the separate subscore on each writing subskill.


The problem becomes more serious when a small number of writing items are subject to

psychometric measurement models, greatly increasing the likelihood of measurement

errors. This is particularly true in item response theory (IRT)-based diagnostic models,

which require a large sample in order to provide a stable estimate of skill parameters.

This problem must be resolved if L2 diagnostic models are to generate the most accurate

and fine-grained diagnostic information possible.

In response to this need for research, this study built and supported arguments

for the use of diagnostic assessment in English as a second language (ESL) academic

writing. In the two-phase research, a new diagnostic assessment scheme, called the

Empirically-derived Descriptor-based Diagnostic (EDD) checklist, was developed and

validated for use in small-scale classroom assessment. The EDD checklist assesses ESL

academic writing ability based on empirically-derived evaluation criteria and estimates

skill parameters in a novel way that overcomes the problems and limitations associated

with the number of items in diagnostic models. Interpretations of and uses for the EDD

checklist were validated using multiple data sources and from diverse perspectives.

Argument-based approaches to validity provided an overarching logical framework that

guided the development of the EDD checklist and justified its score-based interpretations

and uses. I hope that the argument-based evidentiary reasoning in this study will

ultimately help to examine whether scores derived from the EDD checklist can be used to

diagnose the domain of writing skills required in an ESL academic context.

Argument-Based Approaches to Validity

Building on the work of Cronbach (1971, 1988), House (1980), and Messick

(1989), Kane (1992) argued that test-score interpretation is associated with a chain of

interpretive arguments, and that the plausibility of those arguments determines the

validity of test-score interpretations. Kane also made it clear that validity is connected to

the interpretation of a test score rather than to a test or to the test score itself. In his

seminal article, An argument-based approach to validity, he suggested that interpretive

arguments establish a network of inferences from observations to score-based

conclusions and decisions, and guide the collection of relevant evidence that supports

those inferences and assumptions. A series of different types of inferences are laid out in


interpretive arguments, each of which is articulated based on its underlying assumptions

(Crooks, Kane, & Cohen, 1996; Kane, 1992, 1994; Shepard, 1993). Influenced by the

school of practical reasoning or informal logic (Cronbach, 1982, 1988; House, 1980),

Kane further noted that interpretive arguments are practical in that they are based on

assumptions which cannot be taken as given, and that the evidence supporting those

assumptions cannot be complete. The arguments in test-score interpretations are therefore

plausible or credible, but not decisive, given all available evidence. Kane suggested three

criteria for evaluating so-called practical arguments: (a) the clarity of the argument, (b)

the coherence of the argument, and (c) the plausibility of assumptions. He took particular

care to note that the weakest and most questionable assumptions must be identified and

supported by multiple pieces of evidence (Kane, 1992, 2001, 2004), and that

counterarguments must be identified and refuted in order to reinforce practical arguments

(Cronbach, 1971; Kane, Crooks, & Cohen, 1999; Messick, 1989).

Kane (1992) and Kane et al. (1999) defined the inferences in interpretive

arguments as evaluation, generalization, extrapolation and explanation, and decision.

Each inferential link is based on an assumption that must be supported by evidence. The

first inference, evaluation, links observation of a performance to an observed score, and

is based on the assumptions that the observed performance and test-score interpretation

occur under the same conditions, and that the scoring criteria are used in an appropriate

and consistent manner. Evidence in support of the inference on evaluation is collected by

examining how a test is administered, how students' responses are scored, and what

scoring criteria are used. The second inference, generalization, links an observed score

on a particular test to a universe score (i.e., a score on a test that is similar to the one

from which the observed score is drawn), which assumes that the observed score is based

on random or representative samples from the universe of generalization. The evidence

supporting an inference on generalization involves reliability or generalizability analysis.

The third inference, extrapolation, links a universe score to a target score or score-based

interpretation, extrapolating from a narrowly-defined universe of generalization to a

score on a widely-defined target domain beyond the test. The underlying assumption is

that a score on a test reflects performance on a relevant target domain. Criterion-related

validity evidence can support the inference on extrapolation. The link, explanation, is a


theory-based inference that relates to the construct of interest, assuming that the theory is

plausible. The final inference, decision, links a score-based interpretation to a decision,

based on assumptions about the values and consequences of test use. Kane et al. (1999)

and Kane (2001) cautioned that each inference should be convincing; if any is not

convincing, the defensibility of the entire interpretive argument will be undermined.
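Put schematically, the chain of inferences just described runs:

observation → [evaluation] → observed score → [generalization] → universe score → [extrapolation / explanation] → target score or score-based interpretation → [decision] → decision about test use

Each bracketed inference rests on the assumptions outlined above and must be supported by its own body of evidence.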

Refining Kane (1992) and Kane et al.'s (1999) earlier model of an argument-

based approach to validation, Kane (2001, 2002) further classified interpretations into

descriptive and decision-based or prescriptive. Descriptive interpretations involve

inference from a score to a descriptive estimation of examinees' ability without an

explicit statement about the use of test scores, whereas decision-based or prescriptive

interpretations involve making decisions about examinees based on the descriptive

interpretations. Descriptive interpretations are usually subsumed under decision-based

interpretations. Therefore, there might be cases in which a descriptive interpretation of a

particular test score is valid, but a decision-based interpretation of the use of the test

score is not valid (Kane, 2002). Kane (2002) also classifies inferences and their

supporting assumptions into semantic and policy. Semantic inferences and assumptions

involve descriptive interpretations of test scores, whereas policy inferences and

assumptions are associated with decision-based interpretations. Policy inferences and

assumptions are typically evaluated according to the consequences of a particular

decision: a policy with positive consequences is considered effective, whereas a policy

with negative consequences is ineffective (Kane, 2002). Despite the importance of

decision-based interpretation and policy inferences and assumptions, Kane (2002) points

out that most validity research has glossed over these areas. Argument-

based approaches have recently been viewed as a two-part scheme consisting of an

interpretive argument and a validity argument. An interpretive argument states an

intended interpretation and use of a test score, whereas a validity argument critically

evaluates the plausibility of the interpretive argument based upon empirical investigation

of its inferences and assumptions (Kane, 2002, 2004).

The substantive argument-based approach has been taken up by others. Mislevy,

Steinberg, and Almond (2002, 2003) proposed an assessment argument mechanism,

evidence-centered assessment design (ECD), which organizes interrelated assessment


elements from conceptual test development to operational processes. In the ECD

framework, coherent arguments are established and assessment elements are developed

to transform the arguments into an operational process (Mislevy et al., 2002). ECD views

assessment as evidentiary reasoning, and builds and supports an assessment argument

that aligns with the assessment purpose. Evidentiary reasoning is influenced by

Toulmin's (2003) argument structure, relying upon the chain of claim, data, warrant,

backing, and alternative explanations. According to Toulmin, a chain of reasoning makes

claims based on data, warrants, and rebuttals. A claim is a conclusion that we wish to

justify based upon data (what Kane called an interpretive argument). Data are any

available information, facts, or evidence on which a claim is built, while a

warrant is justification of the inferential link between the claim and the data. A backing

is any theory, previous research, experience, or evidence that supports the warrant, a

rebuttal is a counterclaim that undermines the inference from data to claim, and rebuttal

data are what support or weaken the alternative (Mislevy et al., 2003).
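To make the relationships among these elements concrete, Toulmin's chain can be sketched as a simple data structure. The sketch below is illustrative only; it is not part of the study's instruments, and the example values are hypothetical ones loosely modeled on the EDD context.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    claim: str                      # conclusion to be justified (Kane's interpretive argument)
    data: List[str]                 # information, facts, or evidence on which the claim is built
    warrant: str                    # justification of the inferential link from data to claim
    backing: List[str] = field(default_factory=list)   # theory, research, or experience supporting the warrant
    rebuttals: List[str] = field(default_factory=list)  # counterclaims that would undermine the inference

# Hypothetical illustration (not the study's actual argument specification)
example = ToulminArgument(
    claim="EDD checklist scores reflect distinct ESL academic writing skills",
    data=["Teachers' ratings of essays on the empirically derived descriptors"],
    warrant="The descriptors were elicited from teachers' diagnostic think-aloud protocols",
    backing=["Expert content reviews", "Dimensionality analyses of the descriptors"],
    rebuttals=["Teacher severity or prompt effects could distort the skill estimates"],
)
```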

An argument-based approach such as Kane's interpretive argumentation and

Mislevy et al.'s evidentiary reasoning has recently been promoted by Bachman (2005),

who proposed an assessment use argument emphasizing the central role of test use or

consequences in a validity argument. Bachman acknowledged that there are no

systematic principles or practical procedures explicitly linking scores and score-based

interpretations to test use and the consequences of test use, in spite of substantial

awareness of test use and consequences in validity arguments (e.g., Kane, 2001, 2002,

2004; Messick, 1989). He also pointed out that issues that fundamentally concern validity,

such as test usefulness (Bachman & Palmer, 1996), fairness (Kunnan, 2004), or ethics

(Lynch, 2001), have been addressed separately and have not been explicitly linked to

validity. This lack of a substantive and integrative approach to validity has led Bachman

to develop an assessment use argument based on Kane's interpretive argumentation,

Mislevy et al.'s evidentiary reasoning, and Toulmin's argument structure.

An assessment use argument is a two-fold approach, consisting of an assessment

validity argument and an assessment utilization argument. An assessment validity

argument involves an inferential link from performance on a test to an interpretation of a

test score, whereas an assessment utilization argument links a score-based interpretation


to a decision or test use. An assessment utilization argument is associated with what Kane

(2001, 2002) called the decision-based or prescriptive interpretations and policy

inferences. Bachman (2003, 2005) argues that an argument-based approach to validity

should extend to an assessment utilization argument, underscoring the consequences of

test use.

Following Toulmin (2003), both argument structures are built upon a chain of

inferences supported by warrants and backing. In the assessment validity argument,

performance on a test comprises data, and a claim with regard to a test-score

interpretation is made based upon that data. In the assessment utilization argument, test-

score interpretations drawn from the assessment validity argument become the data, and

a decision is made based on the validity claim. The decision to be made becomes the

claim, which is supported by four different types of warrants: relevance, utility, intended

consequences, and sufficiency. Relevance and utility indicate the degree to which a test-

score interpretation is relevant to and useful for decision making, while intended

consequences determine whether using the assessment to make a decision will bring

beneficial consequences to assessment users. Finally, sufficiency refers to the degree to

which the assessment provides sufficient information to make a decision. The rebuttals

are counterclaims that undermine the inference from the data to the claim, and four

different types of rebuttals can be articulated to challenge each warrant.1
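Put schematically, and following the description above: performance on a test → [assessment validity argument] → score-based interpretation → [assessment utilization argument, warranted by relevance, utility, intended consequences, and sufficiency] → decision or use, with each warrant open to challenge by a corresponding rebuttal.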

The argument-based approach to validation provides a logical, coherent, and

unified set of procedures that can guide test developers and help assessment users to

formulate and justify score-based interpretations and assessment decisions (Bachman,

2005; Kane, 2001). This approach has been well accepted across disciplines, and has

been applied across a wide range of studies.

1 Bachman (2005) also suggests two general rebuttals: “reasons for not making the intended decision, or

for making a different decision,” and “unintended consequences of using the assessment and/or making the

decision” (p. 21). However, Fulcher and Davidson (2007) point out that these two rebuttals were derived

from his misunderstanding of the nature of rebuttal or counterclaim. Citing Toulmin (2003), where rebuttal

was defined as “circumstances in which the general authority of the warrant would have to be set aside”

(p. 94), they made it clear that rebuttals should be related to warrants in order to refute the argument.

Rebuttals in an assessment utilization argument should thus be associated with the four types of warrants.


Overarching Research Framework

This study built and supported arguments for the score-based interpretation and

use of the EDD checklist in ESL academic writing. The central research questions were

formulated based upon the logical process of the argument-based approach to validity,

guiding a set of comprehensive procedures for the development of the checklist and

justifying its score-based interpretations and uses. In order to address various aspects of

validity inferences, the following assumptions pertaining to different types of evidence

were examined:

The empirically-derived diagnostic descriptors that make up the EDD checklist

are relevant to the construct of ESL academic writing.

The scores derived from the EDD checklist are generalizable across different

teachers and essay prompts.

Performance on the EDD checklist is related to performance on other measures

of ESL academic writing.

The EDD checklist provides a useful diagnostic skill profile for ESL academic

writing.

The EDD checklist helps teachers make appropriate diagnostic decisions and has

the potential to positively impact teaching and learning ESL academic writing.

The first assumption suggests that the empirically-derived descriptors that make

up the EDD checklist reflect knowledge, processes, and strategies consistent with the

construct of ESL writing in an academic context. In order to test this assumption,

theoretical discussions on ESL academic writing assessment were reviewed and

compared, with a special focus on ESL writing rating scale research and development

procedures. The extent to which EDD descriptors can be viewed independently of each

other or divided into multiple subskills of ESL writing was also explored from diverse

perspectives, using content reviews from ESL academic writing experts and statistical

dimensionality evaluations of the descriptors. If the checklist reflects a multidimensional

view of L2 academic writing (Cumming, 2001; Cumming et al., 2000) and assesses such

diverse aspects as content, organization, and language use, a theory-based inference

would be supported.

The second assumption addresses the potential impact of various sources of

random error associated with sampling conditions of observation. Rater and test method

effects are critical factors that threaten the valid interpretation of test scores and


increase the likelihood of construct-relevant or construct-irrelevant errors. If the scores derived

from the EDD checklist are constant under different conditions of observation involving

different teachers and essay prompts, the generalizability assumption will be supported. A

many-faceted Rasch model was used to explore the reliability issues associated with

teachers and essay prompts. If teachers exhibit random rating patterns or essay prompts

introduce bias into the EDD checklist scores, valid score interpretations will be undermined.
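For readers less familiar with the model, a minimal sketch of a many-facet Rasch specification for this design, written in dichotomous form, is

\log \frac{P_{nijk}}{1 - P_{nijk}} = B_n - D_i - C_j - T_k

where P_{nijk} is the probability that teacher j endorses descriptor i for student n's essay on prompt k, B_n is the student's writing ability, D_i is the difficulty of the descriptor, C_j is the severity of the teacher, and T_k is the difficulty of the prompt. The choice and labeling of the facets here (student, descriptor, teacher, prompt) are assumptions inferred from the study design rather than the exact parameterization reported in Chapter 5; estimates of C_j and T_k, together with their fit statistics, provide the evidence bearing on this assumption.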

The third assumption, related to concurrent or criterion-related validity, examines

the extent to which scores awarded using the EDD checklist are related to those awarded

using other measures of ESL academic writing. This assumption does not necessarily

seek convergent evidence among different measures of ESL academic writing because a

single measure should not automatically be the norm against which others are compared.

Instead, divergent evidence could provide additional insight into the target constructs that

the two different measures intend to assess. A correlation between scores awarded using

the EDD checklist and scores awarded using the TOEFL independent writing rating scale

was calculated. If the two sets of scores are highly correlated, it can be assumed that the

checklist is an effective measure of the ESL writing ability required in an academic

context. However, a low correlation does not necessarily mean that the EDD checklist

does not meet this criterion; rather, it will highlight the different purposes for which the

two measures were developed. While the TOEFL rating scale is intended to place ESL

students into the appropriate proficiency levels, the EDD checklist is intended to provide

them with fine-grained diagnostic feedback.

The fourth assumption suggests that writing skill profiles generated using the

EDD checklist will provide useful and sufficient diagnostic information about students'

strengths and weaknesses in ESL academic writing. This assumption also examines the

extent to which score interpretations made using the EDD checklist are accurate and

reliable. The Reduced Reparameterized Unified Model ([Reduced RUM], Hartz et al.,

2002) was used to explore the diagnostic quality of the checklist from multiple

perspectives. If the evidence indicates strong diagnostic power, it will support the

interpretive inference.
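As background for this analysis, the item response function of the Reduced RUM, in the form published by Hartz et al. (2002), can be written as

P(X_{ij} = 1 \mid \boldsymbol{\alpha}_j, \theta_j) = \pi_i^{*} \prod_{k=1}^{K} (r_{ik}^{*})^{\, q_{ik}(1 - \alpha_{jk})} \, P_{c_i}(\theta_j)

where \pi_i^{*} is the probability that an examinee who has mastered every skill required by descriptor i is credited on that descriptor, r_{ik}^{*} < 1 is the penalty applied when required skill k has not been mastered, q_{ik} is the Q-matrix entry indicating whether descriptor i requires skill k, \alpha_{jk} is examinee j's mastery status on skill k, and P_{c_i}(\theta_j) is a Rasch-type term absorbing ability not specified in the Q-matrix. The notation follows the published model and is given here only as a reference point; within this framework, high \pi_i^{*} and low r_{ik}^{*} values signal descriptors with strong diagnostic power.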

The fifth and final assumption concerns the extent to which the EDD checklist

helps teachers make appropriate and confident diagnostic decisions and gives them a


positive perception of the checklist's diagnostic usefulness. The evidence needed to

support or reject this assumption was gathered primarily from teacher responses to a

questionnaire and in interviews. If teachers report that the EDD checklist helped them

make appropriate and confident diagnostic decisions and has the potential to positively

impact how they diagnose ESL academic writing skills and improve their instructional

practices, it will support this assumption. However, if the checklist does not function as

intended and its use is thought to bring about potentially negative consequences, its

score-based interpretation might not be valid.

Research Questions

The purposes of this research were (a) to develop a new diagnostic assessment

scheme called the Empirically-derived Descriptor-based Diagnostic (EDD) checklist to

assess ESL academic writing skills, and (b) to validate the checklist's score-based

interpretations and uses using multiple data sources and from diverse perspectives.

Argument-based approaches to validity provided an overarching logical framework that

guided the development of the EDD checklist and justified its score-based interpretations

and uses. The five assumptions addressing the different aspects of interpretive arguments

were subsequently used to formulate the central research questions of this study:

1) What empirically-derived diagnostic descriptors are relevant to the construct

of ESL academic writing?

2) How generalizable are the scores derived from the EDD checklist across

different teachers and essay prompts?

3) How is performance on the EDD checklist related to performance on other

measures of ESL academic writing?

4) What are the characteristics of the diagnostic ESL academic writing skill

profiles generated by the EDD checklist?

5) To what extent does the EDD checklist help teachers make appropriate

diagnostic decisions and have the potential to positively impact teaching and

learning ESL academic writing?


Significance of the Study

This study will make significant contributions to theories of diagnostic L2

writing assessment and will have direct implications for instructional practices. Four

research areas are of particular relevance: (a) identification of the ESL writing construct,

(b) development of a diagnostic ESL writing assessment scheme, (c) application of

psychometric diagnostic models to performance assessment, and (d) integration of

feedback research in L2 writing and the diagnostic approach in educational assessment.

First and foremost, this study will enable researchers and test developers to

better understand the construct of ESL writing. Despite abundant research in L2 writing

theories, few studies to date have attempted to identify the latent structure of ESL writing

using both substantive and statistical approaches. This study has empirically identified

assessment criteria using ESL teachers' think-aloud verbal protocols, and has tested their

dimensional structure using a series of conditional covariance-based nonparametric

dimensionality techniques. The findings derived from these analyses will enrich theories

of ESL writing and will provide more specific direction for ESL writing assessment.
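As orientation for readers, the conditional covariance logic underlying these nonparametric techniques can be summarized by the DETECT index for a partition \mathcal{P} of the descriptors into clusters; the expression below is the general published form (e.g., Zhang & Stout, 1999), not a report of this study's own computations:

D(\mathcal{P}) = \frac{2}{n(n-1)} \sum_{i<j} \delta_{ij} \, \mathrm{E}\big[\mathrm{Cov}(X_i, X_j \mid \Theta)\big]

where the sum runs over all descriptor pairs, \delta_{ij} = +1 if the partition places descriptors i and j in the same cluster and -1 otherwise, and the covariance is conditioned on the composite ability \Theta. The partition that maximizes the index suggests the dimensional structure of the descriptor set.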

Second, this study reconceptualizes the current classification of L2 writing scales.

Despite the increasing need for diagnostic assessment, very few scales (e.g., Knoch's

[2007] diagnostic ESL academic writing scale) have been developed to diagnose students'

L2 writing performance. In addition, although a few researchers (e.g., Pollitt & Murray,

1996) have proposed diagnosis-oriented rating scales, these ideas have not been fully

realized within the context of L2 writing assessment. This study responds to the need for

research in this area, and contributes to the current L2 writing scale literature by

developing a diagnostic ESL writing assessment scheme and validating its use.

Third, this study demonstrates the ways in which a psychometric diagnostic

model can be applied to performance assessment. Despite an increasing interest in

assessing productive language skills, the current applications of diagnostic models have

been limited to multiple-choice tests that measure only receptive language skills. This

limited approach has prevented a thorough investigation of students' speaking and

writing performance, and has resulted in only a few studies focused on reading and

listening. The ways in which this study has overcome this constraint are unique, and can

be extended to other diagnostic performance assessments.


Finally, this study fills a gap that exists between feedback research in L2 writing

and the diagnostic approach in educational assessment. Despite the same overarching

goal, the research focus in these two areas has been directed in different ways. Most

feedback research in L2 writing examines the effect of different types of feedback on L2

writing using a qualitative method or case studies, while diagnostic educational

assessment is focused primarily on developing and implementing a psychometric

diagnostic model using large-scale test data. This study expands the scope of feedback

research in L2 writing by introducing a new measurement technique and opening an

avenue for much-needed additional research.

Chapter Overview

There are seven chapters in this thesis. Chapter 1 provides an overview of the

research problem, focusing on the five validity assumptions that guided the checklist's

development and validation. Chapter 2 reviews relevant literature, giving special

attention to the theoretical frameworks of L2 writing assessment and diagnostic

assessment. Chapter 3 describes the methodology used in this study, and provides

information about participants, instruments, and data collection and analysis procedures.

Chapter 4 discusses the ways in which the EDD checklist was developed, and presents

the final checklist. Chapters 5 and 6 report the checklist's evaluation outcomes. Finally,

Chapter 7 synthesizes the research findings and discusses areas of future research.

Definitions of the key terms used in this study are provided in Appendix A.


CHAPTER 2

REVIEW OF LITERATURE

Approaches to L2 Writing Assessment

Demystifying the Construct of L2 Writing

Writing in a second language (L2) is a multi-faceted and complicated language

skill. A variety of linguistic and non-linguistic components constitute the construct of L2

writing, and text- and writer-related variables directly or indirectly interact with writing

processes and products. Numerous attempts have been made to define the construct of L2

writing and to assess L2 writing ability, but no all-encompassing framework has yet been

described (Cumming, 1998, 2001, 2002; Cumming, Kantor, Powers, Santos, & Taylor,

2000; Grabe, 2001). As Cumming (2001) noted:

Unfortunately, as we all know, there is no generally agreed-on definition of this

construct, let alone any substantiated model that is vying for this status. I know

all too well myself, from having tried over several years to start to construct,

with little empirical success, such a model in one setting (see Cumming & Riazi,

2000). Moreover, in recently reviewing the past 5 years‟ published research, … I

was only able to affirm that research has recently highlighted the

multidimensionality of L2 writing. (p. 214)

This view on the multidimensional nature of L2 writing was highlighted in the

development of a framework for the writing subtest of the 2000 Test of English as a

Foreign Language (TOEFL). Cumming et al. (2000) framed the test's guiding principle

by exploring multiple facets of a workable writing conception rather than a rigorous

writing construct, thereby realistically approaching what L2 writing ability really is.

Grabe‟s (2001) perspective differed slightly, relying on theoretical models that

have explanatory and predictive power to describe writing performance in a particular

setting. Although he concluded that these theories were too limited to function as an overarching framework for an L2 writing construct, they do seem to provide useful

insight into how L2 writing ability is organized and conceptualized. Two positions on

writing-as-process are worth particular mention: the cognitive view (e.g., Bereiter &

Scardamalia, 1987; Flower & Hayes, 1981; Kellogg, 1996) and the socio-contextual view

(e.g., Grabe & Kaplan, 1996; Hamp-Lyons & Kroll, 1997; Hayes, 1996; Sperling, 1996).

Flower and Hayes (1981) characterized writing as a cognitively complex mental act involving the

interaction of three sub-processes: planning, translating, and reviewing. This view

assumes that writing occurs in a nonlinear and recursive manner, with overlapping

process components. The socio-contextual view of writing, on the other hand, expands on

the cognitive model by taking additional variables that could affect writing performance

into account. Hayes (1996) reframed writing as an individual-environmental interaction

by focusing on such individual components as motivation and affect, cognitive processes,

working memory and long-term memory, and on such environmental components as

audience and writing task and the medium of writing. His view on writing is clearly

illustrated as follows:

Indeed, writing depends on an appropriate combination of cognitive, affective,

social, and physical conditions if it is to happen at all. Writing is a

communicative act that requires a social context and a medium. It is a generative

activity requiring motivation, and it is an intellectual activity requiring cognitive

processes and memory. No theory can be complete that does not include all of

these components. (p. 5)

In an attempt to organize the parameters involved in writing into a set, Grabe and

Kaplan (1996) proposed a detailed taxonomy of writing skills, knowledge bases, and

processes built on two theories: communicative competence (Bachman, 1990; Canale &

Swain, 1980) and ethnography of writing. The taxonomy was developed by identifying

situation variables such as settings, tasks, tests, and topics and integrating them with writer variables such as linguistic, discourse, and sociolinguistic skills and strategies. Grabe

and Kaplan suggested that this taxonomic approach could provide valuable insights to

researchers, since most writing research is conducted without full consideration of factors

that could affect writing processes and outcomes.

Although these theoretical models contributed greatly to a general understanding

of how writing is organized and conceptualized, they originated in L1 writing

development, a context with limited applications in L2 writing (Grabe, 2001).

Acknowledging the absence of L2-specific models of writing, Silva (1990) suggested

that (a) L2 writing theory, (b) research on the nature of L2 writing, (c) research on L2

writing instruction, (d) L2 writing instruction theory, and (e) L2 writing instruction

practice should be integrated in such model building.

Cumming (1997) and Leki, Cumming, and Silva (2008) looked at the problem

from a somewhat different perspective. Instead of relying on unsubstantiated theories,

they presented several empirical approaches for defining and validating L2 writing ability.

One approach is to analyze the characteristics of written compositions by utilizing such

discourse analytic measures as morphological and syntactic features, and lexical and

grammatical errors. Another approach focuses on rater perceptions and behaviors in

order to verify existing rating scales or empirically explore evaluation criteria (Connor &

Carrell, 1993; Cumming, 1990; Cumming, Kantor, & Powers, 2001, 2002; Lumley, 2002,

2005; Milanovic, Saville, & Shuhong, 1996; Sakyi, 2000; Smith, 2000; Vaughan, 1991).

These two approaches originated in two different areas of research: second language

acquisition (SLA) and language testing (LT), respectively.

The review of current L2 writing research suggests that, despite a concerted

effort to define the construct of L2 writing, no single theory explains what L2 writing

ability is and how it interacts with other cognitive and contextual variables; however, the

two methodological approaches related to discourse analysis and rater perceptions are frequently used to determine the qualities and dimensions of L2 writing. If

the construct of L2 writing can be reliably and validly operationalized using these

methods, valid inferences can be made about students‟ L2 writing ability.

Discourse Analytic Approach

Discourse analytic measures or objective measures (such as the number of T-

units, error-free clauses per T-unit, etc.) are increasingly used as a means of quantifying

the quality of L2 writing, and are believed to be reliable indicators of L2 writing

proficiency.2 These measures enable researchers to quantitatively describe observable

characteristics or qualities of writing performance by tallying the frequencies or

calculating the ratios of certain linguistic features that occur in a written corpus. Many

objective measures are organized into theoretical taxonomies that help to gauge the subcomponents of L2 writing ability. For example, Wolfe-Quintero, Inagaki,

and Kim (1998) conducted a comprehensive analysis investigating the relationship

between L2 writing development and the frequencies, ratios, and indexes of accuracy,

fluency, and complexity measures. Acknowledging that most such measures tend to be

2 Hunt (1970) described T-units as “the shortest units into which a piece of discourse can be cut without

leaving any sentence fragments as residue” (p. 188).

used in a more impressionistic than theoretical manner, they reviewed those that were

used in 39 studies of second or foreign language writing and attempted to identify the

most reliable and valid indicators of development in L2 writing. Wolfe-Quintero et al.

hypothesized that a linear progression of these measures would indicate increasing L2

writing proficiency and operationalized this proficiency to include such variables as

program levels, school levels, classroom grades, standardized tests, rating scales,

comparison with native speakers, and short-term changes in classes.

Wolfe-Quintero et al.‟s (1998) review of the discourse analytic approach

suggests that accuracy is the most researched of all the measures. Even though there is a

lack of consensus among SLA researchers with regard to how to define and

operationalize this concept (Arnaud, 1992; Casanave, 1994; Homburg, 1984; Larsen-

Freeman, 1978, 1983; Larsen-Freeman & Strom, 1977; Perkins, 1980; Vann, 1979; also

see Polio‟s [1997] extensive review on linguistic accuracy in L2 writing research), the

notion of freedom from error (Foster & Skehan, 1996) seems to be the most widely

accepted definition. Errors have been identified in various ways by researchers. Bardovi-

Harlig and Bofman (1989) classified them as syntactic, morphological, or lexical-

idiomatic, while Nas (1975; as cited in Homburg, 1984) categorized errors as first-,

second-, and third-degree, based on their gravity. Methods of counting written errors are

also many and varied. According to Wolfe-Quintero et al. (1998), two approaches are

prevalent: focus on the number of error-free production units (e.g., error-free T-units,

error-free clauses, etc) and on the number of errors that occur within certain production

units (e.g., errors per clause, grammatical errors per word, etc). After reviewing these

measures extensively, they suggest that errors per T-unit and error-free T-units per T-unit

are the most useful with regard to determining accuracy in L2 writing.

Less research has been done on fluency measures, possibly because the unique nature of fluency (i.e., automaticity) is difficult to gauge in written communication. Indeed, Polio

(2001) questions whether fluency has any relation at all to quality of writing. Still, the

extent to which a writer can fluently produce written language has been quantified using

several measures. Of these, two analytic methods have drawn the interest of researchers:

frequency techniques that count the number of words, verbs, clauses, sentences, etc., and

ratio techniques that calculate the number of words per clause, words per sentence, words

per T-unit, etc. (Wolfe-Quintero et al., 1998). Despite the popularity of both of these

methods in SLA and L2 writing studies, Wolfe-Quintero et al. (1998) suggest that ratio

measures are more effective than frequency measures in assessing L2 writing

performance, and that T-unit length (i.e., words per T-unit), error-free T-unit length (i.e.,

words per error-free T-unit), and clause length (i.e., words per clause) are the three most

useful indicators of L2 writing, regardless of writing task, target language, or the ways in

which L2 writing proficiency level is determined.
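
To make the arithmetic behind these ratio measures concrete, the following minimal sketch (in Python, with invented hand-annotated data) computes the indices singled out in the preceding two paragraphs: errors per T-unit and error-free T-units per T-unit for accuracy, and words per T-unit, words per error-free T-unit, and words per clause for fluency. The segmentation into T-units and clauses and the error counts are assumed to come from a human annotator; the data structure, function, and sample values are illustrative only and are not part of any instrument discussed in this thesis.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TUnit:
    """One hand-annotated T-unit: its word count, clause count, and error count."""
    words: int
    clauses: int
    errors: int

def ratio_measures(t_units: List[TUnit]) -> Dict[str, float]:
    """Compute common T-unit-based accuracy and fluency ratios
    (cf. Wolfe-Quintero, Inagaki, & Kim, 1998)."""
    n = len(t_units)
    total_words = sum(t.words for t in t_units)
    total_clauses = sum(t.clauses for t in t_units)
    total_errors = sum(t.errors for t in t_units)
    error_free = [t for t in t_units if t.errors == 0]
    return {
        # accuracy measures
        "errors per T-unit": total_errors / n,
        "error-free T-units per T-unit": len(error_free) / n,
        # fluency measures
        "words per T-unit": total_words / n,
        "words per error-free T-unit":
            sum(t.words for t in error_free) / len(error_free) if error_free else 0.0,
        "words per clause": total_words / total_clauses,
    }

# A hypothetical annotation of a short learner text (three T-units)
sample = [TUnit(words=12, clauses=2, errors=1),
          TUnit(words=9, clauses=1, errors=0),
          TUnit(words=15, clauses=3, errors=2)]
print(ratio_measures(sample))
```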

Complexity, which has been theorized to encompass multiple dimensions of

variation, density, and sophistication, has been examined from both grammatical

(Bardovi-Harlig, 1992; Bardovi-Harlig & Bofman, 1989; Casanave, 1994; Cooper, 1976,

1981; Hinkel, 2003; Homburg, 1984; Ishikawa, 1995; Kameen, 1979; Monroe, 1975;

Shaw & Liu, 1998; Vann, 1979) and lexical perspectives (Engber, 1995; Harley & King,

1989; Hinkel, 2003; Laufer, 1991; Laufer & Nation, 1995; Linnarud, 1986; McClure,

1991; Shaw & Liu, 1998). While the grammatical complexity of writing has been judged

primarily by the presence of specific grammatical features (e.g., passives, adverbial

clauses, nominal clauses, etc) or the ratios of those specific grammatical features within

certain production units (e.g., adverbial clauses per T-unit, coordinate clauses per T-unit,

passives per sentence, etc.), lexical richness has tended to be assessed by ratio measures

(Wolfe-Quintero et al., 1998). According to Wolfe-Quintero et al. (1998), three types of

ratio measures are of specific interest: type/token ratios (e.g., word types per words, verb

types per verbs, etc), type/type ratios (e.g., sophisticated word types per word types,

basic word types per word types, etc), and token/token ratios (lexical words per words,

sophisticated lexical words per lexical words, etc). They reported that clauses per T-unit

and dependent clauses per clause were significantly related to the grammatical

complexity of L2 writing, and a word variation measure (i.e., total number of different

word types divided by the square root of two times the total number of words) and a

lexical sophistication measure (i.e., total number of sophisticated word types divided by

total number of word types) were significantly related to lexical complexity.
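
Stated as formulas, with notation of my own choosing rather than Wolfe-Quintero et al.'s (T for the number of different word types, N for the total number of word tokens, and T_soph for the number of sophisticated word types), the two lexical measures just described are:

```latex
\text{word variation} = \frac{T}{\sqrt{2N}}, \qquad
\text{lexical sophistication} = \frac{T_{\mathrm{soph}}}{T}
```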

Research has also focused on ways in which textual features that extend across

sentence boundaries can be quantified, particularly the extent to which textual structure is

tied together in extended discourse. This concept of cohesion and coherence has given

rise to a large body of research into the pertinent measures. Cohesion refers to explicit

linguistic cues that indicate interrelations between different parts of discourse (Reid,

1992), whereas coherence is a much broader and more complicated phenomenon involving a writer's cognitive processes (Beaman, 1984; as cited in Reid, 1992); cohesion is thus

regarded as a subcomponent of coherence (Halliday & Hasan, 1976; McCulley, 1985;

Yule, 1985). In a seminal publication entitled Cohesion in English, Halliday and Hasan

(1976) discuss the taxonomy of cohesive devices, which specifies five different types of

cohesive ties: substitution, ellipsis, reference, conjunction, and lexical cohesion (see their

work for a more in-depth discussion of these cohesive ties). The influence of this

pioneering work has resulted in a great deal of research to quantify the extent to which a

text holds together and to identify cohesive characteristics that differentiate good and

poor writing (Crowhurst, 1987; Evola, Mamer, & Lentz, 1980; Fitzgerald & Spiegel,

1986; Jafarpur, 1991; McCulley, 1985; Neuner, 1987; Reid, 1992; Tierney & Mosenthal,

1983; Witte & Faigley, 1981). The results of these studies have proven to be mixed; for

example, Witte and Faigley (1981) suggested that good essays tend to show a higher density of cohesive ties than poor essays, but Neuner (1987) found that good writers

used none of the cohesive devices more frequently than did poor writers.
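
As a purely illustrative aside, a crude, automated stand-in for the cohesion density examined in these studies can be obtained by counting explicit conjunctive markers per sentence. The marker list, function name, and sample text below are my own invention; Halliday and Hasan's (1976) full coding of reference, substitution, ellipsis, conjunction, and lexical cohesion requires human judgment that this sketch does not attempt.

```python
import re

# A small, illustrative list of explicit conjunctive markers (far from exhaustive).
CONJUNCTIVE_MARKERS = {"however", "therefore", "moreover", "furthermore",
                       "in addition", "for example", "in contrast", "thus"}

def conjunctive_tie_density(text: str) -> float:
    """Return explicit conjunctive ties per sentence, a rough proxy for the
    cohesion density compared across good and poor essays by Witte and Faigley (1981)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lowered = text.lower()
    ties = sum(lowered.count(marker) for marker in CONJUNCTIVE_MARKERS)
    return ties / len(sentences) if sentences else 0.0

essay = ("Students often struggle with cohesion. However, explicit markers help. "
         "Therefore, teachers model them. For example, they annotate sample texts.")
print(round(conjunctive_tie_density(essay), 2))  # 0.75 for this four-sentence sample
```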

The more complicated intersentential relationship, known as coherence, has been

examined using three major approaches. The first approach involves Vande Kopple‟s

(1985) different types of metadiscourse: text connectives, code glosses, illocution

markers, narrators, validity markers, attitude markers, and commentaries (for a more in-

depth discussion of metadiscourse types, see Vande Kopple, 1985). In a study comparing

good and poor ESL essays, Intaraprawat and Steffensen (1995) found that good essays contained all types of metadiscourse features more often than poor essays did, and good

writers also utilized a wider range of metadiscourse markers in their writing. Cross-

cultural differences in metadiscourse use have been examined by Crismore, Markkanen,

and Steffensen (1993), who suggested that Finnish students show a higher density of metadiscourse and use hedging devices more frequently than U.S. students.

The second approach focuses on how discourse topics develop through

sequences of sentences. Lautamatti (1978, 1987) developed a procedure called topical

structure analysis (TSA) to characterize the nature of coherence within texts and

identified three topical progressions: parallel, sequential, and extended parallel (for a

more in-depth discussion of TSA, see Lautamatti, 1978; 1987). In a study that applied

TSA to L1 writing, Witte (1983a) found that more proficient writers tended to use

parallel and extended parallel progressions more often than less proficient writers;

conversely, less proficient writers tended to use more sequential progression. On the

other hand, Schneider and Connor‟s (1990) L2 writing research study reported that high-

rated essays contained more sequential progression, while intermediate- and low-rated

essays contained more parallel progression. They suggested that the inconsistent findings might be attributed to different coding schemes and to the limited reliability information reported.

The final approach associated with coherence measures is topic-based analysis

(Watson Todd, 1998; Watson Todd, Thienpermpool, & Keyuravong, 2004). In searching

for an objective measure of coherence, Watson Todd (1998) and Watson Todd et al. (2004)

proposed topic-based analysis consisting of multiple procedures: (a) identifying key

concepts, (b) identifying relationships between key concepts, (c) linking relationships

into a hierarchy, (d) mapping discourse onto the hierarchy, and (e) identifying topics and

measuring coherence. Based upon an analysis of 28 written compositions, Watson Todd

et al. suggested that coherence evaluated using this methodology correlated closely with

coherence marks assigned by teachers.

The review of the discourse analysis approach suggests that most SLA studies

focus on (a) accuracy, (b) fluency, (c) complexity, (d) cohesion, and (e) coherence when conceptualizing L2 writing ability. However, a careful examination of this approach

indicates that discourse analysis does not address such non-linguistic aspects of L2

writing as content relevance, effectiveness, originality, or creativity. This method is

therefore rather limited in its capacity to explain all of the factors that could affect L2

writing competence. As Péry-Woodley (1991) noted, “researchers became much more

cautious not to establish over-simplistic links between surface features of texts and

language development outside of discourse considerations, and adopted a more skeptical

and critical stance toward such [notions] as maturity and complexity” (pp. 73-74).

Caution should also be used when objective measures are utilized in assessment.

As Ishikawa (1995) and Perkins (1983) argued, discourse analysis can be both time-

consuming and inefficient, particularly for classroom assessments. Even when teachers

take the time necessary to objectively measure their students‟ writing, it is not always

clear how the results can help students understand their own L2 writing ability. What

does a high score on “words in T-units” mean practically? What is the importance of a

high score on “third-degree errors” versus a low score on “lexical accuracy index”?

Should specific scores on such measures be an instructional goal? The granularity of

these measures is too small to be used in a realistic instructional setting, and profiling L2

ability in this way would therefore not be beneficial for teachers and students. As

Alderson (2005) pointed out, a discourse analysis approach focusing on a narrowly-

defined aspect of grammatical or morphological rules might not be the best way of

diagnosing L2 writing ability.

Rater Perceptions and Rating Scales

Rater perceptions on L2 writing.

Another way of examining L2 writing ability is by looking at rater perceptions

and rating scales. Most studies in this line of research utilized think-aloud verbal

protocols to determine rater scoring behaviors or processes, to empirically explore the

assessment criteria that they use, or to verify the accuracy of existing rating scales

(Connor & Carrell, 1993; Cumming, 1990; Cumming, Kantor, & Powers, 2001, 2002;

Lumley, 2002, 2005; Milanovic, Saville, & Shuhong, 1996; Sakyi, 2000; Smith, 2000;

Vaughan, 1991). Cumming (1990) identified 28 decision-making and assessment criteria

used by experienced assessors to evaluate L2 written compositions. These were

categorized into four foci (self-control, content, language, and organization) and two

strategies (interpretation and judgment). Each focus contained subcriteria further

specifying rater evaluation behaviors or criteria. For example, a focus on language was

broken down into (a) classifying errors, (b) editing phrases, (c) establishing level of

comprehensibility, (d) establishing error frequency, (e) establishing command of

syntactic complexity, (f) establishing appropriateness of lexis, and (g) rating overall

language use. Similar criteria were found in the new TOEFL. Cumming et al. (2001,

2002) documented 27 decision-making processes exhibited by experienced writing

assessors on ESL/EFL compositions; these were further characterized by three foci (self-

monitoring, rhetorical and ideational, and language) and two strategies (interpretation

and judgment).

Research interest has also been directed toward the ways in which evaluation

criteria for a rating scale could interact with rater perceptions and judgments. In a pivotal

study by Vaughan (1991), nine raters verbalized their thinking while rating essays on a six-point holistic scale. Raters' comments were categorized into 14

general evaluation criteria, and the six most frequently mentioned assessment elements

were identified as (a) quality of content, (b) legibility of handwriting, (c) tense/verb

problem, (d) punctuation/capitalization error, (e) quality of introduction, and (f)

morphology/word form error. Of these, Vaughan found that raters most frequently

focused on content problems.

In a large-scale EFL testing context involving two Cambridge examinations

(First Certificate in English [FCE] and Certificate of Proficiency in English [CPE]),

Milanovic, Saville, and Shuhong (1996) asked 16 raters from diverse backgrounds to

report the evaluation components that they focused on when assessing EFL writing. A

wide range of elements were identified, including (a) length, (b) legibility, (c) grammar,

(d) structure, (e) communicative effectiveness, (f) tone, (g) vocabulary, (h) spelling, (i)

content, (j) task realization, and (k) punctuation. They also found that raters focused

more on vocabulary and content in high-level essays, and on communicative

effectiveness and task realization in intermediate-level essays.

Similar findings were reported by Smith (2000), who examined the ways in

which raters interpret and apply evaluation criteria in the Certificates in Spoken and

Written English (CSWE). Based upon six raters‟ verbal accounts, nine textual features

were identified that described the examinees‟ writing performance: (a) grammar, (b)

organization, (c) cohesion, (d) sentence structure, (e) punctuation/capitalization, (f)

spelling, (g) handwriting, (h) length of text, and (i) lexical choice. Conversely, the study

by Sakyi (2000) sought more global assessment criteria. Six raters were asked to describe

their rating processes using a five-point scale, with their comments categorized as

focusing on (a) content and organization, (b) grammatical and mechanical errors, and (c)

sentence structure and vocabulary.

In a more recent study, Lumley (2002) examined the ways in which four

experienced raters applied a rating scale on L2 written compositions. The scale provided

to the raters was developed for the writing subtest of the Special Test of English

Proficiency (STEP), and had four evaluation criteria: (a) task fulfillment and

appropriateness, (b) conventions of presentation, (c) cohesion and organization, and (d)

grammatical control. The findings indicated that even though the scale content seemed to

accurately reflect what raters pay attention to, there were conflicts among the descriptors

within the same criteria at the same level. The raters also focused on two additional

evaluation criteria (quantity of ideas and explicit cohesive devices) that were not

included in the STEP rating scale.

The analysis of rater perceptions indicates that there is some consensus on which

aspects of L2 writing ability should be assessed. Three elements, (a) content, (b) language use, and (c) organization, were mentioned consistently. While the substance of the L2 writing construct captured by these elements is essentially the same, the terms used to refer to them differ. For example, when written content was the focus of raters'

assessments, it was called quality of content (Vaughan, 1991), quantity of ideas (Lumley,

2002), or task realization (Milanovic et al., 1996). Organization was also variably

referred to as structure (Milanovic et al., 1996) and use of explicit cohesive devices

(Lumley, 2002). Language use showed the widest range of granularity.

It generally included grammatical, lexical, and mechanical features, but the grain size of

the features differed drastically. For example, Vaughan (1991) was more specific than

Smith (2000), breaking grammatical errors into smaller units such as tense and verb

problems. It is interesting to note that raters also paid attention to legibility of

handwriting, which would seem to be a construct-irrelevant factor of L2 writing.

Rating scales in L2 academic writing.

The construct of L2 writing can also be approached by examining existing rating

scales. Rating scales represent the underlying construct of a test and help raters to focus

on the skills or abilities intended to be assessed (Luoma, 2004; McNamara, 1996; Weigle,

2002). A content analysis of existing rating scales should therefore provide a good basis

for understanding the multi-faceted and complicated construct of L2 writing.

In a large-scale testing setting, the TOEFL is perhaps the best-known of all ESL

academic tests. It assesses the writing ability required in an academic setting, while its

rating scale scores the overall quality of the writing based on (a) development, (b)

organization, and (c) appropriate and precise use of grammar and vocabulary

(Educational Testing Service, 2007).3 Another well-known ESL test is the International

English Language Testing System (IELTS), in which academic writing tasks are scored

based on (a) task achievement, (b) coherence and cohesion, (c) lexical resource, and (d)

grammatical range and accuracy (University of Cambridge, British Council, & IELTS

Australia, 2007). The Michigan English Language Assessment Battery (MELAB) has

similar evaluation criteria: (a) clarity and overall effectiveness, (b) topic development, (c)

organization, and (d) the range, accuracy, and appropriateness of grammar and

vocabulary (University of Michigan, 2003).

In a classroom assessment context, the rating scale created by Jacobs, Zinkgraf,

Wormuth, Hartfiel, and Hughey (1981) might be the best known and most comprehensive. It

evaluates ESL written compositions based on (a) content, (b) organization, (c) vocabulary,

(d) language use, and (e) mechanics. The unique characteristic of this rating scale is that

each major criterion has fine-grained subcriteria; for example, effectiveness of language

use is assessed by the elements associated with syntactic structure, errors of agreement,

tense, number, word order/function, articles, pronouns, and prepositions.4

Most rating scales thus appear to have similar evaluation criteria (i.e., content,

language use, organization), but slightly different grain sizes. Different rating scales

might tap fundamentally the same underlying construct of L2 academic writing, but with

different levels of specificity. For example, the TOEFL rating scale assesses language-

specific factors using one general criterion, while the IELTS and Jacobs et al. use finer-

grained criteria such as (a) vocabulary, (b) grammatical range and accuracy, and (c)

mechanics. Different wordings or terminologies may also be used to describe evaluation

features that essentially refer to the same component. While the IELTS defines written

structure as coherence and cohesion, the TOEFL and MELAB define it as development

and organization.

3 Although the new TOEFL contains two types of writing tasks (integrated and independent), it is the rating scale for independent writing tasks that is discussed in this section. A discussion of the integrated writing tasks is beyond the scope of this thesis.

4 In Jacobs et al.'s (1981) scale, language use focuses primarily on the use of grammatical knowledge in written text.

Summary

Efforts to explain the construct of L2 writing have been made based upon

theoretical accounts, discourse analysis, and rater perceptions and rating scales. These

three approaches are synthesized in Table 1. Although the construct elements could be broken into even smaller units, this was not done, in order to enhance comparability across the different approaches. The construct elements are therefore presented in a broad scheme, even though their granularity differs widely across approaches.

Grabe and Kaplan‟s (1996) theoretical taxonomy is unique in that it focuses not

only on linguistic discourse skills and strategies, but also on sociolinguistic aspects of

writing ability. Detailed accounts are given for each skill and strategy category in their

taxonomy. In discourse analysis approaches, however, L2 writing ability is determined

based upon (a) accuracy, (b) fluency, (c) complexity, (d) coherence, and (e) cohesion.

Such objective measures include errors per T-unit, words per T-unit, clauses per T-unit,

the number of discourse markers, and so on. The primary limitation of the discourse

analytic method is that it cannot conceptualize such non-linguistic aspects of L2 writing

as content relevance, originality, or creativity. The grain size of these measures is also too

small to be used in a realistic instructional setting.

From an assessment perspective, the underlying construct of L2 writing is organized around rater perceptions and rating scales, and these two approaches seem more useful than the others in an assessment context. Although some variation exists in the specificity of evaluation criteria, they tap the same fundamental components of L2 writing ability: (a) content, (b)

language use, and (c) organization. It is noteworthy that most large-scale, institutional

rating scales (e.g., TOEFL and IELTS) do not consider text length and handwriting as

critical evaluation criteria, while raters often do (e.g., Milanovic et al., 1996; Smith, 2000;

Vaughan, 1991). This raises the interesting question of whether text length and

handwriting should be considered construct-relevant factors.

Despite different theoretical orientations, the three approaches provide

convergent evidence as to how the construct of L2 writing is defined and operationalized,

assessing content, organization, and language use (vocabulary, grammar, and mechanics)

as a common denominator. However, it should be noted that the nature of L2 writing is

malleable rather than fixed, and embodied within a specific context. Defining a construct

without considering its contextual variables would be useless. As Cumming et al. (2000)

insightfully noted:

Although educators around the world regularly work with implicit

understandings of what constitutes effective English writing, no existing research

or testing programs have proposed or verified a specific model of this, such as

would be universally accepted. Indeed, current ESL/EFL writing tests operate

with generic rating scales that can reliably guide the scoring of compositions but

which fail to define the exact attributes of examinees‟ texts or the precise basis

on which they vary from one another [italics added]. (p. 27)

The next section will continue the discussion of theoretical and empirical issues

associated with rating scales and their development.

Table 1

Synthesis of Writing Construct Elements

[Table 1 cross-tabulates the construct elements (sociolinguistic knowledge, content, language use, vocabulary, grammar, mechanics, organization, text length, and handwriting) against the sources reviewed above: Grabe and Kaplan (1996), the discourse analytic method, Cumming et al. (2002), Vaughan (1991), Milanovic et al. (1996), Smith (2000), Sakyi (2000), Lumley (2002), the TOEFL (2007), the IELTS (2007), the MELAB (2003), and Jacobs et al. (1981). The cell entries indicating which sources address which elements did not survive transcription.]

Demystifying Rating Scales

Rating Scale Types

Rating scales have long been used in performance assessment, and are widely

considered to be useful tools for judging language performance. Rating scales provide

common metric systems or standards that enable comparisons across different languages

and contexts (Bachman & Savignon, 1986). Additionally, as many researchers have

suggested (e.g., Luoma, 2004; McNamara, 1996; Weigle, 2002), they function as a

blueprint that specifies what skills or abilities should be assessed, and further represent

the underlying construct that the test aims to assess. According to Davies, Brown, Elder,

Hill, Lumley, and McNamara (1999), a rating scale can be defined as follows:

A scale for the description of language proficiency consisting of a series of

constructed levels against which a language learner‟s performance is judged.

Like a test, a proficiency (rating) scale provides an operational definition of a

linguistic construct such as proficiency. Typically such scales range from zero

mastery through to an end-point representing the well-educated native speaker.

The levels or bands are commonly characterized in terms of what subjects can do

with the language (tasks and functions which can be performed) and their

mastery of linguistic features (such as vocabulary, syntax, fluency and cohesion).

(pp. 153-154)

As this indicates, rating scales are typically expressed in numerical values or descriptive

statements to assess performance on a particular task. In order for such scores to be

meaningful, rating scales should be associated not only with the language constructs to be assessed but also with the purposes and audiences for the assessment within a specific

context (Alderson, 1991; Luoma, 2004).

Rating scales can be classified in a variety of ways. Alderson (1991) divided

them into three types, according to purpose, as user-oriented, assessor-oriented, and

constructor-oriented. A user-oriented scale allows those who are interested in using the

ratings (school or job applicants, admission officers, and so on) to interpret the meanings

of the reported ratings, while an assessor-oriented scale is developed to guide assessors‟

rating processes by specifying the ways in which performance features should be rated. A

constructor-oriented scale, on the other hand, provides test constructors with the guiding

specifications a test should contain. Luoma (2004) has similar rating scale classifications:

rater-oriented, examinee-oriented, and administrator-oriented. A rater-oriented scale helps

raters to make decisions, while an examinee-oriented scale provides performance

information about examinees‟ strengths and weaknesses. Finally, an administrator-

oriented scale provides concise overall performance information.

Brindley (1998) takes a slightly different view, distinguishing between behavior-

based and theory-derived rating scales. A behavior-based scale describes features of

language use within a specific context, whereas a theory-derived scale describes

language ability without dependence on specific content and context. This classification

is conceptually consistent with what Bachman (1990) calls “real-life” and “interactive-

ability” approaches (p. 325). Within Bachman‟s framework, a real-life scale would view

language ability as a unitary concept, not distinguishing the ability to be assessed from

the characteristics of the context in which language performance is elicited. An

interactive-ability scale views language ability as a multi-componential construct, measuring it without reference to particular contextual features

(Bachman, 1990).

Rating scales are also varied in terms of scoring approaches. Cooper (1977)

differentiated between holistic and analytic evaluation, stating that holistic evaluation

refers to “any procedure which stops short of enumerating linguistic, rhetorical, or

informational features of a piece of writing” (p. 4). Analytic evaluation, on the other hand,

involves counting and tallying occurrences of particular linguistic features. He also

characterized holistic evaluation as a quick and impressionistic procedure for placing,

scoring, or grading written texts, and proposed several types of holistic evaluation,

including dichotomous scale, primary trait scoring, and general impression marking.

Weigle (2002) classifies different types of rating scales on the basis of generalizability

and the use of single or multiple scores; holistic and analytic rating scales are intended to

be generalized across writing tasks, but differ in whether they provide a single score or multiple scores. A primary trait rating scale, on the other hand, yields a single score by

focusing on one very specific writing feature.

Holistic rating scales, also known as global or impressionistic rating scales,

assume that language ability is a single unitary ability (Bachman & Palmer, 1996) and

that a score for the whole is not equal to the sum of separate scores for the parts (Goulden,

1992). When using a holistic rating scale to assess writing performance, raters usually

take note of various aspects of a written text simultaneously, assigning a single score that

best reflects their general impression of that text. Holistic rating scales in L2 writing

include the American Council of the Teaching of Foreign Languages (ACTFL)

Proficiency Guidelines (American Council of the Teaching of Foreign Languages

[ACTFL], 2001), the TOEFL rating scale (ETS, 2007), the IELTS rating scale

(University of Cambridge, British Council, & IELTS Australia, 2007), and the MELAB

rating scale (University of Michigan, 2003). First published in 1986, the ACTFL Writing

Proficiency Guidelines define and measure a learner‟s functional writing competence

using a nine-point rating scale that ranges from Novice to Superior. The Guidelines

describe positive rather than negative aspects of writing for each level, focusing on the

kinds of tasks writers can do with their respective writing proficiency. The nature of the

TOEFL writing scale is slightly different from that of the ACTFL Guidelines. The

TOEFL writing subtest is intended to assess L2 academic writing ability in a large-scale

assessment setting, and contains six holistic score levels. Overall quality of writing is

assessed by comprehensively taking a variety of aspects such as content, language use,

and organization into account. Similarly, the IELTS and the MELAB writing scales

produce a single score based on the overall quality of written compositions, and contain 9

and 10 score bands, respectively.

White (1985) may be one of the best-known proponents of holistic rating scales

(Hamp-Lyons, 1991; Weigle, 2002). As he argued, holistic rating scales have several

advantages that other types of rating scales do not. Specifically, they are an economical

and practical means of scoring in that raters usually read a text just once (rather than

several times) and provide a single score (rather than multiple scores) in one minute or

less (Hamp-Lyons, 1991). This speedy rating process is certainly a benefit for raters and

testing agencies interested in saving time and money. Holistic rating scales also allow

raters to focus on the strengths of writing samples rather than their weaknesses, enabling

writers to be evaluated according to what they have done well (Weigle, 2002; White,

1985). Finally, holistic rating scales represent a humanistic approach to understanding the

authentic nature of writing versus “analytic reductionism” (White, 1985, p. 33).

Therefore, a holistic approach makes it possible to appreciate writing as a unified and

central human activity, not as segments split into detached activities (White, 1985).

Despite its apparent advantages, holistic rating scales have been severely

criticized for several reasons. One major weakness is the inability to supply diagnostic

information about writers‟ strengths and weaknesses beyond a relative rank-ordering

(Charney, 1984; Davies et al., 1999; Hamp-Lyons, 1991; Luoma, 2004; Weigle, 2002;

White, 1985). A given writer might be good at producing grammatically correct

sentences, but not at developing content or organizing thesis sentences. In a case like this,

a single score placing a writer‟s performance at a typical level of ability cannot

accurately identify what he or she has done well or poorly. The inability to accurately

illustrate the multi-faceted nature of writing is more problematic in an L2 context, since

fine-grained diagnostic feedback is necessary for L2 writers‟ interlingual development

(Hamp-Lyons, 1991; Weigle, 2002). As Hamp-Lyons (1995) rightly argued, “a holistic

scoring system is a closed system, offering no windows through which teachers can look

in and no access points through which researchers can enter” (pp. 760-761).

Another criticism of holistic ratings lies in the difficulty of matching writing

texts to appropriate levels on a scale (Bachman & Palmer, 1996). It is often the case that not all the evaluation criteria of a holistic scale are met concurrently, so that a rater must

(whether consciously or unconsciously) prioritize some criteria over others. For example,

when a writing sample matches up with the Level 2 descriptors of a rating scale in terms

of content development but not language use, raters must make a decision about the level

at which the text should be matched. The possibility that raters explicitly or implicitly

weigh particular features of writing is unavoidable, making interpretation of scores even

more difficult (Bachman & Palmer, 1996; Goulden, 1994).

By contrast, analytic rating scales assume that the sum of the separate scores

awarded to subcomponents of writing is equal to a single score awarded for the written

piece as a whole (Goulden, 1992). In an analytic scoring scheme, raters take note of

several aspects of writing and produce multiple ratings or subscores, which are then

weighted according to theoretical considerations or the test developer's specifications. They can also be aggregated into a composite score, depending on the test purpose. For

example, Jacobs et al.‟s (1981) ESL composition profile describes five aspects of writing

ability (content, organization, vocabulary, language use, and mechanics) and gives them

different weights (30, 20, 20, 25 and 5 points, respectively). Subscores might be more

useful for students and teachers who want to identify the strengths and weaknesses in the

writing, whereas a composite score would be more useful for admission officers or

employment committees who simply wish to select a candidate with competent writing

skills.
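
To illustrate the arithmetic of such a weighted analytic scheme, the minimal sketch below takes the point allocations of Jacobs et al.'s (1981) profile as cited above (content 30, organization 20, vocabulary 20, language use 25, and mechanics 5) and sums a set of subscores into a composite out of 100. The subscore values and function are invented for illustration and simplify the published scoring procedure.

```python
# Maximum points allotted to each criterion in Jacobs et al.'s (1981) ESL
# Composition Profile, as cited in the text above.
MAX_POINTS = {"content": 30, "organization": 20, "vocabulary": 20,
              "language_use": 25, "mechanics": 5}

def composite_score(subscores: dict) -> int:
    """Sum analytic subscores into a single composite out of 100, checking
    that each subscore stays within its allotted maximum."""
    for criterion, score in subscores.items():
        if score > MAX_POINTS[criterion]:
            raise ValueError(f"{criterion} exceeds its maximum of {MAX_POINTS[criterion]}")
    return sum(subscores.values())

# A hypothetical learner profile: stronger language use than content development
profile = {"content": 20, "organization": 15, "vocabulary": 16,
           "language_use": 22, "mechanics": 4}
print(composite_score(profile))  # 77 -- the composite, useful for selection decisions
print(profile)                   # the subscore profile, useful for diagnosis
```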

The use of analytic rating scales in written language assessment has several

practical advantages. First, the ratings assigned to each component can be used to

diagnose the relative strengths and weaknesses of written texts (Bachman & Palmer,

1996; Hamp-Lyons, 1991; Hamp-Lyons & Henning, 1991; Weigle, 2002). Whereas a

single, holistic score can obscure variations in writing ability, analytic ratings make them

visible, enabling the construction of writing profiles (Hamp-Lyons & Henning, 1991).

Profile scores are particularly helpful for L2 writers, who are more likely than their L1

counterparts to show an uneven or marked profile across different areas of writing ability

(Hamp-Lyons, 1991; Weigle, 2002). Another advantage of analytic rating scales is their

reliability (Hamp-Lyons, 1991; Hamp-Lyons & Henning, 1991; Huot, 1996; Weigle,

2002); unlike holistic ratings, analytic rating schemes award multiple scores to a single

written text, which enhances reliability. Finally, analytic rating scales can better

represent raters‟ cognitive processes (Bachman & Palmer, 1996). According to Bachman

and Palmer (1996), raters tend to consider such individual components as grammar,

content and vocabulary even when they are asked to sort written texts according to

overall quality. This behavior supports the use of analytic rating scales because they

reflect the underlying construct of L2 writing.

Although analytic rating scales are favored by many L2 writing experts, it should

be noted that they are time-consuming and expensive (Perkins, 1983; Weigle, 2002;

White, 1985). Because raters are required to make several judgments based on the

criteria specified on the scale, assessments can take longer than ratings that use a holistic

scale. Davies et al. (1999) pointed out that focusing on each separate component of

writing ability can also distract raters from the overall quality of writing samples. From a

theoretical point of view, White (1985) also questioned whether “writing quality is the

result of the accumulation of a series of subskills,” arguing that “writing remains more

than the sum of its parts and that the analytic theory that seeks to define and add up the

subskills is fundamentally flawed” (p. 123).

Unlike holistic and analytic rating scales, a primary trait rating scale is used to

assess an important writing trait that is required to accomplish a certain writing task

(Lloyd-Jones, 1977). This type of rating scale assumes that writing should be assessed

within a specific context, and that different rating scales should be developed for every

writing task and prompt. A typical primary trait rating scale consists of (a) the

exercise/task, (b) a statement of the primary rhetorical trait to be elicited by the exercise,

(c) a hypothesis about performance on the exercise, (d) a statement of the relationship

between the exercise and the targeted primary trait, (e) a scoring guide, (f) writing

samples, and (g) a justification of scores (Lloyd-Jones, 1977). Lloyd-Jones (1977) argued

that while primary trait rating scales were originally developed to score essays from the

National Assessment of Educational Progress (NAEP), their basic principles can be

applied in other contexts. For example, they can be used to develop summative

assessment of students‟ writing ability, to make curriculum evaluations, or to provide

specific feedback on a particular writing task. As its methodologies imply, the advantage

of a primary trait rating scale is in classroom instruction and assessment (White, 1985).

Teachers can focus on one narrowly-specified feature at a time rather than on all

characteristics of a writing sample, while students can receive a detailed and precise

description of that specific feature. A greater advantage of this scale is that it can

contribute to curriculum development and evaluations (Cooper, 1977; Lloyd-Jones, 1977;

Perkins, 1983; White, 1985). Teachers can adjust their curricula based upon the

information gathered from the primary trait assessment, rendering both teaching and

learning more effective. The direct connection with classroom instruction will also make

a diagnostic approach to writing more likely and valuable.

Nonetheless, even these advantages are diminished when the development

process for primary trait rating scales is taken into account. Lloyd-Jones (1977) reports

that developing such a scale requires not only a substantial theoretical background in

rhetoric but also a great deal of time: preparation of just one exercise can take from 60 to

80 hours. Rating scales also need to be developed anew for each new writing task or

prompt. For these reasons, primary trait rating scales are mostly used for research

purposes or in a large-scale test context such as NAEP (Hamp-Lyons, 1991). Little

information is therefore available on how to apply such scales to L2 writing assessment

(Weigle, 2002).

The review of rating scales suggests that rating scale types are associated not only with the language constructs to be assessed but also with the purposes and audiences for

the assessment within a specific context. Scoring methods also determine the type of

rating scale; holistic rating scales are efficient at placing writers into different proficiency

levels, whereas analytic and primary trait rating scales are better at identifying writers‟

strengths and weaknesses. Analytic rating scales are particularly useful for assessing

multiple facets of L2 writing, but cannot provide more fine-grained diagnostic

description beyond subscores across evaluation criteria. In this regard, primary trait

rating scales might be best suited for diagnostic purposes because they provide a detailed

and precise description of a narrowly-specified writing feature within a specific context.

However, their laborious developmental process is a critical problem. This suggests that a

new assessment approach is needed that can maximize the diagnostic gains from which

student-writers benefit. If such an approach requires assessment protocols that differ from those of conventional rating scales, these protocols should be further developed and investigated on both theoretical and empirical grounds.

Problems with Rating Scales

In spite of the extensive use of rating scales in the area of L2 assessment and

testing, surprisingly little is known about their theoretical and empirical underpinnings.

Accountability is also questionable, as most rating scales originate in committee-

produced systems, and little information about their development procedures is publicly available. Serious problems are inherent in these rating scales, and a significant amount

of criticism and concern has focused particularly on the ACTFL Guidelines. Although the

ACTFL Guidelines and their predecessors and successors (e.g., the Foreign Service

Institute [FSI], Interagency Language Roundtable [ILR], and Australian Second

Language Proficiency Ratings [ASLPR] scales) have exerted a great deal of influence on

language instruction and assessment for several decades, criticisms are unavoidable.

As many researchers have pointed out, the most serious problem is that it is not

always clear how rating scale descriptors were created (or assembled) and calibrated (e.g.,

Brindley, 1998; Chalhoub-Deville, 1997; de Jong, 1988; Lantolf & Frawley, 1985; North,

1993; Pienemann, Johnson, & Brindley, 1988; Upshur & Turner, 1995). Although

proponents of the ACTFL Guidelines explicitly argued that they were developed

experientially “by observing how second language learners progress in the functions they

can express and comprehend” (Liskin-Gasparro, 1984, p. 37; see this work for a comprehensive historical review of the ACTFL Guidelines), and “based largely on many years of

observation and testing in both the government context and in the academic community”

(Omaggio Hadley, 1993, p. 21), empirical evidence is scant. No information is available

on the ways in which the observational data were collected and analyzed to be

incorporated into the scale descriptors. The theoretical foundations are also shaky;

ironically, most rating scales have not been built on theoretical models of language

development, language ability, or communicative competence, despite their developers' claims. In contrast

to Ingram‟s claim (1984, p. 7; as cited in Brindley, 1998, p. 117) that the development of

the ASLPR drew on “psycholinguistic studies of second language development and

the intuitions of many years of experience teaching,” Brindley (1998) argues that neither

specific psycholinguistic studies nor SLA theories were taken into account in the

development process of the ASLPR. This lack of theoretical and empirical grounding is the most serious weakness of these rating scales, and it is why they have come to be called intuition-based rating scales or a priori rating scales.5

The way in which the difficulty hierarchy of linguistic criteria was determined in

these rating scales is also questionable (Bernhardt, 1984; Brindley, 1998; Chalhoub-

Deville, 1997; de Jong, 1988; Lantolf & Frawley, 1985; Lee & Musumeci, 1988; North,

1993; Turner & Upshur, 2002). North (1993) pointed out the problem of “allocating

„key features‟ to levels without a principled basis, tapping into convention and clichés

among teachers and textbook and scale writers” (p. 5). A similar criticism was echoed by

Lantolf and Frawley (1985) and de Jong (1988), who raised questions about the grounds

on which persuading or detecting emotional overtones should be considered more

advanced than problem-solving or getting some main ideas in the ACTFL Guidelines.

Empirical findings about the validity of the difficulty levels have been mixed; although

5 According to Fulcher (1996b, 2003), an intuition-based or a priori method means developing rating

scales based on experts‟ (e.g., experienced teachers, language testers, or language testing specialists in

examination boards) intuitive judgments on the development of language proficiency. Existing rating scales,

a teaching syllabus, or a needs analysis are often consulted in the development process.

Dandonoli and Henning (1990) found an adequate linear relationship between examinees‟

ability levels and the task difficulty level as put forth in the ACTFL Guidelines, Lee and

Musumeci‟s (1988) study failed to identify such a difficulty hierarchy. In reviewing

Dandonoli and Henning‟s study, Fulcher (1996a) claimed that their arguments for the

ACTFL Guidelines could not be validated due to problematic research designs and

analytical methods.

Another criticism is that the descriptors embedded in these rating scales are

implicitly norm-referenced rather than criterion-referenced for two reasons. First, these

rating scales evaluate L2 mastery against the mastery of well-educated native speakers

(Bachman & Savignon, 1986; Chalhoub-Deville, 1997; Fulcher, 1987, 1997; Lantolf &

Frawley, 1985). Indeed, Bachman and Savignon (1986) and Lantolf and Frawley (1985)

expressed doubts as to whether “the” native speaker can actually exist: one language can

contain a myriad of dialects, registers, and vocabularies, so identifying features that

define a single, homogeneous group of native speakers is extremely difficult and

problematic. Second, descriptors on most scales are written in terms relative to their adjacent descriptors. The level of performance is thus gauged by quantifiers (e.g., some,

many, a few, few) and quality indicators (e.g., satisfactorily, effectively, well) so that one

level of performance cannot be interpreted without dependence on the adjacent levels

(Alderson, 1991; Luoma, 2004; Matthews, 1990; North, 1993, 1996; North & Schneider,

1998; Turner & Upshur, 2002; Underhill, 1987). This interdependence makes it even

more difficult for descriptors to function as stand-alone criteria.

Monotonicity is another source of weakness in these rating scales (Fulcher,

1996b; Turner & Upshur, 1996, 2002). According to Turner and Upshur (1996, 2002),

typical rating scales assume monotonicity, but, in most cases, empirical ratings are based

on multiple descriptors that are present across different levels. Mapping qualitatively

different multidimensional descriptors onto unidimensional metric scales can influence

raters' decisions and undermine consistency (Turner & Upshur, 1996). Along similar lines, Alderson (1991) and others (i.e., Matthews, 1990; Upshur & Turner, 1995) pointed

out that some rating scales include descriptors associated with abilities that are not

tapped by a test. According to Alderson, the mismatch between descriptors and content

arose during the English Language Testing Service (ELTS) Revision Project. If a test


contains only one type of text, then assessing a learner‟s ability to understand a wide

range of texts is meaningless.

From the perspective of assessors, Matthews (1990) discussed problems with several international EFL rating scales: the Royal Society of Arts Examination in the

Communicative Use of English as a Foreign Language (CUEFL); the Cambridge First

Certificate in English (FCE); the Certificate of Proficiency in English (CPE); the English

Language Testing Service (ELTS); and the International General Certificate of Secondary

Education (IGCSE). She argued that the evaluation criteria are sometimes arbitrary (in

the case of the ELTS Non-Academic part), that allocating equal weight across these evaluation

criteria may be unreasonable, and that such criteria were often not clearly defined,

leading to ambiguity.

In summary, these well-founded criticisms associated with many existing rating

scales can be attributed primarily to intuitive or a priori methods of scale development. A

lack of empirical grounds keeps scale developers and assessors from knowing which

elements should be assessed, resulting in low reliability and validity. This

problem becomes far more serious when a scale is used for diagnostic purposes. The

identification of specific assessment elements is regarded as the most important

procedure in implementing diagnostic assessment because these elements form the basis

of detailed skill profiles. These criticisms will remain until different paradigms and

approaches are applied to the development process. As Brindley (1998) stated:

Rather than continuing to proliferate scales which use generalized and

empirically unsubstantiated descriptors, therefore, it would perhaps be more

profitable to draw on SLA and LT research to develop more specific empirically

derived and diagnostically oriented scales [italics added] of task performance

which are relevant to particular purposes of language use in particular contexts

and to investigate the extent to which performance on these tasks taps common

components of competence. (p. 134)

The next section will take up Brindley‟s call for empirically-derived scales, and a few

exemplary works will be reviewed in the context of L2 assessment.

Empirically-Based Rating Scales

Empirically-based rating scales have been proposed for language assessment in

response to criticisms of existing scales; the three best-known are the data-driven fluency


rating scale (Fulcher, 1987, 1993, 1996b, 1997), the empirically-derived, binary-choice,

boundary-definition (EBB) rating scale (Turner, 2000; Turner & Upshur, 1996, 2002;

Upshur & Turner, 1995, 1999), and the Common European Framework of Reference for

Languages (CEFR) scale (North, 1996, 2000; North & Schneider, 1998). Each rating scale provides valuable insights into scale development, although none was developed for L2 writing assessment per se. While Fulcher's data-driven fluency scale

demonstrates that discourse analysis can help to create scale descriptors for L2 oral

performance, Turner and Upshur‟s EBB scale illustrates the effectiveness of a series of

empirical yes/no criteria questions gleaned from actual performances. North‟s descriptor

scaling method in the CEFR also demonstrates that a combination of theoretical and

empirical approaches is useful in developing the framework of reference in which L2

performance levels are determined.

Data-driven fluency rating scale.

In order to define and measure the fluency of L2 learners, Fulcher (1987, 1993,

1996b, 1997) proposed a data-based or data-driven fluency rating scale based on

observations of oral performance. His data-based approach is built on the claim that

observed learners‟ performance should be quantifiable, and that the development

procedures of rating scales should reflect real linguistic performance. In contrast to a

priori methods, this data-based procedure utilized a large database of speech samples,

which were then used to create fluency rating descriptors.

Fulcher (1993, 1996b) collected 21 ELTS oral interviews with scores ranging

from 4 to 9 (average of 6). He identified eight categories using grounded theory (Glaser

& Strauss, 1967) to account for breakdowns in fluency, and counted observations of

these categories in each oral interview. These categories were (a) end-of-turn pauses, (b)

content planning hesitation, (c) grammatical planning hesitation, (d) addition of examples,

counterexamples or reasons to support a point of view, (e) expressing lexical uncertainty,

(f) grammatical and/or lexical repair, (g) expressing propositional uncertainty, and (h)

misunderstanding or breakdown in communication. A discriminant analysis was then

used to examine the extent to which the eight explanatory categories discriminated

among L2 learners, and the extent to which the awarded scores (i.e., frequencies) on the


eight explanatory categories predicted the scores awarded on the ELTS oral interview

scale. The results suggested that the eight explanatory categories discriminated well among L2 learners, and that they were consistent with the ELTS oral interview scale in assigning L2 learners to appropriate scale scores. Fulcher (1996b) considered these results positive because they indicated that this approach attained concurrent validity

before a rating scale was constructed, thus avoiding the weaknesses of post hoc

validation methods. The fluency descriptor scale was finally constructed by scaling the

means of each category score across all ELTS oral interview levels, and by eliciting the

salient characteristics of speech samples that defined the explanatory categories.
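To make the procedure concrete, the following minimal Python sketch mirrors the kind of discriminant analysis described above: frequency counts of the eight explanatory categories are used to predict ELTS band scores. The counts and band scores below are invented purely for illustration and are not Fulcher's data.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Rows = 21 hypothetical oral interviews; columns = counts of the eight
# explanatory categories (end-of-turn pauses, hesitations, repairs, etc.).
X = rng.poisson(lam=[6, 5, 4, 3, 3, 2, 2, 1], size=(21, 8))
# Hypothetical ELTS band scores (4-9) awarded to the same 21 interviews.
y = np.repeat([4, 5, 6, 7, 8, 9], [3, 4, 4, 4, 3, 3])

lda = LinearDiscriminantAnalysis().fit(X, y)   # fit the discriminant functions
print(lda.predict(X[:5]))                      # predicted bands for the first five interviews
print(lda.score(X, y))                         # proportion classified into the awarded band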

During the post hoc validation stage, Fulcher used a variety of statistical tests to

evaluate the five-point fluency rating scale: a G-study, ANOVA, and a Rasch partial

credit model. Applied to a new sample of students using five raters and three different

tasks (a picture description task, an interview based on reading text, and a group

discussion), the fluency scale achieved high reliability (reliability coefficient = 0.9, inter-

rater generalizability coefficient = 0.93, inter-task generalizability coefficient = 0.98).

The results from ANOVA also showed that the fluency rating scale was able to

discriminate among different learner group levels. Finally, while Fulcher failed to show

divergent validity evidence between scales, the scale calibration yielded by the Rasch

partial credit model showed that the fluency rating scale functioned as a stable

measurement instrument. These findings led Fulcher (1996b) to argue for two distinct

advantages to the data-based approach: target ability is defined in great detail so that

more accurate validation studies are made possible, and the descriptors are explicit enough

to be linked to real language performance.

Empirically-derived, Binary-choice, Boundary-definition (EBB) scale.

Proposed by Turner (2000), Turner and Upshur (1996, 2002) and Upshur and

Turner (1995, 1999), empirically-derived, binary-choice, boundary-definition (EBB)

rating scales are characterized as being free of theory. EBB rating scales are not constructed

on theoretical models of language ability or learning, but on samples of real oral or

written performance (Turner & Upshur, 1996; Upshur & Turner, 1995). Instead of

generalizing to other contexts, they are usually developed within a particular context and


with a particular task and learner group in mind.

EBB scale development involves six steps (see Figure 1): (1) Individual raters

select a minimum of eight to ten samples that represent the full range of test performance,

dividing them into two categories (an upper half and a lower half); (2) As a team, the

raters discuss their decisions and reconcile disagreements, then identify the most

prominent feature distinguishing the upper half from the lower half, which is used to

develop a yes/no question (i.e., Question 1 in Figure 1); (3) Individual raters then rank

the samples in the upper half, and as a team, compare these rankings to determine the

number of scale levels that will effectively distinguish between the upper half samples;

(4) The team next creates a series of yes/no questions (i.e., Questions 2 & 3 in Figure 1)

that define subscale levels within the upper half samples; (5) The raters repeat steps 3

and 4 for the lower half samples; and (6) The raters, as a team, create descriptors of each

level so that the scores can be better understood when they are awarded. As can be seen,

EBB scales differ from traditional scales in that they describe salient differences (rather

than similarities) in the boundaries between score levels, and do not focus on a midpoint

of “normative descriptions of ideal performances” (Turner & Upshur, 1996, p. 55).

Figure 1. A general procedure for EBB scale development. [The original figure shows a branching tree in which five yes/no questions (Questions 1-5) successively route a performance to one of six scale levels (Levels 1-6).]
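As an illustration only, the following minimal Python sketch shows how a six-level EBB scale of the kind depicted in Figure 1 could be operationalized as a sequence of yes/no boundary decisions. The particular wiring of questions to levels is an assumption made for this sketch, not Turner and Upshur's published scale.

def ebb_level(q1, q2, q3, q4, q5):
    """Route a performance to one of six levels using five yes/no boundary questions."""
    if q1:                          # Question 1 separates the upper half from the lower half
        if q2:
            return 6 if q3 else 5   # Question 3 demarcates the top two levels
        return 4
    if q4:
        return 3
    return 2 if q5 else 1

# Example: a performance that passes Questions 1 and 2 but fails Question 3 is placed at level 5.
print(ebb_level(q1=True, q2=True, q3=False, q4=False, q5=False))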

Upshur and Turner (1995) argue that the simplicity and clarity with which EBB

scales distinguish boundaries eliminates the problems inherent in scales with co-

occurring characteristics, minimizing different interpretations of scale descriptors and

enhancing reliability. The floor or ceiling effect is also reduced because raters do not


make assumptions about ability or other features that might not be present in the

performance but use empirical data as the starting point for scale development (Upshur &

Turner, 1995). Score interpretation is particularly meaningful within an educational

context, as scale criteria incorporate what teachers and students actually do in their

classes, embodying instructional and curricular goals (Turner, 2000; Turner & Upshur,

1996; Upshur & Turner, 1995).

Despite the many advantages of EBB scales, their inability to generalize across

different contexts has been criticized (Fulcher, 2003), as has their lack of theoretical

orientation. Indeed, Brindley (1998) suggests that scale development should be built on

empirical substantiation, “supplemented by further theoretically motivated research into

generalizable dimensions of task and text complexity” (p. 135). In addition, although

EBB scales take the defining characteristics of performance samples with different

proficiency levels into consideration when mapping language performance, it is unclear

whether teacher perceptions of skill hierarchy are psychometrically accurate. Research

has shown that teachers often fail to identify a hierarchy of task difficulty on a given test

(Alderson, 1990a, 1990b; Alderson & Lukmani, 1989). If teachers' perceptions of the task difficulty hierarchy represented in Figure 1 are correct, then Question 3 must assess higher-order

language skills than Questions 1, 2, 4, and 5. On the EBB rating scale for Audio Pal

(Turner & Upshur, 1996; Upshur & Turner, 1999), for example, teachers assumed that

the ability to speak fluently and use authentic idioms (the question demarcating level 6

from level 5) required a higher-order language skill than the ability to produce a variety of

sentence structures without making many linguistic errors. This assumption has not been

rigorously investigated, however, and warrants future research. If teachers‟ perceptions

do not converge with statistical results, valid inferences about students‟ language ability

cannot be made.

Common European Framework of Reference for Languages (CEFR).

In Europe, the most notable recent development in language education and

assessment is surely the advent of the Common European Framework of Reference for

Languages ([CEFR], Council of Europe, 2001) (Figueras, North, Takala, Verhelst, &

Avermaet, 2005). The CEFR grew out of a concerted effort by the Council of Europe and


the Swiss National Science Research Council to develop a framework of reference

wherein communicative language performances can be scaled using a common meta-

language (North, 2000; North & Schneider, 1998). The primary stated goal of the CEFR

was “to help partners to describe the levels of proficiency required by existing standards,

tests and examinations in order to facilitate comparisons between different systems of

qualifications” (Council of Europe, 2001, p. 21).

The pilot CEFR development projects took place in 1994-1995 in two phases.6

The first phase focused primarily on speaking and interaction competence in English, and

the second extended to cover noninteractive listening and reading competence in German

and French. Three steps were taken in each phase: (a) a comprehensive pool of

descriptors was created, (b) descriptors were qualitatively validated by consulting teacher

workshops, and (c) descriptors were scaled using teacher assessment and the many-

faceted Rasch model.

CEFR development began with a comprehensive survey of existing scales

describing language proficiency (see North, 1994). In the 1994 study, twenty-seven

scales describing speaking interaction and/or global proficiency were identified,

reviewed, edited into a single sentence form, and used to create a descriptor pool. After

eliminating the descriptors that were negatively worded, stated repetitively, or referenced

to a norm, approximately 1,000 descriptors were usable. A workshop was held in which

100 teachers were divided into small groups to judge the quality of descriptors and assess

students‟ oral performance on video clips. Assessment involved two techniques:

Thurstone‟s (1959, as cited in Pollitt & Murray, 1996) law of comparative judgments and

Smith and Kendall‟s (1963) method. Teachers were shown video clips in which a pair of

learners spoke to each other, and were asked to select the better performance and justify

their decision. This was done to elicit teachers‟ meta-language and to ensure the

descriptor pool was comprehensive enough to capture all instances of language

performance. In Smith and Kendall‟s (1963) method, pairs of teachers were asked to sort

a set of 60-90 descriptors into three or four categories that could represent categories of

language ability, and then to indicate descriptors that were particularly useful or clear.

(Footnote 6: Detailed methodological accounts are well documented in North (1996, 2000) and North and Schneider (1998).)

Approximately 400 descriptors were identified using this qualitative validation procedure.

These 400 descriptors were divided into seven primary questionnaires with

proficiency levels ranging from beginner to advanced. Each primary questionnaire

consisted of 50 descriptors, and was connected to the other questionnaires using 10-15

common descriptors. Mini-questionnaires were also created that linked teachers to each

other and to the primary questionnaires that they would not use to rate their own students.

Each mini-questionnaire consisted of a small number of descriptors selected from the

same level of primary questionnaire. A five-point Likert scale was attached to all

descriptors on both the primary and mini-questionnaires. One hundred participating

teachers rated ten students in their own classes (five each from two different classes)

using some of the primary questionnaires. A one-day rating conference was held three

weeks later in which all teachers used the mini-questionnaires to rate pairs of students on

11 video clips. The rating severity of each of the teachers could thus be estimated, and a

common scale constructed.

Collected ratings were entered into the FACET analysis, in which the stability of

linking descriptors was examined and misfitting descriptors were detected.7 Descriptors

that did not fit the model were generally related to sociocultural competence, work (e.g.,

telephoning, meeting, and formal presentations), negation, and pronunciation.8 After

these were eliminated, the remaining descriptors were calibrated on a common rating

scale. Ten cut-offs were set between scale levels, and these were then merged into six to match the levels that had been set for the CEFR.

The 1995 study examined whether the oral interaction scale for English could be replicated for other language skills and in different languages. A similar procedure

was undertaken to construct listening and reading scales for French, German, and English.

After reviewing and editing a pool of descriptors, approximately 1,000 were found to be

usable. Workshops were held in which 192 teachers (81 French, 65 German, and 46 English) evaluated the quality of the descriptors and rated students' performance. Four questionnaires were constructed, from which 61 descriptors were linked to the 1994 English scale.

(Footnote 7: Numerous technical problems occurred during these analyses; these are well documented in North (1995). Footnote 8: According to North and Schneider (1998), the inability to calibrate socio-cultural competence suggests that the scale is limited to measuring language ability rather than communicative competence. They note that this result is consistent with the findings of Bachman and Palmer (1982), in which socio-linguistic competence was distinguished from pragmatic and grammatical competence.)

When the FACET analysis was run on the rating datasets from both years,

the reading descriptors did not fit the model characterized by speaking and interaction,

and were thus calibrated separately. The listening descriptors were used to link this

separate reading scale to the other scales (i.e., speaking and interaction), so that reading

and listening descriptors could be analyzed together as a single set. The difficulty levels

between the two scales were adjusted, and the logit values of the listening and speaking

descriptors were highly correlated (r = 0.99) and linearly equated.

Based upon the results of the two pilot studies, the consistency between the two

scales was found to be satisfactory even though the linguistic backgrounds of the two

groups of teachers were different and the content and difficulty range of the two types of

questionnaires were different (North & Schneider, 1998). North and Schneider (1998)

also determined that scale difficulty was stable because the descriptors on similar issues

clustered adjacently onto the scale even though they were drawn from different

questionnaires. This suggests that consistent scales can be constructed in a principled

way using comprehensive surveys of existing scales, theoretical reviews and a priori

validation of descriptors, descriptor scaling based on a measurement model, and

replications of the scale (North & Schneider, 1998).

Theoretically-based and empirically-developed diagnostic rating scale.

In a recent study on L2 writing assessment, Knoch (2007) developed “a

theoretically-based and empirically-developed rating scale” for an L2 diagnostic writing

test and evaluated its diagnostic function. In the first part of her two-phase study, she

examined the existing literature to identify objective discourse measures that were

believed to best discriminate between writing samples at different proficiency levels.

These measures were then pilot-tested on 15 writing samples, and their discriminant

functions were determined based upon descriptive statistics (e.g., means and standard

deviations). In order to confirm that the measures that survived the pilot test had

sufficient discriminant function, 601 writing samples were evaluated and screened based

on their descriptive statistics (i.e., histograms, box-plots, and means) and the ANOVA

results. The resulting refined objective measures were finally used to construct a


diagnostic L2 writing rating scale assessing accuracy, fluency, complexity, mechanics,

coherence, cohesion, reader/writer interaction, and content. In the validation stage of the

study, 10 raters assessed 100 writing samples using the rating scale, and the quality of the

rating scale was evaluated using several statistics of the Rasch model: rater separation,

reliability and fit statistics, and scale step calibration. Raters‟ reactions to the scale were

also collected via questionnaires and interviews. After receiving satisfactory statistical

results and positive comments from raters, Knoch concluded that the theoretically-based

and empirically-developed rating scale was useful for an L2 diagnostic writing test.

It is noteworthy that Knoch (2007) attempted to develop a diagnostic L2 writing

scale based upon a theoretical model of L2 communicative competence and an empirical

evaluation of such theory-based models; however, the study had several limitations

related to the development of the rating scale. For example, objective measures were

selected based on the results of descriptive statistics from 15 writing samples. Knoch

explained that the small number of samples prevented the use of inferential statistics,

thus the results might not be decisive; however, she could have used a larger sample and

inferential statistics in the pilot study in order to ensure that appropriate measures would

be applied to scale construction from the beginning. There should also have been a better

explanation of the standard setting procedure. The way in which Knoch selected levels

for the rating scale seems arbitrary and impressionistic, with little evidentiary support.

For example, determining the level of fluency in an essay by counting the number of self-

corrections does not take the effect of essay length into account. Finally, it is also doubtful

whether it is even reasonable for human raters to assess writing samples using objective

measures. As the ever-growing body of literature on automated essay scoring shows,

machine raters might do so more efficiently.

Summary

Rating scales vary according to a test‟s purpose, audience, scoring methods, and

theoretical and empirical underpinnings. Acknowledging the problems associated with

intuitive or a priori methods in most scales, researchers turned their attention to empirical

methods. Of particular interest were Fulcher‟s data-driven fluency rating scale, Upshur

and Turner‟s EBB rating scale, and the CEFR scale. Unlike committee-based or


authority-based scales, these three are noted for their attempt to incorporate real language

performance into rating scale development.

This literature review suggests that assessment techniques built on empirical

sources are promising in that they substantiate the construct to be measured and draw on

concrete rationales and evidence. Empirical assessments create a dialogue among

stakeholders who might attach different philosophies, values, meanings, or purposes to

assessment. In that dialogue, assessment users play an active role as generators of

assessment criteria and interpreters of assessment outcomes, and are not passive listeners.

The nature of context-embeddedness also significantly enhances communication,

highlighting that no assessment can take place in isolation from its context and users.

These features are particularly relevant to the underlying concepts of diagnostic

assessment; in a diagnostic assessment framework, an ongoing dialogue with assessment

users can help to create a consensus about the elements to be evaluated, and can help to

keep them better informed about their particular strengths and weaknesses.

A unified assessment framework could therefore integrate the empirical

approach and diagnostic assessment; evaluation criteria would be identified from real

language performance and confirmed by theoretical accounts, and would then be used to

build a diagnostic assessment model. In that assessment model, each criterion would

represent a single evaluation element. Raters could then concentrate on one element at a

time, without the distraction of having to consider many evaluation criteria

simultaneously. Such a model could be created using an assessment scheme called an

empirically-derived descriptor-based diagnostic (EDD) checklist. The EDD checklist has

the potential to maximize the diagnostic benefit of assessment for various users. In order

to operationalize the model, however, a full understanding of diagnostic assessment is

necessary. The next section will discuss ways in which diagnostic assessment is

approached and implemented.

Approaches to Diagnostic Assessment

Diagnostic assessment is a subject of increasing interest in the language

assessment community, as researchers, recognizing the limitations of proficiency tests,

have turned their attention to assessments that contribute to instruction and curriculum


improvement (Alderson, 2005, 2007; Jang, 2005; Shohamy, 1992; Spolsky, 1992).

Kunnan and Jang (2009) note the characteristics of diagnostic assessment as follows:

The main vision in using diagnostic assessment in large-scale and classroom

assessment contexts is to help assess students‟ abilities and understanding with

feedback not only about what students know, but about how they think and learn

in content domains, to help teachers have resources of a variety of research-

based classroom assessment tools, to help recognize and support students‟

strengths and create more optimal learning environments, and to help students

become critical evaluators of their own learning (Pellegrino, Chudowsky, &

Glaser, 2001).

Shohamy (1992) also proposed an integrative diagnostic feedback testing model

describing how diagnosis components should be processed, operationalized, and applied.

In this model, she emphasized that the goal of tests should be improved teaching and

learning. The idea of diagnostic language assessment has also manifested in the

European-funded DIALANG project. This large-scale project operationalized and

validated the idea of diagnostic assessment in 14 European languages and five language

skill domains (reading, listening, writing, grammar, and vocabulary) through a computer-

based test tool.

Although it is a relatively new concept, cognitive diagnostic assessment (CDA)

has also driven significant advancements in diagnostic language assessment.

CDA formatively assesses fine-grained knowledge processes and structures in a test

domain in order to provide detailed information about students‟ understanding of the test

materials (Nichols, 1994; Nichols, Chipman, & Brennan, 1995). This is fundamentally

different from summative assessment, which focuses on placing students onto a

unidimensional continuous scale (DiBello & Stout, 2007; Nichols, 1994; Snow &

Lohman, 1989). CDA assumes that the latent ability space is composed of a set of

knowledge states, skills, or attributes, and places students onto multidimensional space,

representing multiple skill parameters. Students‟ probability of achieving mastery of each

skill is then calculated, and student skill profiles are constructed.

Although only a few studies have explored the potential applications of

psychometric CDA models in language assessment, these provide valuable insight into

how a CDA framework could be incorporated. Buck and Tatsuoka (1998) and Kasai

(1997) applied the Rule-Space Model to L2 listening and reading tests, respectively, and

Jang (2005, 2009a, 2009b) applied the Fusion Model to examine the effectiveness of the


skills diagnostic approach to L2 reading on teaching and learning.9 From a slightly

different perspective, Sawaki, Kim, and Gentile (2009) used the Fusion Model to

accurately identify skill coding categories in L2 listening and reading tests.

Successful implementation of CDA requires a series of carefully-designed

substantive and statistical assessment processes. The selection of an appropriate

psychometric CDA model suited for that particular assessment purpose is also a

prerequisite. The next sections will discuss a series of steps involving CDA

implementation and a variety of psychometric CDA models. Of the many CDA models,

the Reduced Reparameterized Unified Model ([Reduced RUM], Hartz, Roussos, & Stout,

2002) will be discussed in-depth because it is the guiding psychometric diagnostic

assessment model used in this study. The Reduced RUM was chosen because it has been

the most extensively investigated model to date (Roussos et al., 2007a). Although it

might be possible to model students‟ ESL academic writing performance using other

conjunctive or compensatory CDA models, their stability has yet to be rigorously

examined. Ways in which the CDA framework has been empirically used in language

assessment in order to estimate student language proficiency will also be discussed.

Implementation of Diagnostic Assessment

DiBello and Stout (2007) consider CDA modeling an engineering science

because it requires cross-disciplinary collaboration, blending insights gained from

psychometrics, cognitive science, and curricular and instructional theories and practices.

It is an iterative and cyclic procedure, consisting of multiple steps (DiBello, Roussos, &

Stout, 2007).10

The CDA modeling process begins with a clear statement of the

assessment purpose, which will determine whether the targeted skill space will be

modeled unidimensionally or multidimensionally and whether student ability parameters

will be classified discretely (mastery/non-mastery) or scaled continuously. Once the

assessment purpose has been defined, the skills to be measured are specified in one of

two ways: if they are to be retrofitted to existing data in order to provide students with

fine-grained diagnostic feedback, they will be identified through substantive content

analysis. If, on the other hand, a new diagnostic test is to be developed, the targeted skills will be aligned with the test's specific purpose and with the theories associated with the test's content domain. Care must be taken when determining the granularity of skills, so that a similar grain size is assigned to each skill.

(Footnote 9: The Reparameterized Unified Model was formerly known as the Fusion Model. Footnote 10: For a more detailed description of diagnostic assessment implementation, see DiBello et al. (2007).)

Test items are then assigned to the target skills or developed based on the

number and kind of skills to be measured, the relationship between skills, and their

difficulty level. The skills-by-items relationship can be conceptualized using an incidence

matrix, known as a Q-matrix (Tatsuoka, 1983). The Q-matrix specifies a relationship

such that the number 1 indicates that a given test item does measure a particular skill,

while a 0 indicates that it does not. The construction of the Q-matrix requires both theoretical

consideration of the test domain and empirical statistical results because the quality of

the Q-matrix determines the quality of the estimated diagnostic model. A poorly created

Q-matrix provides less informative diagnostic or classification indices.
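A minimal, purely hypothetical example of a Q-matrix may help clarify the representation; the skill labels and item assignments below are invented for illustration and are not those used in this study.

import numpy as np

skills = ["content", "organization", "grammar", "vocabulary"]   # illustrative skill labels
Q = np.array([
    [1, 0, 0, 0],   # item 1 is assumed to measure content only
    [1, 1, 0, 0],   # item 2 measures content and organization
    [0, 0, 1, 1],   # item 3 measures grammar and vocabulary
    [0, 0, 1, 0],   # item 4 measures grammar only
])

print(Q.sum(axis=0))   # items per skill: checks that every skill is measured by some item
print(Q.sum(axis=1))   # skills per item: helps keep the grain size comparable across items
assert (Q.sum(axis=1) > 0).all(), "every item must be assigned at least one skill"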

Once the Q-matrix is completed, it should be determined whether the

relationship among skills for a given item is conjunctive or compensatory. Conjunctive

interaction assumes that the successful completion of an item requires all the necessary

skills and that lack of competence on any one skill will result in failure on the item.

Conversely, compensatory interaction assumes that lack of competence on one skill is

compensated for by the mastery of others. Once an appropriate diagnostic model has been selected in light of these skill relationships, it is calibrated and evaluated. Simple models

(involving a small number of skills per item) are preferable because they improve

parameter identification and model calibration and evaluation (DiBello et al., 2007).

Diagnostic results are then generated, focusing primarily on the diagnostic function of a test

or item as well as the skill profiles of individual students. A user-friendly diagnostic

report is finally constructed and issued to students, teachers, and parents.
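The contrast between the conjunctive and compensatory assumptions described above can be sketched as follows; the Q-matrix row, the mastery pattern, and the simple any-skill version of the compensatory rule are hypothetical simplifications for illustration, not a particular published model.

import numpy as np

q_row = np.array([1, 1, 0, 1])   # skills required by one hypothetical item
alpha = np.array([1, 0, 1, 1])   # one student's mastery pattern over the four skills

# Conjunctive assumption: the item can be answered only if ALL required skills are mastered.
conjunctive_ok = bool(np.all(alpha[q_row == 1] == 1))

# Compensatory assumption (one simple version): mastery of ANY required skill
# can compensate for non-mastery of the others.
compensatory_ok = bool(np.any(alpha[q_row == 1] == 1))

print(conjunctive_ok, compensatory_ok)   # False True for this mastery pattern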

Psychometric Diagnostic Assessment Models

Recent advancements in psychometric CDA models have led to their

proliferation, further emphasizing the educational drive to diagnostic assessment.

Although Fischer‟s (1973, 1983) Linear Logistic Test Model (LLTM) failed to model the

ability parameter onto multidimensional space, it is considered the cornerstone of


multidimensionality-based diagnostic assessment models in that it represented the skills-

by-items relationship on an incidence matrix. Tatsuoka‟s (1983, 1990, 1993, 1995) rule-

space model is another groundbreaking work that has operationalized knowledge states

based on item response patterns. Many other psychometric models have been generated

to represent students' knowledge structures and to determine their mastery standing

each skill.

Of the many variables that define the various psychometric CDA models, the

skill mastery scale is one determining factor. When the student knowledge structure is

represented as either mastery or non-mastery, latent class models such as Deterministic-

Input, Noise-And ([DINA], Haertel, 1989), Deterministic-Input, Noise-Or ([DINO],

Templin & Henson, 2006), Noise-Input, Deterministic-And ([NIDA], Junker & Sijtsma,

2001), or Reparameterized Unified Model ([RUM], DiBello, Stout, & Roussos, 1995;

Hartz, 2002; Hartz et al., 2002) are appropriate. On the other hand, if examinees are to be

scaled onto a continuous ability continuum, latent trait models such as the compensatory

multidimensional IRT model ([MIRT-C], Reckase & McKinley, 1991) and the

noncompensatory multidimensional IRT model ([MIRT-NC], Sympson, 1977) can better

structure the knowledge state. The ways in which skills interact with each other in an

item also characterize the nature of models. Conjunctive diagnostic models (e.g., DINA,

NIDA, RUM, MIRT-NC) require all necessary skills to be utilized to get an item correct,

while compensatory diagnostic models (e.g., DINO, MIRT-C, RUM) allow

compensation for low competence in one skill with high competence in others. The

completeness of the Q-matrix can also distinguish one psychometric model from another.

Some diagnostic approaches take mastery of non-Q skills into consideration (e.g., RUM),

while others do not.

The Reduced Reparameterized Unified Model (Reduced RUM) is a latent class

conjunctive model because it assumes that students‟ latent ability space can be

dichotomized into mastery and non-mastery and that students must master all required

skills to get an item correct (Roussos et al., 2007b). In a Q-matrix representation, items i

= 1, …, I are associated with skills k = 1, …, K, with q_{ik} = 1 indicating that skill k is required by item i and q_{ik} = 0 indicating that it is not.

Examinees' ability parameters are thus modeled as

\alpha_{jk} = 1 if examinee j has mastered skill k, and \alpha_{jk} = 0 otherwise.

In the Reduced RUM, the probability of a correct response is modeled as

P(X_{ij} = 1 \mid \alpha_j) = \pi_i^{*} \prod_{k=1}^{K} (r_{ik}^{*})^{(1 - \alpha_{jk}) q_{ik}}

The parameter \pi_i^{*} is the probability of correctly applying all of the Q-specified skills to solve item i, assuming that a student has mastered all of these skills. It can be understood as item difficulty, and values of \pi_i^{*} less than 0.6 suggest that the items assigned to the skills are too difficult (Roussos et al., 2007b). The parameter r_{ik}^{*} is the ratio of the probability of a correct response between non-mastery and mastery of skill k. It is analogous to an inverse indicator of how well an item discriminates on Q-specified skills. Values of r_{ik}^{*} less than 0.5 indicate that items have strong discriminant power, whereas values of r_{ik}^{*} greater than 0.9 suggest that items are not discriminating for skill k. When items are found not to have strong discriminant power, the “1” entries in the Q-matrix should be eliminated (Roussos et al., 2007b).
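The item response function above can be computed directly. The following minimal Python sketch uses invented parameter values for a single hypothetical item; real estimates would come from calibration software such as Arpeggio.

import numpy as np

def reduced_rum_prob(pi_star, r_star, q_row, alpha):
    """P(X_ij = 1 | alpha_j) = pi*_i * prod_k (r*_ik)^((1 - alpha_jk) * q_ik)."""
    return pi_star * np.prod(r_star ** ((1 - alpha) * q_row))

pi_star = 0.85                        # success probability for a master of all Q-specified skills
r_star = np.array([0.4, 0.6, 1.0])    # penalty ratios; only entries with q_ik = 1 matter
q_row = np.array([1, 1, 0])           # the item requires skills 1 and 2 only
print(reduced_rum_prob(pi_star, r_star, q_row, np.array([1, 1, 0])))  # master of both: 0.85
print(reduced_rum_prob(pi_star, r_star, q_row, np.array([0, 1, 0])))  # non-master of skill 1: 0.85 * 0.4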

Arpeggio (DiBello & Stout, 2008) is estimation software for the Reduced RUM

that employs a Markov Chain Monte Carlo (MCMC) algorithm within a Bayesian

modeling framework. MCMC convergence can be examined by visually inspecting chain

plots, distributions of the estimated posteriors, and autocorrelations of the chain estimates, and by computing Gelman and Rubin's R̂ across multiple chains (Roussos et al., 2007b). If chains or

posteriors are stably distributed or if autocorrelations are low after the burn-in phase,

convergence has occurred. When multiple chains are employed, R̂ values less than 1.2

are also indicative of convergence (for more details, see Gelman, Carlin, Stern, & Rubin,

1995; Gelman & Rubin, 1992).
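For illustration, a minimal sketch of the Gelman and Rubin R̂ computation for a single parameter follows; this is not Arpeggio's implementation, and the three chains are simulated draws invented for the example.

import numpy as np

def gelman_rubin_rhat(chains):
    """chains: (m, n) array holding m chains of n post-burn-in draws of one parameter."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
chains = rng.normal(loc=0.0, scale=1.0, size=(3, 1000))   # three well-mixed chains
print(gelman_rubin_rhat(chains))   # values below 1.2 would be taken here as indicating convergence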

After convergence has been achieved, parameter estimates are evaluated in order

to enhance statistical power. The estimates for the \pi_i^{*} and r_{ik}^{*} parameters are the

critical factors determining the diagnostic capacity of test items, and should be carefully

examined. If they do not contribute useful diagnostic information to the item response

function in relation to Q-specified skills, the elimination of Q entries is considered in a


stepwise manner. Dropping non-influential item parameters from the Q-matrix should be

undertaken carefully based on both substantive and statistical grounds (Roussos et al.,

2007b). Once the parameter estimates are evaluated, model fit needs to be examined in

various ways. If the model fit is satisfactory, the diagnostic quality of the model is

examined and a skill mastery profile is constructed using the item and student statistics

generated by Arpeggio.

Applications of Diagnostic Assessment Models to L2 Assessment

Only a handful of studies have explored potential applications of psychometric

CDA models in L2 assessment and testing, possibly because such models are relatively new. Buck and Tatsuoka (1998) were the first to apply

CDA models to L2 language assessment. Utilizing Tatsuoka‟s (1983, 1990, 1993, 1995)

revolutionary work on rule-space methodology, they identified the cognitive and

linguistic attributes underlying an L2 listening comprehension test and classified

examinees into specific knowledge states. The rule-space methodology deconstructs

items that assess a targeted ability into several attributes or skills representing the

underlying knowledge structure, and estimates the probability that each examinee has

mastered each attribute based on correct or incorrect response patterns. Buck and

Tatsuoka analyzed the responses of 412 Japanese students on 35 dichotomously-scored

L2 listening comprehension test items, and identified 71 attributes representing the L2

listening construct. Using visual inspection, correlations with item difficulty, and

multiple regression, they reduced the number of attribute candidates to 17. An incidence

Q-matrix was then constructed using these 17 attributes and analyzed using the rule-

space procedure. Fourteen interactions among attributes were identified, and a total of 31

attributes (17 prime attributes and 14 interactions) classified 91% of examinees into

specific knowledge states. The prime attributes set was modified to fully explain the

response patterns of the remaining 9%, resulting in the reduction of the number of prime

attributes. In the second run of the rule-space procedure, 15 prime attributes and 14

interactions classified 96% of the examinees into specific knowledge states.

Although the rule-space methodology successfully classified examinees with

different ability levels into appropriate knowledge states, it had several limitations. Most


significantly, Buck and Tatsuoka noted that the use of multiple regression could cause

useful attributes to be rejected. They also pointed out that their attributes set did not

include variables related to vocabulary or syntactic complexity, and expressed

reservations about the extent to which the identified attributes could be generalized to

other L2 listening tests. Finally, even though they believed the results from the rule-space

methodology could be used to develop a diagnostic report, they called for further

research into the ways in which complicated attribute-based rule-space results could be

easily communicated to teachers and students.

Jang‟s (2005, 2009a, 2009b) study on L2 reading is the most comprehensive and

thorough example of how a series of CDA techniques can be utilized. Two forms of the

reading subtest in the LanguEdge English Language Learning Assessment were used to

examine the effects of applying a diagnostic assessment approach to a large-scale L2 reading

comprehension test on teaching and learning practices. A three-phase study was designed

involving multiple data sources and procedures. In the first phase, substantive and

statistical analyses were used to identify the knowledge structure of the reading test; 12

ESL students provided verbal reports describing their reading processes and strategies,

and nonparametric latent dimensionality analyses utilizing CCPROX/HCA (Roussos,

Stout, & Marden, 1998), DIMTEST (Stout, Froelich, & Gao, 2001), and DETECT

(Zhang & Stout, 1999) evaluated the proposed skills-by-items dimensional structure.

Nine reading subskills were substantively and statistically identifiable, and the resultant

skill set was entered into the Q-matrix construction.11

Jang (2005, 2009a, 2009b) then used the Reduced RUM (Hartz et al., 2002) to

evaluate the quality of skill profiles, with special attention to model calibration, skill

homogeneity, and performance differences between masters and non-masters. Six and

seven entries of the Q-matrix had to be eliminated from the two forms of the reading

subtest, respectively, due to low item discrimination power for an assigned skill, and

approximately 20% of the items were determined to be diagnostically less informative.

(Footnote 11: The nine reading subskills include (a) deducing word meaning from context, (b) determining word meaning out of context, (c) recognizing syntactic elements/discourse markers and integrating syntactic and semantic links, (d) processing explicit information, (e) paraphrasing implicit information, (f) processing negative statements, (g) inferential comprehension process, (h) summarizing major ideas, and (i) mapping contrasting ideas into a conceptual framework.)

In an attempt to examine the application of the skill diagnostic approach in a real L2

reading instruction context, ESL teachers and students were interviewed and surveyed

about its usefulness and effectiveness on their teaching and learning practices. Score

differences among 27 students enrolled in two TOEFL preparation courses were also

examined in pre-instruction and post-instruction settings. The results showed that some

of the students‟ reading subskills improved after instruction, and that both teachers and

students viewed the diagnostic approach positively. Jang proposes that skills diagnostic

assessment can have a positive effect on both teaching and learning, and suggests that

when a diagnostic test is aligned with learning and cognition theories, it will contribute to

more meaningful diagnostic assessment.

Lee and Sawaki (2009a) examined the comparability of the General Diagnostic

Model ([GDM], von Davier, 2005), the Fusion Model ([FM], Hartz et al., 2002), and

Latent Class Analysis ([LCA], Yamamoto & Gitomer, 1993) on the reading and listening

subtests of the two field test forms of the TOEFL iBT. The two test forms consisted of 39

and 40 reading items, respectively, and 34 listening items. Two groups of TOEFL test-

takers (2,720 and 419 test-takers, respectively) took one form of the tests and a small

subsample of test-takers (374 test-takers) took both test forms. Four reading and four

listening skill categories were developed, and the items were coded and entered into a Q-

matrix.12

The results indicated that all three models appropriately classified test-takers

into a mastery or non-mastery state for most reading and listening skills, and that a

moderate degree of across-form consistency was achieved for most reading and listening

skills. When the skill profiles were examined across the three models, a great number of

test-takers were classified into flat profiles: “1111” (mastered all) and “0000” (mastered

none). Lee and Sawaki (2009a) speculated that the inability to identify the

multidimensional structure of the TOEFL reading and listening subtests might be because

the test was developed on a single latent continuum, thus providing some support for the

test‟s unidimensionality. However, the level of granularity might have been inappropriate,

requiring validation from empirical evidence such as students‟ think-aloud verbal reports

(Sawaki, Kim, & Gentile, 2009). The Q-matrix also needs to be validated because

poorly defined skills-by-items relationships often result in flat profiles.

(Footnote 12: The reading skills include (a) understanding word meaning, (b) understanding specific information, (c) connecting information, and (d) synthesizing and organizing information. The listening skills were (e) understanding general information, (f) understanding specific information, (g) understanding text structure and speaker intention, and (h) connecting ideas.)

Limitations of Diagnostic Assessment Models

Although parametrically complex CDA models have made considerable progress,

it has been argued that such models must be substantively validated using internal and

external criteria in a real assessment context (DiBello et al., 2007). Most applications

tend to retrofit skills to pre-existing proficiency tests, and thus how such models perform on a carefully

designed diagnostic test is unknown (DiBello et al., 2007; Lee & Sawaki, 2009b; Jang,

2009a, 2009b). The lack of guidelines by which to identify skills and systematically

construct a Q-matrix is also problematic (Lee & Sawaki, 2009b). Theoretical and

empirical principles are needed to support well-defined skills-by-items representations.

Another challenge is finding an efficient means of communicating diagnostic results to

students, teachers, and other stakeholders (DiBello et al., 2007). CDA models compute a

large number of parameters, but little research (except for Jang‟s [2005, 2009a]

DiagnOsis) has been conducted to develop an effective score reporting method.

Numerically overwhelming score reporting procedures threaten both usefulness and

practicality, and ultimately prevent easy communication with stakeholders. A final

limitation is the availability of computer software (Lee & Sawaki, 2009b). Technical

developments are still in their early stages, and promising in-house software has rarely

been made commercially available. Wider accessibility is needed so that model calibration and evaluation can be validated in diverse research areas.


CHAPTER 3

METHODOLOGY

Research Questions

The central research questions were formulated based upon the argument-based

approach to validity, as follows:

1) What empirically-derived diagnostic descriptors are relevant to the construct

of ESL academic writing?

2) How generalizable are the scores derived from the EDD checklist across

different teachers and essay prompts?

3) How is performance on the EDD checklist related to performance on other

measures of ESL academic writing?

4) What are the characteristics of the diagnostic ESL academic writing skill

profiles generated by the EDD checklist?

5) To what extent does the EDD checklist help teachers make appropriate

diagnostic decisions and have the potential to positively impact teaching and

learning ESL academic writing?

Research Design Overview

This is a two-phase study. Phase 1 concerns the development of the EDD

checklist, while Phase 2 pilots, models, and evaluates the EDD checklist. Given the

complex nature of argument-based validation inquiry, this study followed a mixed

methods research design. A mixed methods approach strives for knowledge claims

grounded on pragmatism and incorporates quantitative and qualitative research methods

and techniques, either simultaneously or sequentially, into a single study (Creswell,

2003). The use of multiple methods has the potential to reduce biases and limitations

inherent in a single method while strengthening the validity of inquiry (Greene, Caracelli,

& Graham, 1989). A series of validity arguments and assumptions determined the types

of data to be collected, which were then analyzed and synthesized using both quantitative

and qualitative methods. Of many mixed methods designs, an expansion design (see

Greene et al., 1989 for a review of mixed methods evaluation designs) was particularly


well suited to this study because it offered a comprehensive understanding of the EDD

assessment, examining diverse aspects of the validity claims. A complementarity design

was also pertinent because it investigated overlapping but different aspects of the EDD

score-based interpretations and uses that different methods might have elicited. The same

weight was given to both quantitative and qualitative methods. Table 2 contains a

summary of the research questions, participants, instruments/data, and

procedures/analyses over the two phases.


Table 2

Research Design Summary

Phase 1

Research question 1: What empirically-derived diagnostic descriptors are relevant to the construct of ESL academic writing?
Participants: 9 ESL teachers; 4 ESL academic writing experts
Instrument/Data: 10 TOEFL essays (5 proficiency levels × 2 prompts); a think-aloud verbal protocol; EDD descriptors
Procedure/Analysis: Nine ESL teachers thought aloud while assessing and providing diagnostic feedback on 10 TOEFL essays. Four ESL academic writing experts then reviewed and sorted the EDD descriptors that would constitute the checklist.

Phase 2

Research question 2: How generalizable are the scores derived from the EDD checklist across different teachers and essay prompts?
Participants: 7 ESL teachers
Instrument/Data: 80 TOEFL essays (40 essays × 2 prompts); EDD checklist; Teacher Questionnaire I; interview protocol
Procedure/Analysis: In the pilot study, seven ESL teachers assessed 80 TOEFL essays using the EDD checklist. They were then asked to complete a questionnaire and were interviewed about the use of the checklist. The preliminary analysis was conducted using FACETS to examine score generalizability.

Research question 3: How is performance on the EDD checklist related to performance on other measures of ESL academic writing?
Participants: 7 (and 10) ESL teachers
Instrument/Data: Scores awarded using the EDD checklist on 80 (and 480) TOEFL essays; scores awarded by ETS raters on 80 (and 480) TOEFL essays
Procedure/Analysis: A correlation analysis was conducted.

Research question 4: What are the characteristics of the diagnostic ESL academic writing skill profiles generated by the EDD checklist?
Participants: 10 ESL teachers
Instrument/Data: 480 TOEFL essays (240 essays × 2 prompts); EDD checklist; Teacher Questionnaire II; interview protocol
Procedure/Analysis: In the main study, 10 ESL teachers assessed 480 TOEFL essays using the EDD checklist. The scored data were analyzed to examine the dimensional structure of ESL writing. The diagnostic quality of the estimated model was then examined using the Reduced RUM. The teachers also completed a questionnaire and were interviewed to evaluate the use of the EDD checklist.

Research question 5: To what extent does the EDD checklist help teachers make appropriate diagnostic decisions and have the potential to positively impact teaching and learning ESL academic writing?
Participants: 10 ESL teachers
Instrument/Data: Questionnaire and interview results
Procedure/Analysis: The teachers' questionnaire and interview results were analyzed for their positive or negative reactions to the use of the EDD checklist.


Phase 1

1) What empirically-derived diagnostic descriptors are relevant to the construct

of ESL academic writing?

The primary purpose of Phase 1 was to identify descriptors that are relevant to

the construct of ESL academic writing. Nine ESL teachers participated in a think-aloud

session to verbalize their thought processes while assessing and providing feedback on

10 TOEFL iBT independent essays. These verbal accounts provided rich descriptions of

ESL academic writing ability and served as the base for constructing the pool of EDD

descriptors. The recorded verbal data were fully transcribed and coded iteratively in order

to identify distinct ESL academic writing subskills and textual features. Four ESL

academic writing experts then reviewed the identified descriptors and sorted them into

dimensionally distinct writing skills. Based upon the experts' review and sorting

outcomes, the EDD checklist was constructed.

Phase 2

2) How generalizable are the scores derived from the EDD checklist across

different teachers and essay prompts?

3) How is performance on the EDD checklist related to performance on other

measures of ESL academic writing?

4) What are the characteristics of the diagnostic ESL academic writing skill

profiles generated by the EDD checklist?

5) To what extent does the EDD checklist help teachers make appropriate

diagnostic decisions and have the potential to positively impact teaching and

learning ESL academic writing?

The primary purpose of Phase 2 was to pilot, model, and evaluate the EDD

checklist. Eleven ESL teachers participated in Phase 2, with seven participating in the

Phase 2 pilot study and ten participating in the Phase 2 main study. Six teachers

participated in both the pilot and main studies. The seven ESL teachers who participated

in the pilot study assessed 80 TOEFL iBT independent essays and preliminarily

evaluated whether the checklist functioned as intended. Once the functionality of the

checklist was determined, 10 ESL teachers participated in the main study to assess 480


TOEFL iBT independent essays and to evaluate the use of the checklist. The validity

assumptions underlying the four research questions in Phase 2 were critically

examined from diverse perspectives using multiple data sources. In order to gain a

comprehensive view of the use of the EDD checklist, both quantitative and qualitative

data were collected and analyzed, and findings were integrated and synthesized in a

complementary manner.

Participants

TOEFL iBT Writing Test Participants

The TOEFL iBT writing test participants consisted of 480 ESL learners who took

the test at domestic (i.e., North American) or international test centers. Half of the test-

takers participated in the TOEFL iBT administration in the fall of 2006 (hereafter Form

1), and the other half participated in the spring of 2007 (hereafter Form 2). Test-takers

were 14 to 51 years of age (M=23.61, SD=6.40), and approximately the same percentage

of male and female test-takers participated in each test administration. Test-takers came

from 76 different countries and spoke 52 different languages as a first language. Test-

takers who spoke Chinese as a first language accounted for the largest number of the test-

takers, followed by Korean, Spanish, and Japanese (see Table 3). When the distribution

of test-takers was examined according to language group, the number of test-takers who

spoke non-Indo-European languages (59.58%) was greater than the number of test-takers

who spoke Indo-European languages (40.42%; see Table 4). Test-takers' primary reason

for taking the TOEFL was to enter a college or a university as either an undergraduate

student (18.13%) or a graduate student (21.04%).

Table 3

The Four Largest First Language Groups

First language    Form 1 (f / %)    Form 2 (f / %)    Total (f / %)
Chinese            43 / 17.92        40 / 16.67        83 / 17.29
Korean             21 /  8.75        35 / 14.58        56 / 11.67
Spanish            29 / 12.08        23 /  9.58        52 / 10.83
Japanese           18 /  7.50        32 / 13.33        50 / 10.42
Total             111 / 46.25       130 / 54.17       241 / 50.21


Table 4

Distribution of Test-Takers by Language Groups

Language group        Form 1 (f / %)    Form 2 (f / %)    Total (f / %)
Indo-European         106 /  44.17       88 /  36.67      194 /  40.42
Non-Indo-European     134 /  55.83      152 /  63.33      286 /  59.58
Total                 240 / 100.00      240 / 100.00      480 / 100.00

ESL Academic Writing Teachers

Sixteen experienced ESL teachers were recruited from a college-level language

institute in Toronto, Canada. Recruitment followed an ethics review protocol submitted to the ethics review board of the University of Toronto. All

ESL teachers were native English speakers with varying experience (2 to 25 years;

average 8.06 years) teaching ESL writing to adult learners. Eleven teachers held or were

pursuing a graduate degree in Applied Linguistics or Second Language Education, and

13 held a certificate in Teaching English as a Second Language (TESL). All teachers rated themselves as familiar with and competent in assessing the written English of

non-native English speakers. Eleven teachers also reported that they had been trained to

assess ESL writing. Nine teachers participated in Phase 1, 11 teachers participated in

Phase 2, and four teachers participated in both Phases. Of the 11 teachers who

participated in Phase 2, seven participated in the pilot study, ten participated in the main

study, and six participated in both the pilot and main studies. Detailed background

information about the ESL teachers is presented in Appendix B.

ESL Academic Writing Experts

Four doctoral students with substantial knowledge and research experience in

ESL writing (hereafter referred to as ESL writing experts) were recruited from a Second

Language Education Program at a research-intensive university in Canada. The ESL

writing experts included three males and one female. Two of the experts were native

English speakers, while the other two were native speakers of Korean and Arabic, respectively. All of the ESL writing experts had extensive research experience related to teacher feedback, motivation, writing conferencing, the writing process, and assessment (see Table 5). They also had varying amounts of experience teaching ESL writing to non-native English speakers at the university level.

Table 5

Profile of ESL Writing Experts

Gary     Age 40-49; male; first language English; 12 years of teaching experience at the university level.
         Research areas in ESL writing: teacher feedback and student revision; ESL writing curriculum design and program development.
Jane     Age 30-39; female; first language English; 4 years of teaching experience at the university level.
         Research areas in ESL writing: motivation in ESL writers at the university level; writing pedagogy and assessment.
Anthony  Age 30-39; male; first language Korean; 3 years of teaching experience at the university level.
         Research areas in ESL writing: writing conferencing via computer-mediated communication; feedback on ESL writing.
Alex     Age 20-29; male; first language Arabic; 3 years of teaching experience at the university level.
         Research areas in ESL writing: collaborative writing; ESL writing process and assessment.

Note. Pseudonyms were used in order to obscure the experts' identities.

Instruments

TOEFL iBT Writing Samples

The writing samples used in this study were requested from the Educational

Testing Service (ETS) in New Jersey, U.S. ETS administered two forms of the retired

TOEFL iBT at various international and domestic test centers in the fall of 2006 and the

spring of 2007.[13]

The purpose of the TOEFL iBT is to assess test-takers‟ ability to

communicate effectively in English in an academic context, focusing on their language

skills in reading, listening, speaking, and writing. The test is delivered on computer via

the Internet and takes about four hours to complete all four sections. The TOEFL iBT

writing section consists of two tasks (one integrated and one independent), and responses must be typed into the computer.

[13] A retired test is one for which the operational form is no longer used and for which the items are considered exposed.

While the integrated task requires test-takers to write a

summary after reading and listening to a passage, the independent task requires test-

takers to write an essay based upon their knowledge and experience. The responses are

scored by two to four trained and certified ETS human raters according to a five-point

holistic rating scale.

ETS provided me with 480 TOEFL iBT independent essays written on two

different prompts (240 essays × 2 prompts) along with additional test-taker background

information. The two writing prompts are:

(a) Do you agree or disagree with the following statement? It is more important

to choose to study subjects you are interested in than to choose subjects to

prepare for a job or career. Use specific reasons and examples to support

your answer. (hereafter referred to as the subject prompt)

(b) Do you agree or disagree with the following statement? In today‟s world, the

ability to cooperate well with others is far more important than it was in the

past. Use specific reasons and examples to support your answer. (hereafter

referred to as the cooperation prompt)

Table 6 presents the score distribution of the 480 TOEFL iBT independent essays.

Each essay was rated by two ETS raters and the average was reported. Although few

essays were awarded a score of 1 or 1.5, the score distribution took an approximate bell-

curve shape. New four-digit code numbers were assigned to all essays, with code

numbers from 1000 to 1240 indicating essays written on the subject prompt, and code

numbers 2000-2240 indicating essays on the cooperation prompt.

Table 6

Score Distribution of the TOEFL iBT Independent Essays

Score    Subject prompt (f / %)    Cooperation prompt (f / %)    Total (f / %)
1.0          1 /   0.42                 2 /   0.83                   3 /   0.63
1.5          3 /   1.25                 2 /   0.83                   5 /   1.04
2.0         18 /   7.50                20 /   8.33                  38 /   7.92
2.5         32 /  13.33                25 /  10.42                  57 /  11.88
3.0         65 /  27.08                55 /  22.92                 120 /  25.00
3.5         40 /  16.67                45 /  18.75                  85 /  17.71
4.0         32 /  13.33                38 /  15.83                  70 /  14.58
4.5         35 /  14.58                28 /  11.67                  63 /  13.13
5.0         14 /   5.83                25 /  10.42                  39 /   8.13
Total      240 / 100.00               240 / 100.00                 480 / 100.00

Think-Aloud Verbal Protocol

A think-aloud verbal protocol (Ericsson & Simon, 1993) was developed to elicit

teachers' thought processes while they were providing diagnostic feedback on the

TOEFL essays. Appendix C outlines the procedures for each think-aloud session. The

instructions were carefully scripted in great detail, including intervention prompts and

follow-up interview questions. The protocol also included a teacher background

questionnaire.

Teacher Questionnaire and Interview Protocols

A two-part questionnaire was developed to examine how teachers evaluated the

EDD checklist (see Appendix D). The first part asked about teachers' (a) personal background, (b) teaching experience, and (c) assessment experience, and the second focused on teachers' (d) evaluation of the EDD checklist. In the evaluation section,

teachers were asked to determine whether each EDD descriptor was clear/not clear,

redundant/non-redundant, useful/useless, and relevant/irrelevant to ESL academic writing.

Open-ended questions were also included to further investigate the strengths and

weaknesses of the EDD checklist as well as the most or least important descriptors in

assessing ESL academic writing skills. An interview protocol was also developed to

discuss how teachers felt the use of the EDD checklist would impact their teaching and

assessment practices. The guiding interview questions are outlined in Appendix E.


Data Collection and Analysis Procedures

Phase 1

Think-Aloud Verbal Protocol Procedure

An individual meeting was set up with each of the teachers as guided by the

think-aloud verbal protocol (see Appendix C). Nine ESL teachers participated in

individual think-aloud sessions in which they verbalized their thought processes while

providing diagnostic feedback on 10 essays. In order to capture all instances of different

essay characteristics, three sets of 10 essays were carefully selected according to the

prompt on which they were written, the scores awarded by ETS raters, and the essay

length. Each essay set contained ten essays written on two different prompts covering a

wide range of score levels (5 essays × 2 prompts). Due to the small number of essays at

the score level 1 (n=1 for the subject prompt and n=2 for the cooperation prompt; see

Table 6), only Essay Set 3 contained essays awarded a score of 1. Essays awarded a score

of 1.5, 2.5, 3.5, or 4.5 were not selected because these essays revealed score

disagreement among ETS raters. Table 7 presents the score distribution of the 10 essays

in each set.

Table 7

Score Distribution of the Three Essay Sets

Score    Essay Set 1 (f / %)    Essay Set 2 (f / %)    Essay Set 3 (f / %)
1            0 /   0                0 /   0                2 /  20
2            2 /  20                2 /  20                2 /  20
3            4 /  40                2 /  20                2 /  20
4            2 /  20                4 /  40                2 /  20
5            2 /  20                2 /  20                2 /  20
Total       10 / 100               10 / 100               10 / 100

A textual analysis was conducted in order to examine the characteristics of these

essays. The VocabProfile English version 3.0 (Cobb, 2006) was used to calculate the

number of words, number of word types, percentages of K1 (the most frequent 1,000

word families), K2 (the most frequent 2,000 word families), and AWL (Academic Word

List) words, and lexical density. Spelling errors were corrected before the essays were


run through the program, so that misspelled words were not considered off-list. As

Appendix F shows, essays at the different score levels exhibited drastically different

profiles in terms of essay length and vocabulary sophistication. This result confirmed that

the three essay sets represented a wide range of ESL academic writing characteristics.

Three teachers were assigned to assess each essay set. The essays were randomly ordered

in order to counterbalance order effects. Appendix G provides the order of the

essays presented to each teacher.
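For readers unfamiliar with this kind of lexical profiling, the following minimal Python sketch illustrates the general idea (token counts, type counts, and a rough lexical-density figure). It is purely illustrative: the function-word list is invented, and VocabProfile's K1, K2, and AWL classifications are not reproduced here.

```python
import re

def lexical_profile(essay_text, function_words):
    """Rough lexical profile of one essay: tokens, types, and an approximate lexical density.

    `function_words` is a placeholder set of grammatical words; lexical density is
    approximated here as the proportion of tokens that are not function words.
    """
    tokens = re.findall(r"[a-z']+", essay_text.lower())
    types = set(tokens)
    content_tokens = [t for t in tokens if t not in function_words]
    return {
        "n_tokens": len(tokens),
        "n_types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "lexical_density": len(content_tokens) / len(tokens) if tokens else 0.0,
    }

# Illustrative use with a toy function-word list (the study itself relied on VocabProfile's word lists).
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "that", "than", "for"}
sample = "Choosing subjects that interest you is more important than preparing for a job."
print(lexical_profile(sample, FUNCTION_WORDS))
```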

When a think-aloud session was held, considerable attention was paid to the

timing of the teachers' verbal reports. Two possible verbal reporting methods (i.e., concurrent and immediate retrospective) were introduced to the teachers, who chose between them. The concurrent think-aloud method required teachers to verbalize

their thought processes while reading and providing diagnostic feedback on essays with

no time delay and was thought to be effective in minimizing memory loss. The

immediate retrospective think-aloud method allowed teachers to read an essay first either

silently or aloud, and then to speak their thoughts aloud. Although the retrospective

method increased the potential for memory loss, it would improve concentration by

allowing teachers to read the essays without interruption. Teachers were provided with an

explanation of the two think-aloud methods, and were allowed to choose the method they

thought would work best for them. After trying both methods, three ESL teachers

selected the concurrent method, and six chose the retrospective method. The teachers

who preferred the retrospective method reported that the concurrent method interfered

with the reading process and did not effectively elicit their natural cognitive responses.

Each time the teachers completed a think-aloud report, they were interviewed in

order to clarify any unclear statements or ambiguous comments they had made. The role

of interviewer was minimized as much as possible so as not to unduly influence the

feedback. After they had completed the think-aloud process for all 10 essays, teachers

were asked to assign a mark to each using the TOEFL iBT independent writing rating

scale. A comparison of the scores awarded by ETS raters and those awarded by the

teachers was made in order to determine whether the teachers‟ assessments were

consistent with the ETS assessments and whether their verbal report data were a reliable

enough source to be used in creating an assessment tool. In the follow-up interview,


teachers were also asked what skills or strategies they thought should be diagnosed in

ESL academic writing.

Each think-aloud and follow-up interview session lasted two to three hours. With

the permission of the teachers, all verbal reports and interviews were tape recorded and

immediately transcribed. Teachers' background information was collected, along with the

scores that they awarded to the 10 essays. When each session was over, all of the

assessment materials were collected for security purposes. The verbal data were

transcribed using Microsoft® Word, and the score data were entered into Microsoft® Excel spreadsheets. When data from each tape recorded think-aloud session were

transcribed, any text read directly from an essay was italicized. Appendix H presents

excerpts from teachers' think-aloud verbal transcripts.

Analysis of Teachers' Think-Aloud Verbal Protocols

Teachers' verbal accounts and interview reports provided rich descriptions of

ESL academic writing ability and served as the base for constructing the pool of EDD

descriptors. Recorded verbal data were transcribed in full and then reviewed iteratively to

identify the distinct ESL academic writing subskills and textual features that would

constitute the EDD descriptors. Grounded theory (Glaser & Strauss, 1967) was the

principal methodology used to identify the emerging descriptors with varied properties

and dimensions. The analysis was done in several steps: first, transcripts of each

teacher's think-aloud verbal protocols were grouped under the same essay sets. The

transcripts were then divided according to essay score levels, with those describing high-

scored essays analyzed separately from those of low-scored essays. Finally, the

transcripts from the follow-up interviews were referenced when necessary for accurate

analysis.

Transcripts ranged from 5,422 to 8,504 words (i.e., from 9 to 17 single-spaced

typed pages) per teacher (see Table 8). Each transcript was read through, categorized, and

segmented into meaningful units using the computer program NVivo 8 (QSR, 2008). The

unit of analysis was one distinct evaluation theme that characterized ESL writing

subskills and textual features, and each evaluation theme represented one distinct EDD

descriptor. Ambiguous or hard-to-interpret evaluative comments were excluded from


analysis, while comments that were too general, such as “good language,” “good

introduction,” “I love the thought”, were disregarded because the analysis focused on

identifying fine-grained diagnostic evaluation themes. The transcripts were thus coded at

micro-level ESL writing skills in order to come up with specific EDD descriptors. In

assessing a writer‟s vocabulary knowledge, for example, several different aspects were

identified and coded (e.g., word sophistication, word variety, word choice, collocation,

etc.) instead of having one general evaluation criterion called “vocabulary”.

Table 8

Volume of the Teachers' Think-aloud Transcripts

Teacher    No. of words    Length of transcripts (pages)
Ann             8,504            17
Beth            7,192            13
Esther          5,422             9
George          7,915            13
James           6,597            11
Judy            7,382            12
Sarah           6,910            12
Shelley         8,873            15
Tim             6,252            10
Mean            7,227            12
Total          65,047           112

Note. Pseudonyms were used in order to obscure the teachers' identities.

The analysis of the transcripts resulted in a final total of 1,715 segments

representing 39 EDD descriptors. Each of the 39 EDD evaluative themes was then

reviewed based upon theories of ESL writing and a variety of existing ESL writing

assessment schemes developed by Jacobs et al. (1981), Hamp-Lyons and Henning (1991),

Brown and Bailey (1984), ETS (2007), and University of Cambridge, British Council,

and IELTS Australia (2007). Descriptors in these schemes were considered the

preliminary theoretical and practical guidelines that could be used to justify the EDD

descriptors. Along with this preliminary analysis, a more in-depth theoretical review of

the descriptors was conducted later by four ESL academic writing experts.


I coded all 1,715 segments, and a second coder independently coded the original,

uncoded, segmented transcripts of each teacher's think-aloud reports on two essays (515

segments; approximately 30.03% of all segments) in order to examine inter-coder

reliability. The second coder was a PhD student specializing in Second Language

Education with substantial knowledge of ESL writing. She was provided with a coding

scheme consisting of 39 descriptors and agreed definitions prior to beginning work.

When discrepancies occurred between the second coder and me, the areas of

disagreement were revisited and discussed in order to facilitate resolution.
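The inter-coder reliability check can be illustrated with a minimal sketch of the kind of agreement computation involved. The descriptor codes below are invented, and Cohen's kappa is included only as a common companion statistic for categorical coding; it is not claimed to be the statistic reported in this study.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Proportion of segments assigned the same descriptor code by both coders."""
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders over the same segments."""
    n = len(labels_a)
    po = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    pe = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical descriptor codes assigned to the same five segments by two coders.
coder1 = ["word_choice", "thesis", "cohesion", "word_choice", "spelling"]
coder2 = ["word_choice", "thesis", "paragraphing", "word_choice", "spelling"]
print(percent_agreement(coder1, coder2), round(cohens_kappa(coder1, coder2), 3))
```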

ESL Academic Writing Experts' Descriptor Review and Sorting

Four ESL writing experts participated in a focus group meeting to review the

EDD descriptors elicited from the teachers' think-aloud verbal data. They were provided

with six TOEFL essays (3 essays × 2 prompts) along with 39 EDD descriptors. Once the

experts had read the essays and had a general understanding of the writing context, they

were asked to review each descriptor and to discuss whether it was clear/not clear,

redundant/non-redundant, useful/useless or relevant/irrelevant to ESL academic writing.

When necessary, the teachers' think-aloud transcripts were made available to them so that they could better understand the ways in which the EDD descriptors were elicited. The

writing experts were also asked to determine whether each descriptor was independent of

the others and conducive to making a binary (yes or no) choice or a four-point Likert

(strongly agree, somewhat agree, somewhat disagree, or strongly disagree) choice. When

the wordings of the descriptors were not clear, the experts edited them. After examining

each descriptor, they were also asked whether the descriptor pool was comprehensive

enough to cover all aspects of ESL academic writing. Any missing theoretical aspects

were added to the descriptor pool based upon existing theories of ESL academic writing.

The meeting lasted approximately two hours and was tape recorded in its entirety.

One month later, the same four ESL academic writing experts were invited to

individual meetings where they sorted the reviewed EDD descriptors into dimensionally

distinct ESL writing skills. This sorting activity proceeded in two phases: first, each

writing expert was asked to come up with his or her own skill identification scheme

while sorting the descriptors; then he or she was asked to sort the descriptors using the


predetermined sorting categories. The purpose of the first sorting activity was to examine

how ESL teachers conceptualize the underlying structure of ESL writing ability, while

the second identified the skills-by-descriptors relationship needed to construct a Q-matrix.

The predetermined sorting categories were developed based upon both empirical and

theoretical grounds; the teachers' think-aloud verbal protocols were used as guiding

empirical sources, complemented by theories that define and assess the construct of ESL

academic writing. I read each descriptor iteratively in order to identify the writing skills

that best represented the characteristics of the descriptors. When the skills were

empirically identified, they were sequentially compared and confirmed according to

theories of ESL writing and a variety of existing ESL writing assessment schemes.

During the sorting category finalization process, the following was taken into

consideration: 1) writing skills should be conceptually distinguishable from each other; 2)

each writing skill should have a minimum number of descriptors in order to be considered

for the statistical testing of dimensionality structures; and 3) writing skills should be

comparable to those specified in existing ESL writing assessment schemes for cross-

validation purposes. The sorting categories created through this process included five

skills: content fulfillment (CON), organizational effectiveness (ORG), grammatical

knowledge (GRM), vocabulary use (VOC), and mechanics (MCH). These writing skills

were consistent with the assessment components discussed in Chapter 2 (see Table 1) and

with the assessment criteria described in Jacobs et al.'s (1981) scale.

Each writing expert received a set of index cards on which the reviewed EDD

descriptors were reproduced. The experts were first asked to skim through these cards,

and then to sort them into piles that they thought represented distinct ESL writing skills.

This first sorting activity was conducted based solely on the experts' own skill configuration. When they thought a descriptor was associated with multiple writing skills, they labeled the skills as primary or secondary. In the second phase, the experts used the

predetermined sorting categories to assign the descriptors to appropriate skills. The

purpose of this second sorting activity was to construct a Q-matrix. Before they sorted

the descriptors, the sorting categories were explained and the experts were asked whether

they thought the five writing skills (content fulfillment, organizational effectiveness,

grammatical knowledge, vocabulary use, and mechanics) were comprehensive enough to


represent the content of all of the descriptors. Detailed definitions or descriptions of each

skill category were not provided, so that the experts' mapping of the descriptors onto

skills was not restricted. They were further asked to mark those descriptors that matched

with multiple or none of the skill categories. The sorting activity lasted approximately

one hour for each expert, and their verbal accounts were tape recorded with their

permission.

The EDD checklist was constructed based upon the experts' review outcomes

derived from the focus group meeting. Overlapping concerns or suggestions were taken

into consideration when the descriptors were refined and finalized to constitute the EDD

checklist. The refinement process was iterative, and careful attention was paid to each

descriptor's wording. When the refinement was completed, two marking boxes, labeled

yes or no, were attached to each descriptor to create a checklist form (see Appendix I).

The experts' sorting outcomes were also reviewed carefully in order to identify areas of

substantial agreement or disagreement. The result of the second sorting activity identified

the skills-by-descriptors relationships that ultimately constituted the Q-matrix.
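To make the notion of a Q-matrix concrete, the sketch below shows a small, entirely hypothetical descriptor-by-skill matrix using the five predetermined skill categories; the descriptor labels and skill assignments are invented and do not reproduce the actual Q-matrix.

```python
import numpy as np

# Columns follow the five predetermined skill categories: CON, ORG, GRM, VOC, MCH.
SKILLS = ["CON", "ORG", "GRM", "VOC", "MCH"]

# Hypothetical rows: each descriptor is marked 1 for the skill(s) it requires.
Q = np.array([
    [1, 0, 0, 0, 0],  # e.g., "The essay addresses the prompt"        -> CON
    [0, 1, 0, 0, 0],  # e.g., "Ideas are sequenced logically"          -> ORG
    [0, 0, 1, 0, 0],  # e.g., "Verb tenses are used accurately"        -> GRM
    [0, 0, 0, 1, 0],  # e.g., "Word choice is precise"                 -> VOC
    [0, 0, 0, 0, 1],  # e.g., "Spelling is accurate"                   -> MCH
    [1, 1, 0, 0, 0],  # an invented descriptor assigned to two skills
])

# Each row of Q tells the diagnostic model which skills a descriptor measures.
for d, row in enumerate(Q, start=1):
    required = [s for s, q in zip(SKILLS, row) if q == 1]
    print(f"Descriptor {d}: requires {required}")
```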

Phase 2

Pilot Study

ESL academic writing teachers' essay assessment

Seven ESL writing teachers participated in the pilot study to assess 80 TOEFL

iBT independent essays. Forty TOEFL iBT independent essays were selected from each

of two essay pools using a stratified sampling procedure (40 essays × 2 prompts) and

formed into essay batches. Each essay batch consisted of 10 essays representing all

proficiency levels on the two prompts (5 levels × 2 prompts).[14]

Table 9 shows the

distribution of the essay batches assigned to the teachers. Each teacher assessed three

essay batches, with one essay batch (Batch 03) assessed by all seven teachers and the

remaining seven batches assessed by two different teachers.[15]

The three essay batches

assigned to each teacher were ordered by the prompt and counterbalanced. Three teachers

assessed essays that were written on the subject prompt first, and the other four teachers assessed essays that were written on the cooperation prompt first.

[14] The five levels were roughly determined because each level did not have an equal number of essays (see Table 6 for the score distribution of the essays).
[15] Batch 03 functioned as an anchor set linking all assessment facets in the FACETS analysis.

Rater training was held prior to the teachers' essay assessment. The purpose of

the training was to orient the teachers to the EDD checklist, not to clone them to achieve

high inter-rater reliability. An individual meeting was set up with each of the teachers to

explain the purpose of the study and to outline the checklist's development procedure in greater detail.[16]

Each descriptor was explained using concrete examples; the yes or no

option was also explained, and the difficulty of determining a cut-off for yes or no was

acknowledged. The general rule of thumb was that if a teacher thought that a writer

generally met the criteria of the descriptor, it was considered a yes; otherwise it was

considered a no. The term generally indicated that a teacher did not feel distracted, and that the teacher's comprehension was not compromised, by a student's mistakes on the skill being assessed. The training was informal in order to minimize potential psychological pressure that might affect the teachers' assessments.

Table 9

Distribution of Essay Batches in the Pilot Study

Teacher     Essay batches
Angelina    Batch 03, Batch 01, Batch 07
Ann         Batch 03, Batch 05, Batch 08
Beth        Batch 03, Batch 01, Batch 05
Brad        Batch 03, Batch 04, Batch 06
Esther      Batch 03, Batch 02, Batch 08
Susan       Batch 03, Batch 04, Batch 07
Tom         Batch 03, Batch 02, Batch 06

In order to ensure that the training had been successful, the teachers were asked

to assess one essay sample using the EDD checklist. Specifically, they were asked to (a)

make a yes or no decision for each descriptor, and (b) indicate their confidence levels on

each descriptor. The rationale for requiring teacher confidence levels was to identify the

descriptors that the teachers felt were difficult to use. They were asked to indicate their

confidence levels anywhere along the continuum between 0% and 100%, with 0%

indicating the lowest confidence level and 100% indicating the highest.

[16] Training with one teacher was delivered via email because she could not attend on-site training.

While the teachers were marking the essay, they were left alone in a quiet room. The assessment

took approximately 15 minutes, after which a debrief session was held. Teachers were

asked to report any concerns with or suggestions for using the EDD checklist; retraining

took place, if necessary, depending on the gravity of these concerns. The entire training

session lasted approximately one hour.

Upon completing their training, the teachers were provided with an assessment

package containing (a) 30 essays (15 essays × 2 prompts), (b) the EDD checklist (see

Appendix J), (c) the assessment guidelines (see Appendix K),[17] and (d) the Teacher

Questionnaire I. They were asked to assess 30 essays, but to indicate their confidence

levels on just 10 essays (5 essays × 2 prompts) in order to save time and to make them

focus on the assessment itself. The turnaround time for assessment results was within two

weeks of the training date. Once the teachers had completed their assessments, they were

asked to fill out the Teacher Questionnaire I and were interviewed for 30 to 45 minutes.

The interview focused on their evaluations of the checklist's quality and effectiveness.

All assessment materials were collected after the assessments for security purposes. The

score data were entered into Microsoft® Excel spreadsheets, and the questionnaire data,

including teacher background information, and the interview data were entered into

Microsoft® Word. When score data were entered, a yes response was treated as “1” and a

no response was treated as “0”.

Preliminary analyses of the EDD checklist

The data collected in the pilot study were analyzed in order to examine the

validity assumptions concerning the use of the EDD checklist and to further fine-tune the

methodology of the main study. Three validity assumptions were examined:

- The scores derived from the EDD checklist are generalizable across different teachers and essay prompts (Teacher and essay prompt effects).
- Performance on the EDD checklist is related to performance on other measures of ESL academic writing (Correlation between EDD scores and TOEFL scores).
- The EDD checklist helps teachers make appropriate diagnostic decisions and has the potential to positively impact teaching and learning ESL academic writing (Teacher perceptions and evaluations).

[17] The assessment guidelines were carefully scripted so that they could be used as a reference for the teachers.


a. Analysis of teacher and prompt effects

A Many-faceted Rasch Model (MFRM) was used to examine the extent to which

the student writing scores obtained from a sample of teachers on a sample of essay

prompts were generalizable beyond that specific set of teachers and essay prompts.

MFRM estimates the latent ability of a student while taking test conditions into

consideration. The rating probability for a particular student on a certain descriptor from

a particular teacher can be predicted mathematically from given facets, such as the ability

of the student, the difficulty of the descriptor, and the severity of the teacher. All facets

are placed on a single common logit scale, with the measurement units expressed as

logits.
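For dichotomous (yes/no) ratings, this relationship can be illustrated, under the usual dichotomous many-facet Rasch formulation, as a logistic function of the facet measures. The sketch below illustrates only that relationship, not the FACETS program itself, and the facet values used are invented.

```python
import math

def mfrm_probability(student_ability, descriptor_difficulty, teacher_severity, prompt_difficulty=0.0):
    """Probability of a 'yes' rating under a dichotomous many-facet Rasch model.

    All arguments are measures on the same logit scale; the log-odds of a 'yes'
    rating is the student's ability minus the other facet measures.
    """
    logit = student_ability - descriptor_difficulty - teacher_severity - prompt_difficulty
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical facet measures (logits): an able student, an average descriptor,
# a slightly severe teacher, and a prompt anchored at 0.03 logits as in the analysis.
print(round(mfrm_probability(1.2, 0.0, 0.4, 0.03), 3))
```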

The 7,326 valid ratings awarded by the seven teachers using the 35 descriptors

on the 80 essays were entered into the MFRM computer software, FACETS version

3.66.0 (Linacre, 2009).[18]

Ten essays were assessed by all seven teachers, and the

remaining 70 essays were assessed by two different teachers so that the data matrix was

partially crossed. In the model specification, four facets were specified: student, prompt,

teacher, and descriptor. While the student and descriptor facets were centered by

anchoring logit means at zero, the teachers were allowed to float because the analysis of

interest was focused on the teacher behavior in using the EDD checklist. The prompt

facet was entered as a dummy facet and anchored at preset values of 0.03 logits (for the

cooperation prompt) and -0.03 logits (for the subject prompt), respectively.[19]

Anchoring

was necessary in order to connect the two separate essay subsets in which each student

wrote a single essay on one prompt only. The preset values of 0.03 and -0.03 logits were

derived from a preliminary analysis that showed the subject prompt (difficulty measure =

0.03 logits) was more difficult than the cooperation prompt (difficulty measure = -0.03

logits).

The analysis of the teacher and prompt effects was conducted using multiple

methods. First, teacher internal consistency was examined: teachers who exhibited

misfitting or overfitting rating patterns were detected based on infit and outfit mean

square values. In addition, inter-teacher reliability was examined in order to explore the degree to which one teacher agreed with others when using the EDD checklist.

[18] Each teacher assessed 30 essays using the 35 descriptors and there were 24 missing responses ([7 teachers × 30 essays × 35 descriptors] - 24 ratings = 7,326 ratings).
[19] Dummy facets are intended to investigate interactions without affecting main effects (Linacre, 2009).

Three

reliability indices were computed: (a) the percentage of exact agreement, (b) point-

biserial correlation, and (c) the percentage of the teachers' ratings that agreed on each

descriptor. Finally, a bias analysis was carried out in order to further investigate the ways

in which the descriptors interacted with the teachers and the prompts. Score

generalizability could not be examined across different prompts because students did not

write essays on both of the prompts (i.e., they only wrote one essay on one prompt).

Instead, the extent to which the EDD descriptors are biased for or against the prompts

was examined to determine whether the descriptors functioned consistently across

different essay prompts.
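A minimal sketch of the first and third of these agreement indices is shown below, using invented binary ratings; the point-biserial correlation reported by FACETS is not reproduced here, and the per-descriptor computation is only one plausible reading of index (c).

```python
import numpy as np

# Invented binary ratings (1 = yes, 0 = no): 7 teachers x 10 shared essays x 35 descriptors.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(7, 10, 35))

def exact_agreement(r1, r2):
    """Proportion of essay-by-descriptor cells on which two teachers gave the same rating."""
    return float(np.mean(r1 == r2))

# (a) Mean pairwise exact agreement between teachers on the shared essays.
n_teachers = ratings.shape[0]
pairwise = [exact_agreement(ratings[i], ratings[j])
            for i in range(n_teachers) for j in range(i + 1, n_teachers)]
print("mean pairwise exact agreement:", round(float(np.mean(pairwise)), 3))

# (c) One way to read the per-descriptor index: for each descriptor, the average share of
#     teachers siding with the majority rating on each essay.
prop_yes = ratings.mean(axis=0)                      # essays x descriptors
majority_share = np.maximum(prop_yes, 1 - prop_yes)  # share agreeing with the majority
per_descriptor = majority_share.mean(axis=0)         # averaged over essays, per descriptor
print("per-descriptor agreement (first 5):", np.round(per_descriptor[:5], 2))
```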

b. Analysis of correlation between EDD scores and TOEFL scores

A correlation analysis was conducted in order to examine the extent to which

scores awarded using the EDD checklist were consistent with those awarded using the

TOEFL iBT independent writing rating scale. Specifically, a Pearson product-moment

correlation coefficient was computed to estimate the strength of the association between

the logit scores elicited from the MFRM analysis and the original TOEFL iBT

independent writing scores awarded by ETS raters on the 80 essays.
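A minimal sketch of this computation, with invented paired scores, is shown below.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired scores for a handful of essays: MFRM logit measures derived from
# the EDD checklist and the ETS holistic scores (1-5 scale).
edd_logit_scores = np.array([-1.8, -0.6, 0.1, 0.9, 1.7, 2.4])
ets_scores = np.array([2.0, 2.5, 3.0, 3.5, 4.5, 5.0])

r, p_value = pearsonr(edd_logit_scores, ets_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```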

c. Analysis of teacher perceptions and evaluations

The examination of teacher perceptions and evaluations of the use of the EDD

checklist focused on their reported confidence levels and their responses to the

questionnaire and in the interviews. The extent to which the teachers felt confident using

the EDD checklist was examined using descriptive statistics. A mean was calculated in

order to examine the degree to which the teachers felt confident in their assessments

across the 35 descriptors on 10 essays (5 essays × 2 prompts). The descriptors with the

highest and lowest confidence levels were also identified. To further explore the

relationship between teacher confidence and agreement, the two sets of scores were

plotted in the same graph. Teacher responses to the questionnaire and in the interviews

were also analyzed in order to examine how they judged the use of the EDD checklist.

The responses to the Likert-scale items were analyzed according to frequency, and the


responses to the open-ended items were analyzed descriptively. The teachers' written

comments and interview transcripts were read iteratively in order to identify positive and

negative reactions to the use of the EDD checklist. The interview results were then

integrated with those collected in the main study in order to develop a more

comprehensive picture of the teachers‟ evaluations.

Main Study

ESL academic writing teachers' essay assessment

The main study was carried out two months after the pilot study was conducted.

Ten ESL teachers assessed 480 TOEFL iBT independent essays using the EDD checklist.

These essays were divided into 40 essay batches, with each essay batch consisting of 12

essays representing all proficiency levels on the two prompts (6 essays × 2 prompts).

Table 10 shows the distribution of the essay batches assigned to the teachers. Unlike the

pilot study, it was not necessary to include a linking essay subset because the analytic

technique employed in the main study did not require a crossed data matrix. The teachers

were assigned four essay batches which were further divided into two assessment

packages. Each of the two assessment packages included 24 essays written on the two

prompts (12 essays × 2 prompts) and was counterbalanced. Five teachers assessed essays

that were written on the subject prompt first, while the other five teachers assessed essays

that were written on the cooperation prompt first.

Table 10

Distribution of Essay Batches in the Main Study

Teacher     Essay batches
Angelina    Batch 01, Batch 02, Batch 03, Batch 04
Ann         Batch 05, Batch 06, Batch 07, Batch 08
Beth        Batch 09, Batch 10, Batch 11, Batch 12
Brad        Batch 13, Batch 14, Batch 15, Batch 16
Erin        Batch 17, Batch 18, Batch 19, Batch 20
Greg        Batch 21, Batch 22, Batch 23, Batch 24
Kara        Batch 25, Batch 26, Batch 27, Batch 28
Sarah       Batch 29, Batch 30, Batch 31, Batch 32
Susan       Batch 33, Batch 34, Batch 35, Batch 36
Tom         Batch 37, Batch 38, Batch 39, Batch 40

Training took place in the same manner as in the pilot study. The teachers

engaged in individual meetings to discuss the purpose of the study and the assessment

procedure. Training with the four teachers who did not participate in the pilot study was

intensive, while training with the six teachers who participated in the pilot study focused

primarily on their questions and concerns about using the checklist according to the

revised assessment guidelines (see Appendix L[20]

). Upon completion of training, teachers

were given the first assessment package containing (a) 24 essays (12 essays × 2 prompts),

(b) the EDD checklist, and (c) the assessment guidelines. The teachers were asked to

assess 24 essays, but to indicate their confidence levels on just 10 essays (5 essays × 2

prompts). The turnaround for the first assessment results was within two weeks of the

training date. When the teachers returned their first assessment outcomes, they were

interviewed for 30 to 45 minutes to discuss the effectiveness of the checklist.

The second assessment took place two weeks after the first. Of the 10 teachers

who participated in the first assessment, eight went on to participate in the second

assessment. Two teachers were unable to participate for personal reasons, and the essays

assigned to them were scored by other participating teachers based upon availability.

Four teachers marked a set of 24 essays (12 essays × 2 prompts), while one teacher marked 48 essays (24 essays × 2 prompts) and three teachers each marked 32 essays (16 essays × 2 prompts) to make up the assessments. The second assessment package

contained (a) essays written on two prompts, (b) the EDD checklist, (c) the assessment

guidelines, and (d) the Teacher Questionnaire II, and was distributed to the teachers with

a reminder that there should be at least a two-week interval between the first and the

second assessment; this was done to examine whether the teachers could use the EDD

checklist reliably, and to determine how their perceptions of the EDD checklist changed

over time. After the teachers completed the second assessment, they were administered the Teacher Questionnaire II and were interviewed.

[20] The assessment guidelines used in the main study were slightly revised based upon the teachers' comments in the pilot study in order to enhance the clarity of the descriptors.

Unlike the first interview, this second

interview focused specifically on the extent to which the teachers thought the use of the

checklist would have a positive impact on classroom instruction and assessment. The two

teachers who were unable to participate in the second assessment round completed

Teacher Questionnaire II after the first assessment round. All interviews were tape

recorded with the permission of the teachers. When the entire assessment was completed,

all of the assessment materials were collected for security purposes. The score data and

the questionnaire data, including teacher background information, were entered into

Microsoft® Excel spreadsheets, and the interview data were transcribed using Microsoft® Word.

Main analyses of the EDD checklist

The data collected in the main study were analyzed in order to examine the

validity assumptions concerning the use of the EDD checklist. Three validity

assumptions were examined:

- The EDD checklist provides a useful diagnostic skill profile for ESL academic writing (Characteristics of the diagnostic writing skill profiles).
- Performance on the EDD checklist is related to performance on other measures of ESL academic writing (Correlation between EDD scores and TOEFL scores).
- The EDD checklist helps teachers make appropriate diagnostic decisions and has the potential to positively impact teaching and learning ESL academic writing (Teacher perceptions and evaluations).

Each facet of the assumptions provided valuable information used to justify the

validity claims for use of the EDD checklist. The results derived from the justification

process were integrated and synthesized in a complementary manner.

a. Characteristics of the diagnostic writing skill profiles

The fundamental assumption of diagnosis modeling is that the test construct of

interest is multidimensional rather than unidimensional. Under this assumption, the

ability parameter is placed onto multidimensional space representing the skills-by-items

relationship. Before estimating several parameters using diagnosis modeling, both

substantive and statistical dimensionality analyses were conducted to ensure that the

construct of ESL academic writing is multi-divisible and the diagnostic approach is well


grounded. The substantive analysis was carried out based upon the outcomes of the ESL

academic writing experts' descriptor sorting activity, in which the refined EDD

descriptors were sorted into dimensionally distinct ESL writing skills. The statistical

analysis was also conducted using a series of conditional covariance-based

nonparametric dimensionality techniques. The ratings awarded by 10 teachers on 480

TOEFL iBT independent essays constituted the primary dataset for dimensionality

analysis.

Three nonparametric dimensionality tests were implemented in this study: (a)

DIMTEST (Stout, Froelich, & Gao, 2001), (b) CCPROX/HCA (Roussos, Stout, &

Marden, 1998), and (c) DETECT (Zhang & Stout, 1999). DIMTEST is a statistical

significance test that evaluates the null hypothesis that two sets of items taken by the

same examinees, AT (assessment subtest) and PT (partitioning subtest), are

dimensionally similar to each other. AT items are selected either in an exploratory or

confirmatory manner based upon theoretical considerations including expert review or

empirical data analysis, such as cluster analysis. When the null hypothesis is rejected, the

dimensionality test statistic, T, is referred to in order to estimate the magnitude of the

AT's dimensionality distinctiveness. A greater T value indicates a greater departure from

unidimensionality.

CCPROX/HCA is an exploratory item cluster analysis that neither conducts a

significance test nor provides the magnitude of multidimensionality. Instead, it presents

the dimensional structure of a test visually, with each item constituting its own cluster

and successively combining pairs of clusters which are thought to be dimensionally

homogeneous until all of the items are joined into one large cluster. Of the many methods

for determining proximity between clusters, the unweighted pair group method of averages

([UPGMA], Sokal & Michener, 1958) has been known to provide the most accurate item

classification (Douglas, Kim, Roussos, Stout, & Zhang, 1999). In order to achieve the

best cluster solution, other dimensionality procedures (such as DIMTEST, DETECT, and

content review) have been recommended to be used in conjunction with CCPROX/HCA

analysis (Douglas et al., 1999).

DETECT is an exploratory or confirmatory nonparametric dimensionality

technique that estimates the number of dimensions present in a test and the magnitude of


multidimensionality. It also identifies dimensionally homogeneous clusters by calculating

the mean conditional covariance between all possible pairs of items in a test. The output

of DETECT analysis presents three useful indices including (a) DETECT index (or effect

size), (b) IDN index, and (c) r index. DETECT index is an overall conditional-covariance

estimator that indicates the magnitude of multidimensionality. According to Douglas et al.

(1999), when a DETECT index is less than 0.1, the test can be considered

unidimensional; an index between 0.1 and 0.5 indicates a weak degree of

multidimensionality; an index between 0.5 and 1.0 indicates a moderate degree of

multidimensionality; and an index between 1.0 and 1.5 indicates a strong degree of

multidimensionality. The other two indices, IDN index and r index, are associated with

the extent to which the data approximate simple structure, with values closer to 1

indicating that the data are closer to simple structure.
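The interpretive bands described above can be summarized in a small helper function; the thresholds simply restate the values attributed to Douglas et al. (1999), and the sample values are invented.

```python
def interpret_detect_index(d):
    """Map a DETECT effect-size value onto the interpretive bands attributed to
    Douglas et al. (1999) in the text above; values of 1.5 or more fall outside
    the bands described there and are simply flagged."""
    if d < 0.1:
        return "essentially unidimensional"
    if d < 0.5:
        return "weak multidimensionality"
    if d < 1.0:
        return "moderate multidimensionality"
    if d < 1.5:
        return "strong multidimensionality"
    return "beyond the strong band (>= 1.5)"

# Hypothetical DETECT values from successive runs.
for value in (0.07, 0.32, 0.84, 1.21):
    print(value, "->", interpret_detect_index(value))
```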

The latent dimensional structure of ESL academic writing ability was examined

in both exploratory and confirmatory manners. In an exploratory DIMTEST analysis, AT

items were selected using its built-in program, ATFIND, and were tested against the

remaining PT items several times until the DIMTEST failed to reject the null hypothesis.

Each time the null hypothesis was rejected, the initial AT items were removed from the

next run. The magnitude of multidimensionality was also examined in an exploratory

DETECT analysis. An exploratory CCPROX/HCA procedure further informed the

dimensional structure of the data. The result of the CCPROX/HCA analysis suggested a

hypothesis of dimensionality and identified items that could be used as an AT set.

DIMTEST was then conducted iteratively with varying AT sets in a confirmatory manner.

The findings from the three different methods were considered in a complementary manner in order to determine

the dimensional structure of ESL academic writing ability.

Diagnosis modeling was then carried out using the Reduced RUM. The Q-matrix

developed by the ESL writing experts and the ratings awarded by the 10 teachers were

entered into the Reduced RUM computer software, Arpeggio version 3.1 (DiBello &

Stout, 2008). After the first Arpeggio run, model parameters were estimated using a

Markov Chain Monte Carlo (MCMC) algorithm. Convergence of the chains to the desired posterior distribution was then assessed after discarding the burn-in steps. Three

different types of plots were visually inspected to determine whether the Markov Chain


for each model parameter converged to a stationary solution. These plots included (a)

estimated posterior distributions, (b) chain plots, and (c) autocorrelations of the chain

estimates.
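Numerically, much of the same convergence information can be approximated by summarizing a chain and its autocorrelation, as in the hypothetical sketch below; the draws are simulated rather than taken from Arpeggio output.

```python
import numpy as np

def lag_autocorrelation(chain, lag):
    """Autocorrelation of a single MCMC chain at a given lag (lag >= 1), after burn-in removal."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Hypothetical post-burn-in draws for one descriptor parameter.
rng = np.random.default_rng(1)
draws = 0.8 + 0.05 * rng.standard_normal(2000)

print("posterior mean:", round(float(draws.mean()), 3))
print("lag-1 autocorrelation:", round(lag_autocorrelation(draws, 1), 3))
print("lag-10 autocorrelation:", round(lag_autocorrelation(draws, 10), 3))
```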

Given that the model estimation had converged, descriptor parameters and estimates for the ability distribution for the skills were evaluated. The descriptor (d) parameter estimates π*_d and r*_dk were examined in order to determine the quality of each descriptor relative to its required skills. Specifically, the descriptor parameter π*_d was inspected in order to estimate the probability that students had correctly executed all skills required by a descriptor on the condition that they had mastered all required skills. The other descriptor parameter r*_dk was inspected in order to determine the extent to which a descriptor discriminated for the corresponding skill. When a descriptor was

found to not contribute much information for distinguishing masters from non-masters of

a given skill, the skill-by-descriptor entry was eliminated from the Q-matrix. Refinement

of the Q-matrix was carried out iteratively in a stepwise manner based upon both

substantive and statistical evidence. In addition, the skill parameter estimates p_k were

examined in order to determine whether the proportion of masters on each skill was

congruent with the skill hierarchy of ESL writing. When a skill turned out to be more

difficult or easier than suggested by ESL academic writing theories, the Q-matrix was

revised and the difficulty levels of the descriptors that were assigned to that particular

skill were examined. If necessary, the reassignment of descriptors to a skill was

considered.
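Under the Reduced RUM as it is usually written, the probability of a yes on descriptor d for a student with skill-mastery vector α is π*_d multiplied by a penalty r*_dk for every required skill k that the student has not mastered. The following sketch of that item response function uses invented parameter values and is illustrative only.

```python
import numpy as np

def reduced_rum_probability(pi_star, r_star, q_row, alpha):
    """P(yes on a descriptor | skill mastery vector alpha) under the Reduced RUM.

    pi_star : probability of a yes when all required skills are mastered
    r_star  : penalty multipliers (0 < r*_dk < 1), one per skill
    q_row   : Q-matrix row for the descriptor (1 = skill required)
    alpha   : mastery vector (1 = mastered)
    """
    exponents = np.asarray(q_row) * (1 - np.asarray(alpha))
    penalty = np.prod(np.asarray(r_star) ** exponents)
    return pi_star * penalty

# Hypothetical values for one descriptor requiring CON and GRM (skills 1 and 3 of 5);
# non-required skills are given a neutral penalty of 1.0.
pi_star = 0.90
r_star = [0.40, 1.0, 0.55, 1.0, 1.0]
q_row = [1, 0, 1, 0, 0]

print(reduced_rum_probability(pi_star, r_star, q_row, alpha=[1, 1, 1, 1, 1]))  # master of all skills
print(reduced_rum_probability(pi_star, r_star, q_row, alpha=[0, 1, 1, 1, 1]))  # has not mastered CON
```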

After the parameters were estimated, model fit was evaluated using posterior

predictive model checking methods. A residual analysis was conducted to examine the

model fit. The mean absolute difference (MAD) between observed and predicted item

proportion-correct scores was computed, with a smaller MAD indicating a better model

fit. The fit between the observed and predicted score distributions was also visually

inspected, with the two score distributions plotted onto the same graph to facilitate

comparison. If a substantial discrepancy between the two plots was observed, further

analysis was conducted. In addition, the relationship between the number of mastered

skills and the observed total scores was examined. A monotonic relationship between the two variables was taken as support for the claim of good fit.
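A minimal sketch of the MAD computation, with invented proportion-yes values, is shown below.

```python
import numpy as np

def mean_absolute_difference(observed_p, predicted_p):
    """MAD between observed and model-predicted proportion-yes values per descriptor."""
    return float(np.mean(np.abs(np.asarray(observed_p) - np.asarray(predicted_p))))

# Hypothetical proportion-yes values for five descriptors.
observed  = [0.62, 0.48, 0.81, 0.35, 0.57]
predicted = [0.60, 0.52, 0.78, 0.38, 0.55]
print("MAD =", round(mean_absolute_difference(observed, predicted), 3))
```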

After model convergence and fit were established, the quality of the diagnostic

model was examined by testing several hypotheses. The first hypothesis tested whether

the estimated diagnostic model resulted in a significant performance difference between masters and non-masters. If the diagnostic model was well constructed, the proportion-correct scores of masters were assumed to be distinguishably higher than those of non-masters across all the descriptors. Descriptors with weak diagnostic capacity were

identified and further analyzed. The second hypothesis tested whether the estimated

diagnostic model could accurately classify examinees into appropriate skill mastery state

categories. The number of skill masters, the skills probability distribution, and the most

common skill mastery patterns were checked to examine the accuracy of the

classification. The third hypothesis tested the consistency of the skill mastery

classification. Simulated examinee item response data were used to estimate several

reliability indices. The fourth hypothesis tested the extent to which the diagnostic model

was affected by method effect. The skill mastery profiles generated by the model were

compared across the two essay prompts. The fifth and final hypothesis tested whether the

estimated diagnostic model resulted in significantly different skill profiles across

different writing proficiency levels. The 480 students were categorized into beginner,

intermediate, and advanced groups according to their TOEFL independent writing scores,

and the characteristics of their writing skill profiles were compared. In addition to

evaluating the five hypotheses, a case analysis was conducted to closely examine the

quality of the estimated skill mastery profiles.
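As an illustration of the first hypothesis check, the sketch below compares descriptor-level proportion-correct scores for examinees classified as masters versus non-masters of the required skill. The response matrix and mastery classifications are hypothetical stand-ins for the study's estimates.

import numpy as np

# Rows are examinees, columns are descriptors (1 = descriptor satisfied).
responses = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
])
# Hypothetical mastery classification for the skill required by each descriptor.
is_master = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
], dtype=bool)

for d in range(responses.shape[1]):
    p_master = responses[is_master[:, d], d].mean()
    p_nonmaster = responses[~is_master[:, d], d].mean()
    # A small gap between the two proportions flags a descriptor with weak diagnostic capacity.
    print(f"Descriptor {d + 1}: masters {p_master:.2f} vs. non-masters {p_nonmaster:.2f}")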

b. Analysis of correlation between EDD scores and TOEFL scores

A correlation analysis was conducted to examine the extent to which scores

awarded using the EDD checklist were consistent with those awarded using the TOEFL

iBT independent writing rating scale. Specifically, a Pearson product-moment correlation

coefficient was computed on the observed scores awarded by the teachers using the EDD

checklist and the original TOEFL iBT independent writing scores awarded by ETS raters

on the 480 essays in order to estimate the strength of the association between the two sets

of scores.
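A minimal sketch of this computation is shown below, using scipy.stats.pearsonr on short hypothetical score vectors in place of the 480 pairs of EDD checklist totals and TOEFL iBT independent writing scores.

import numpy as np
from scipy.stats import pearsonr

edd_scores = np.array([24, 31, 18, 27, 35, 22])           # hypothetical EDD checklist totals
toefl_scores = np.array([3.0, 4.0, 2.5, 3.5, 4.5, 3.0])   # hypothetical TOEFL iBT writing scores

r, p_value = pearsonr(edd_scores, toefl_scores)
print(f"r = {r:.2f}, p = {p_value:.3f}")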


c. Analysis of teacher perceptions and evaluations

The examination of the teacher perceptions and evaluations of the use of the EDD checklist focused primarily on their responses to the questionnaire and the

interviews. It should be noted that their confidence data could not be analyzed because of

too many missing responses. Teachers were reluctant to report their confidence levels in

the main study for two reasons: reporting confidence levels for all the descriptors

required too much time, and teachers felt that their ratings were affected by the act of

indicating their confidence levels. Some teachers also mentioned that this caused them to

feel monitored by the researcher. For these reasons, teachers' confidence levels were not

analyzed and reported in the main study; however, their questionnaire and interview

responses were usable. The responses to the Likert-scale items on the questionnaire were

analyzed descriptively according to frequency. The qualitative accounts from the questionnaire and the interviews were examined using a thematic data analytic method. Teachers' questionnaire comments and interview transcripts were read iteratively, and

emerging themes associated with their evaluations were identified. Each theme was

constantly compared with others, with similar themes grouped together. The results from

the quantitative and qualitative analyses were then integrated and synthesized when the

study's findings were interpreted and discussed.

Summary

This chapter proposed five research questions formulated based upon the

reasoning process of validity arguments:

1) What empirically-derived diagnostic descriptors are relevant to the construct

of ESL academic writing?

2) How generalizable are the scores derived from the EDD checklist across

different teachers and essay prompts?

3) How is performance on the EDD checklist related to performance on other

measures of ESL academic writing?

4) What are the characteristics of the diagnostic ESL academic writing skill

profiles generated by the EDD checklist?

5) To what extent does the EDD checklist help teachers make appropriate


diagnostic decisions and have the potential to positively impact teaching and

learning ESL academic writing?

Each research question addressed one facet of the validity inferences for the score-based interpretation and use of the EDD checklist in ESL academic writing. Together, the questions guided a set of comprehensive procedures for the development of the checklist and for the justification of its score-based interpretations and uses. A mixed-methods research design

was chosen in order to build and support arguments that the EDD checklist assesses ESL

writing ability required in an academic context and provides fine-grained diagnostic

information about various writing skills. A series of validity assumptions determined the

types of data to be collected, which were then analyzed and synthesized using both

quantitative and qualitative methods. The next three chapters discuss the evaluation of a

series of validity claims for the use of the EDD checklist.


CHAPTER 4

DEVELOPMENT OF THE EDD CHECKLIST

Introduction

This chapter discusses the development of the EDD checklist conducted in Phase

1. One central validity claim was that the descriptors that constitute the checklist reflect

knowledge, processes, and strategies consistent with the construct of ESL writing

required in an academic context. In order to evaluate this assumption, fine-grained

descriptors representing ESL academic writing skills were empirically identified using

detailed verbal descriptions of ESL academic writing ability provided by nine ESL

teachers. The think-aloud verbal protocols were open-coded based upon grounded theory

and sequentially confirmed by theoretical accounts. Four ESL academic writing experts

reviewed and refined the identified descriptors to come up with the final EDD checklist.

In this chapter, the empirically-derived descriptors are systematically validated based

upon theories found in ESL academic writing assessment literature in order to make a

theory-based inference about the checklist's characteristics.

Identification of EDD Descriptors

Writing in a second language (L2) is a cognitively complex communicative act,

involving the use of multi-faceted and complicated language skills and knowledge that

directly and indirectly affect writing performance. This multidimensional view of L2

writing was evident from the think-aloud verbal protocols collected and analyzed in

Phase 1 of this study. Initial review of these protocols indicated that ESL teachers

considered a variety of subcomponents of ESL writing skills and knowledge when

determining the quality of an essay. The sheer amount of data they provided also

confirmed the depth and comprehensiveness of the teachers‟ accounts. The coding of

these protocols resulted in the identification of 39 recurrent writing subskills that formed

the descriptors of the EDD checklist (see Table 11). These descriptors were empirically-

derived, concrete, and fine-grained, addressing all aspects of ESL writing skills such as

content fulfillment, organizational effectiveness, grammatical knowledge, vocabulary use,

and mechanics.


Table 11 lists all 39 descriptors of ESL academic writing and the number of

times they occurred during think-aloud verbalization. The total descriptor tally was 1,715,

of which spelling (D34, 6.06%), essay structure (D9, 5.42%), verb tense (D22, 4.90%),

tone and register (D39, 4.78%), and essay clarity (D2, 4.72%) were the five most

frequently mentioned. By contrast, essay focus (D14, 0.87%), indentation (D37, 0.58%),

use of conditional verbs (D28, 0.52%), syntactic variety (D16, 0.35%), and paraphrasing

(D38, 0.29%) were the least frequently commented upon.

Note. D34: Descriptor 34. Hereafter, the notation "D + number" will indicate "Descriptor + number."


Table 11

39 Descriptors of ESL Academic Writing Skills

Descriptor f %

1. This essay demonstrates an understanding of the topic and answers a specific question. 67 3.91

2. This essay is written clearly enough to be read without inferring or interpreting the meaning. 81 4.72

3. This essay is concise, containing few redundant ideas or linguistic expressions. 21 1.22

4. The beginning of the essay contains a clear thesis statement. 34 1.98

5. The main arguments in this essay are strong. 47 2.74

6. There are sufficient supporting ideas and examples in this essay. 24 1.40

7. The supporting ideas and examples in this essay are logical and appropriate. 64 3.73

8. The supporting ideas and examples in this essay are specific and detailed. 21 1.22

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion. 93 5.42

10. Each paragraph is complete, with a clear topic sentence tied to its supporting sentences. 34 1.98

11. Each paragraph presents one distinct and unified idea in a coherent way. 28 1.63

12. Each paragraph links well to the rest of the essay. 18 1.05

13. Ideas are developed or expanded throughout each paragraph. 58 3.38

14. Ideas reflect the central focus of the essay, without digressing. 15 0.87

15. Transition devices are used effectively. 56 3.27

16. Syntactic variety is demonstrated in this essay. 6 0.35

17. Complex sentences are used effectively. 53 3.09

18. Normal word order is followed except in cases of special emphasis. 19 1.11

19. Sentences are well-formed and complete, and are not missing necessary components. 62 3.62

20. Independent clauses are joined properly, using a conjunction and punctuation, with no run-on sentences or

comma splices. 41 2.39

21. Major grammatical or linguistic errors impede comprehension. 42 2.45

22. Verb tenses are used appropriately. 84 4.90

23. There is agreement between subject and verb. 64 3.73

24. Singular and plural nouns are used appropriately. 40 2.33


25. Prepositions are used appropriately. 44 2.57

26. Articles are used appropriately. 52 3.03

27. Anaphora (i.e., pronouns) reflects appropriate referents. 51 2.97

28. Conditional verb forms are used appropriately. 9 0.52

29. Sophisticated or advanced vocabulary is used. 48 2.80

30. A wide-range of vocabulary is used, with minimal repetition. 16 0.93

31. The meaning of vocabulary is understood correctly and used in the appropriate context. 53 3.09

32. The essay demonstrates facility with collocations, and does not contain unnatural word-by-word translations. 25 1.46

33. Words change their forms where necessary and appropriate. 59 3.44

34. Words are spelled correctly. 104 6.06

35. Punctuation marks are used correctly. 66 3.85

36. Capital letters are used appropriately. 19 1.11

37. The essay contains appropriate indentation. 10 0.58

38. The essay prompt is well-paraphrased, and is not replicated verbatim. 5 0.29

39. Appropriate tone and register are used throughout the essay. 82 4.78

Total 1,715 100.00


When inter-coder reliability was examined, satisfactory agreement (450/515

segments, 87.38%) was found. Agreement at the individual descriptor level was also

reasonable, ranging from 70% to 100% (see Table 12). The areas of least agreement (i.e.,

< 80%) were idea development (D13, 70%) and use of collocations (D32, 70%),

followed by syntactic variety (D16, 75%), word sophistication (D29, 75%), word choice

(D31, 76.92%), and use of punctuation marks (D35, 78.57%). The second coder and I re-

examined the areas of disagreement in order to resolve discrepancies. Most

disagreements were reconciled following discussion. When no agreement was reached, I

decided which code would be assigned.
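The percent-agreement figures reported here and in Table 12 amount to a simple ratio of agreed segments to coded segments. A minimal sketch of that calculation, assuming the two coders' descriptor codes for the same segments are stored in parallel lists (hypothetical values), is given below.

coder1 = ["D01", "D02", "D09", "D22", "D34", "D13"]   # hypothetical codes from coder 1
coder2 = ["D01", "D02", "D09", "D22", "D34", "D15"]   # hypothetical codes from coder 2

agreed = sum(a == b for a, b in zip(coder1, coder2))
print(f"Agreement: {agreed}/{len(coder1)} segments = {100 * agreed / len(coder1):.2f}%")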

Table 12

Inter-Coder Reliability for the 39 Descriptors

Descriptor No. of segments No. of agreed segments Agreement (%)

D01 20 17 85.00

D02 20 18 90.00

D03 0 0 –

D04 27 24 88.89

D05 26 22 84.62

D06 10 8 80.00

D07 20 18 90.00

D08 4 4 100.00

D09 35 29 82.86

D10 15 12 80.00

D11 13 12 92.31

D12 6 6 100.00

D13 10 7 70.00

D14 0 0 –

D15 14 12 85.71

D16 8 6 75.00

D17 24 20 83.33

D18 4 4 100.00

D19 14 12 85.71

D20 10 8 80.00

D21 16 14 87.50

D22 26 26 100.00

D23 10 8 80.00

D24 24 22 91.67

D25 20 20 100.00

D26 10 10 100.00

D27 16 14 87.50

D28 4 4 100.00


D29 8 6 75.00

D30 4 4 100.00

D31 13 10 76.92

D32 20 14 70.00

D33 10 8 80.00

D34 30 30 100.00

D35 14 11 78.57

D36 0 0 –

D37 2 2 100.00

D38 0 0 –

D39 8 8 100.00

Total 515 450 87.38

Note. No agreement could be computed for D03, D14, D36, and D38 because the selected 515 segments did not include these descriptors at all. As indicated in Table 11, the low frequency of these descriptors might have caused a sampling problem.

Table 13 tallies the frequency of the 39 descriptors across teachers and essay sets

in greater detail. Overall frequency counts differed greatly from teacher to teacher: Ann

provided the greatest number of comments based on 34 descriptors, while Judy provided

the least number of comments based on 24 descriptors. In addition, the teachers who

were assigned to Essay Set 1 produced more comments than those assigned to Essay Sets

2 and 3. When frequency patterns were closely examined, counts seemed to be affected

by both the teachers' teaching experience and the length of the essay. Ann, the most experienced teacher with 25 years' ESL writing experience, was assigned to Essay Set 1, which included longer essays. Judy, with eight years' teaching experience, was assigned to Essay Set 3, which included shorter essays. This conjecture is tentative at this point; further research would be needed to identify such possible relationships accurately.

In order to ensure that the teachers' think-aloud accounts were reliable sources

for the EDD checklist, the essay scores they awarded were correlated with the original

TOEFL iBT independent writing scores awarded by ETS raters. Appendix M presents

the correlation matrices for the nine teachers. The magnitude of the association was

strong, with Pearson product-moment correlation coefficients between pairs of scores

ranging from r = .75 to r = .98, p < .05. This result confirmed that the verbal protocols

that the teachers generated were reliable sources from which to construct an assessment

tool for ESL academic writing.


Table 13

Frequency of Descriptors by Teachers and Essay Sets

Descriptor Ann Shelley Sarah James Beth George Judy Tim Esther Essay Set 1 Essay Set 2 Essay Set 3 Total

D01 20 13 0 9 14 1 8 2 0 33 24 10 67

D02 5 17 10 7 6 8 10 16 2 32 21 28 81

D03 2 0 2 0 10 7 0 0 0 4 17 0 21

D04 1 5 12 5 2 7 0 0 2 18 14 2 34

D05 2 12 8 9 5 4 4 0 3 22 18 7 47

D06 0 6 2 2 4 7 2 1 0 8 13 3 24

D07 8 22 4 10 2 5 2 10 1 34 17 13 64

D08 5 7 0 2 1 2 0 2 2 12 5 4 21

D09 19 17 8 7 16 9 4 6 7 44 32 17 93

D10 8 1 6 4 7 7 0 0 1 15 18 1 34

D11 5 2 6 6 4 2 0 3 0 13 12 3 28

D12 7 1 4 1 5 0 0 0 0 12 6 0 18

D13 4 8 0 9 14 12 0 2 9 12 35 11 58

D14 0 0 0 2 2 11 0 0 0 0 15 0 15

D15 3 1 0 4 20 6 4 13 5 4 30 22 56

D16 0 0 0 3 1 1 0 0 1 0 5 1 6

D17 12 0 0 3 26 4 0 1 7 12 33 8 53

D18 3 0 0 0 2 3 2 3 6 3 5 11 19

D19 9 9 6 4 5 10 2 9 8 24 19 19 62

D20 3 5 4 3 5 8 2 9 2 12 16 13 41

D21 11 7 2 4 8 0 2 4 4 20 12 10 42

D22 10 9 6 1 20 9 10 12 7 25 30 29 84

D23 12 5 10 0 10 13 0 8 6 27 23 14 64

D24 2 4 4 0 2 10 12 6 0 10 12 18 40

D25 6 8 4 1 4 4 2 7 8 18 9 17 44

D26 11 0 4 0 5 17 6 2 7 15 22 15 52


D27 18 3 2 2 4 5 4 9 4 23 11 17 51

D28 3 1 0 0 2 1 2 0 0 4 3 2 9

D29 4 7 4 1 8 15 2 2 5 15 24 9 48

D30 3 5 0 1 4 0 0 2 1 8 5 3 16

D31 7 12 0 6 7 6 6 7 2 19 19 15 53

D32 7 5 0 1 7 2 0 2 1 12 10 3 25

D33 9 7 6 2 3 24 2 5 1 22 29 8 59

D34 16 11 10 4 10 18 12 15 8 37 32 35 104

D35 14 2 6 2 7 5 10 8 12 22 14 30 66

D36 2 1 0 2 1 4 2 1 6 3 7 9 19

D37 0 2 2 1 0 2 0 3 0 4 3 3 10

D38 0 2 0 0 0 0 0 2 1 2 0 3 5

D39 26 11 14 4 1 13 8 0 5 51 18 13 82

Total 277 228 146 122 254 262 120 172 134 651 638 426 1715

Note. Essay Set 1 was assigned to Ann, Shelley, and Sarah.

Essay Set 2 was assigned to James, Beth, and George.

Essay Set 3 was assigned to Judy, Tim, and Esther.


The 39 descriptors were reviewed based upon theories of ESL writing and a

variety of existing ESL writing assessment schemes before they were subjected to the

academic writing experts' substantive review and refinement process. As discussed below,

it was theoretically and practically reasonable for each of the descriptors to be included

in the EDD checklist. The next section discusses each descriptor along with empirical

and theoretical accounts related to ESL academic writing, as well as the ways in which it

manifests the quality of ESL academic writing.

Descriptor 1: This essay demonstrates an understanding of the topic and

answers a specific question.

The first descriptor that drew teachers‟ attention was whether the writer

addressed the given topic. Milanovic et al. (1996) termed this task realization in

reference to the extent to which an essay meets the criteria set forth in the essay question.

Although it would seem to be a rudimentary requirement, seven teachers mentioned topic

fulfillment across all three essay sets, with a frequency of 67 (see Table 13 for frequency

tallies). For example, Ann described an essay in which the writer simply changed the

topic:

Again, that‟s off topic, you‟ve told him to decide and support it with reasons and

examples, and he‟s gone off into changing the topic. (Ann)

Beth and Judy also pointed out cases in which the writers attempted, but failed to

answer the question:

And this person starts talking about themselves and ends up talking about other

people, so again focus on the question, what is the question asking, and is the

question answered. (Beth)

And content… the content‟s fine. To me, he or she is staying with the same topic

about getting a good job, and so they explain about how it‟s important to study at

university, try to find what you‟re interested in, and hoping that if you do well in

your studies you get a good job, and then talking about how it‟s hard to negotiate

because if they choose a certain subject for the job they won‟t be interested in it

for the time, um, but then hoping they‟ll be able to change in the future. But as

far as, does it answer the question? I‟m not sure it does. (Judy)


Another consistent focus was whether the writer answered the question

completely. Some essays written on the cooperation prompt provided an incomplete

answer to the topic. James‟ verbal report exemplifies this well:

This one is more on topic than the first one. This one, the task is to talk about “In

today‟s world, the ability to cooperate well with others is more important than it

was in the past,” the previous one doesn‟t mention the past really and this one at

least answers the question about the past. (James)

Descriptor 2: This essay is written clearly enough to be read without inferring or

interpreting the meaning.

The second descriptor was related to the overall clarity of an essay, assessing

whether it read easily, with no extra effort required on the part of the reader to

understand the writer‟s meaning. This evaluation criterion is similar to what Hamp-

Lyons and Henning (1991) called communicative quality. All nine teachers responded to

the overall clarity of the essays, with frequency counts reaching 81; indeed, overall

clarity was one of the five most frequently-mentioned descriptors. In one case, Sarah and

Esther commented that it was necessary to reread the text to fully understand it:

A little bit… my first read-through of the sentences, yes I didn‟t understand, I

had to reread but after I reread it then I understood. (Sarah)

Okay, my global feeling on that one would be I need to go back and try to read it

without talking out loud. (Esther)

Similarly, Shelley and Ann reported that they had to guess the writer‟s intention:

A lot of kind of summaries of what this person thinks, people think, but no clear

sense of what the writer thinks. I mean in the conclusion… um… it becomes

clearer. So, you have to guess as you read what the writer‟s real opinion is in the

sense of what it might be. (Shelley)

I keep thinking. Maybe it is good and maybe I‟m just not getting it. It‟s very

nebulous, but he needs to condense it, he needs to be clearer in his production.

When he says, the argument should perhaps lay with the importance of studying

in itself. (Ann)

Note. Italicized transcripts indicate text read directly from an essay.


Descriptor 3: This essay is concise, containing few redundant ideas or linguistic

expressions.

The think-aloud verbal reports indicated that not all teachers considered

conciseness a primary concern when judging essays: just four teachers in two essay sets

pointed it out, with a total frequency of 21. Nonetheless, existing rating schemes do

recognize conciseness as an important essay quality. For example, Jacobs et al.'s (1981)

rating scheme describes succinctness as a key variable affecting the coherence of written

text. Despite the low frequency count, teachers in this study conceptualized conciseness

in two different ways: idea conciseness and linguistic conciseness. George‟s verbal report

illustrates the former:

So the ideas are expressed quite concisely in the first paragraph. (George)

On the other hand, Beth, Judy, and Tim pointed out redundant linguistic

expressions:

I would probably consider that redundant, and have them incorporate it into the

topic sentence of the first essay, or into the thesis statement. (Beth)

But when it comes to this question I think it is hard to say which one is important,

people should consider „both‟ these „two‟ things carefully and make their own

choose, redundant. (Judy)

But I would like to make myself clear, that‟s a bit of a redundant phrase in that

it‟s unnecessary because obviously by writing you‟re doing that. (Tim)

Descriptor 4: The beginning of the essay contains a clear thesis statement.

The teachers felt that a clear thesis statement was a necessary aspect of good

writing. A well-formulated thesis statement functioned as an essay‟s road map, guiding

readers to the central idea on which the rest of an essay was built. It usually appeared at

the end of the first paragraph of an essay to preview the essay‟s main idea. The

importance of a thesis is also described in Jacobs et al.'s (1981) ESL academic writing

profiles. Seven teachers commented on a thesis statement across all three essay sets, with

a frequency of 34:

I don‟t see any sort of overriding thesis statement or no main, um, statement,

outlining his or her argument as to what he‟s going to say, so I see that as a bit of

a weakness in this introductory paragraph. (George)


Um, what else, so I think in terms of the question, the person has not really taken

a side, has not stated that they agree or disagree, but they‟d try to say that, it kind

of, what did they say, they said that it‟s hard to say which one is more important

and that people should consider both sides. I think the person could‟ve spent a

little bit more time on their thesis. (James)

And my argument is that the ability to cooperate well with others is far less

important than it was in the past, so there‟s a clear statement of what the

argument is, what this person‟s opinion is. (Shelley)

It‟s got a topic or thesis statement, it doesn‟t have controlling ideas in the thesis

statement but still it has a thesis statement and then the body supports the thesis

statement so that‟s good. (Sarah)

Descriptor 5: The main arguments in this essay are strong.

The writer‟s ability to present a strong argument may be the most critical

content-related evaluation criterion because the argument will significantly enhance or

downgrade essay quality. This is why Hamp-Lyons and Henning (1991) included

argument as an independent evaluation criterion when constructing ESL communicative

writing profiles. Eight teachers in this study indicated that they considered argument to

be an important factor in the determination of essay quality, resulting in 47 comments

across all three essay sets. For example, Esther focused on an argument that remained on

the fence:

I think they‟re trying here to, um, they‟re trying to hedge their bets. It‟s not a

great argument, it‟s a bit wish-wash. It‟s not a great argument, not a

sophisticated argument. (Esther)

The argument is not strong, not at all. It‟s sitting on the fence argument. Again

the other one was saying why can‟t we do both, but it was sophisticated,

they‟re… I think they‟re trying to fuse the idea that these things could go

together… it‟s just not done in a sophisticated way… yet. (Esther)

Shelley also focused on the strength of the argument:

And it‟s not a general statement that‟s really accepted as usually true, so it‟s a

very weak argument with nothing to support it. (Shelley)

It gives an actual, a more objective reason for his opinion, so the first kind of

argument or the first reason is just my opinion, saying this is in my opinion. This

one has a little more objectivity to it, interests can become your career, if you

choose interest you can still be choosing a career you like, it‟s a stronger

argument and it uses an example. (Shelley)


Descriptor 6: There are sufficient supporting ideas and examples in this essay.

An essay‟s content features might also be measured through quantification. For

example, Kepner (1991) assessed the ideational quality of text by counting the number of

higher-level propositions. Similarly, Friedlander (1990) assessed the production of

content-related ideas based upon the number of specific details. In Hamp-Lyons and

Henning's (1991) rating scale, referencing was measured based on the number of

examples in an essay. Lumley's (2002) study also promoted the importance of this

assessment criterion, with raters focused on the quantity of ideas in writing even when

this was not specified in the rating scale. In this study, teachers found that sufficient ideas

and examples made the essays stronger. Seven teachers provided 24 comments on this

content feature across all three essay sets:

And just also giving more support behind your ideas because it‟s very minimal

so it‟s very brief. More support behind the ideas would be important. (George)

I‟m just not sure how many examples they really use to support. (Judy)

The support, not enough support, so first reason, this person based it on their

experience, but there‟s no reasons or details, so it would‟ve been helpful to say

for example, I want to be an engineer and I‟ve studied math and science,

something like that. (Shelley)

There‟s really only one reason. A good one would have three reasons for some

details. (Shelley)

Descriptor 7: The supporting ideas and examples in this essay are logical and

appropriate.

Logical and appropriate ideas and examples have long been considered

important evaluation criteria in ESL academic writing (cf., Brown & Bailey, 1984).

According to Witte (1983b), low-quality text does not provide appropriate elaboration on

a topic and usually requires readers to infer the intended meaning. The teachers in this

study also focused on the extent to which writers presented logical or appropriate ideas

and examples, providing 64 comments across all three essay sets. As their verbal reports

demonstrate, they tended to point out illogical ideas or examples:

My biggest problem with it for me is that it‟s very illogical. They‟ve also

supported what they‟ve said with some really interesting details, but it isn‟t

logical to say that individual means that we‟re alone and we don‟t require other

people. That‟s not what individual means. So, a basic lack of logic. (Shelley)


The evolving society thing to me makes no sense either. One does not need

anyone to deliver mail, we have mail delivery every day. Has learned how to

make an argument but hasn‟t thought about how to make the argument logical,

but I wouldn‟t even write that on the paper until I talked to the writer and said

„tell me more about what you mean.‟ I wouldn‟t want to judge it harshly but to

me it‟s not logical. (Shelley)

For example, where was it?... the one about the bank accounts, we all have our

own bank accounts so that means we don‟t really need people, well… we do

kind of need people to help us build our bank accounts. Each of the points has

something like that. That‟s a little bit weak, but you know when you‟re writing

an essay in 30 minutes, that‟s to be expected. I wouldn‟t mark off for that really.

(Sarah)

There were also cases in which supporting ideas and examples were barely

connected to an essay‟s central question. James pointed out that a writer began well by

answering a given question, but deviated from the focus of the writing when providing

relevant support in the body of the paragraph:

So the person is clearly taking a stand in answering the question that the

importance of cooperation is um, more necessary today than ever, say that, the

adage of no man is an island is even more true today than ever and they break it

into two examples, a successful example of cooperation and a not so successful

example of cooperation, but these two examples don‟t really help to answer the

question of whether cooperation is more important today than it was in the past,

both examples are historical, they‟re about historical topics like the cold war and

the war in Iraq but they‟re not really… there‟s no comparison of past and present

in them, there‟s no sense of why cooperation is more important today than it

would‟ve been back in some other time. The person starts to get that kind of an

idea in their conclusion so they start to say things like, um, because of economic

globalization, the interconnection of the economy, trade-banking and services

are all connected so today more than ever, cooperation is necessary but then the

two examples, everything in the body doesn‟t really connect to that. (James)

Descriptor 8: The supporting ideas and examples in this essay are specific and

detailed.

The presentation of specific details was also an important criterion that

determined the content-related aspects of an essay. Although seven teachers provided

only 21 comments about the necessity of concrete examples, detailed supporting ideas

and examples strengthened the writers‟ arguments and improved reading comprehension.

The teachers‟ think-aloud verbal reports describe this point well:

Again, he hasn‟t given specific reasons and examples. (Ann)


He provides a pretty sophisticated example that does support the topic. (George)

What‟s in a sense good about is that they are using some good examples.

They‟re giving concrete examples of what they‟re trying to prove. (Tim)

Um, one reason… gives the reason, gives an example with Toyota, another

reason. (Shelley)

Descriptor 9: The ideas are organized into paragraphs and include an

introduction, a body, and a conclusion.

It was evident that teachers in this study focused heavily on the overall design of

written text. All of the teachers provided a total of 93 comments across all three essay

sets, judging whether an essay followed the formal structure of an introduction, a body,

and a conclusion. Indeed, the ability to organize ideas into paragraphs was the second

most-frequently mentioned evaluation criterion. Beth and Shelley commented that they

paid particular attention to whether writers were able to outline their ideas using a well-

formulated paragraph structure:

My feeling is they have the ideas, but they haven‟t been able to sort of organize

and have a beginning, a middle, and an end. (Beth)

In terms of organization, she put all her reasons, instead of introducing them, she

did all the explanations she does for them in the opening paragraphs. She needs

to learn that the introduction just introduces the reason that she has. I think this is

probably a new paragraph and I have a sister because she‟s…, hit enter there.

(Shelley)

In addition, Ann and Esther pointed out that less-skilled writers sometimes used

too many paragraphs or none at all:

What he has done, um, totally over-paragraphed, he has no idea of how to group

like ideas. (Ann)

So on this one the first thing I noticed was that there‟s no paragraphing. That‟s

just kind of a global thing as I glanced down, oh okay, this looks like a lot within

the paragraph. The first thing I‟ll probably do is start to read through the whole

thing so I get a global sense of it. Again no paragraphing. (Esther)

Descriptor 10: Each paragraph is complete, with a clear topic sentence tied to

its supporting sentences.

According to Scardamalia and Bereiter (1987), a proficient writer makes good

use of main ideas to guide and structure their writing processes. An advanced writer


employs a topic sentence skilfully to present main ideas or claims, previewing what the

supporting sentences will be like (Fournier, 2003). Seven teachers in this study provided

34 comments on this topic. Shelley and Sarah positively evaluated essays that included

topic sentences along with their supporting ideas:

Within each paragraph, it has an excellent topic sentence. This information

within each paragraph relates well to the topic sentence of the paragraph.

Everything in capitalism is about capitalism, everything in inherent human

nature is about inherent human nature. As a writer it flows fairly well, gives an

argument, an example in paragraph two, so capitalism is this, capitalist society

this, the word individual means this, then does the example. (Shelley)

And each body sentence has a topic sentence, um, that‟s supported with

supporting ideas, so organization is good. (Sarah)

Teachers also felt that reading comprehension was more difficult when essays

lacked topic sentences. Beth‟s comment exemplifies this well:

Okay, so again, hard to follow, lack of topic sentences. Very… weak

introduction, um, I think my greatest problem with this one is lack of

organization, I don‟t have topic sentence. I can‟t determine the supporting

sentences. (Beth)

Descriptor 11: Each paragraph presents one distinct and unified idea in a

coherent way.

In Jacobs et al.'s (1981) rating scale, a well-formulated paragraph contains a

single main idea presented in a coherent way, with each paragraph distinguished

conceptually from the others. If a paragraph contains more than one idea or the idea is

not distinct from those in other paragraphs, it will ruin coherence at the paragraph level.

Formulating cohesive paragraphs was also a concern in this study, with seven teachers

providing 28 comments across all three essay sets. Ann noted a case in which multiple

ideas were presented without enough links in a single paragraph:

Again he‟s…, there‟s no cohesion. Within this paragraph, the sentences started

off with the sister, not getting a job, now he‟s into being useful to your country,

and pay your life…, referring back to killing in the first paragraph. (Ann)

A single idea stretched across two paragraphs was considered confusing. Both

Sarah and James pointed out this problem:

Um, I would say, nowadays, love is different, having a wife and how you get her.

Um, with the technology you can go on websites to meet people. Maybe this


should be part of the first paragraph, if it wasn‟t set off on its own then it would

seem cohesive because the previous sentence, when he finds her he would need

to talk to the parents and the parents would have to see if he is convenient for the

girl. It‟s continuing along the same subject, so I suppose it is cohesive if it was

together but it‟s confusing because it‟s a new paragraph and I expect to see a

topic sentence and that‟s not one. (Sarah)

So each of the two body paragraphs doesn‟t really have a main idea to it either

because the second one the person will start talking again about being a soccer

player, so in the first paragraph they start it by talking about their dream to be a

soccer player. The second paragraph they continue talking with themselves as an

example being a soccer player. If I spend a lot of time looking at this, I can

probably figure out the difference between the two body paragraphs was, but it

doesn‟t immediately stand out. So, that‟s part of my impression that it‟s not

organized or not coherent, or based on a single main idea. (James)

Descriptor 12: Each paragraph links well to the rest of the essay.

In order to achieve coherence, each paragraph must relate logically to preceding

and successive paragraphs (Jacobs et al., 1981). Although not many teachers in this study

noted the issue of coherence between paragraphs, it did draw some attention: five

teachers commented across two essay sets, for a total frequency of 18. Ann pointed out

one case in which paragraphs were discrete, with no links between them:

His…there‟s no links between paragraphs, they‟re discrete, first second third…

(Ann)

Teachers commented positively when paragraphs were connected well. Sarah‟s

report illustrates this point well:

Okay, he‟s alluding back to the points he‟s made in the preceding paragraphs,

which is good, and simplifying them. Rephrasing, allows individuals to get on

with their lives. Good links. (Sarah)

Descriptor 13: Ideas are developed or expanded throughout each paragraph.

Idea or thesis development has long been regarded as an important criterion by

which to assess ESL writing. Brown and Bailey (1984) referred to it as logical

development of ideas, and considered it a subscale of their ESL academic writing scale.

Similarly, in their rating scale, Jacobs et al. (1981) noted that a well-written essay

develops and expands a thesis or main ideas into a paragraph unit to convey a sense of

completeness. In this study, seven teachers provided 58 comments on idea development

across all three essay sets. George commented that idea development was likely to be


associated with the length of writing; he specifically indicated “little writing” and

“volume of writing” in relation to paragraph development:

Okay, what I can see here is there‟s little writing, so because I do look at the

volume of writing that one can do in 30 minutes, seeing it‟s a few sentences

broken in very small paragraphs suggests to me there‟s some issue with

proficiency level in the way it‟s written, because paragraphs don‟t usually just

have two sentences, but they need more to expand the writing or expand the

paragraph development because the paragraph development is really weak.

(George)

He needs to go. I can see the gems of it, people do not need to talk to each other,

you can see where he‟s going, but he just hasn‟t expanded it enough. (Ann)

Esther commented that linguistic resources are an essential tool to expand a

writer‟s argument:

They don‟t know how to develop their argument at all, they‟re repeating the

same sentence over and over again, there‟s a hint to me that they want to talk

about business and they want to compare it to the past… the way we can

cooperate in the past and future… but they don‟t have the language to do it and

they don‟t know how to develop it. (Esther)

Descriptor 14: Ideas reflect the central focus of the essay, without digressing.

Sperber and Wilson (1986) suggested that coherence is affected by the extent to

which relevant information is given to a particular context. Similarly, Fischer (1984)

argued that pertinence is an index that determines the overall impression of an essay in

foreign language writing. The importance of keeping an essay focused was also

identified in this study. Although only three teachers provided a total of 15 comments on

the issue of digression across two essay sets, loss of focus in essays was considered

problematic.

The excerpts below illustrate the teachers‟ thought processes on

digression:

All the sudden he‟s talking about insuring successful business, so we‟re kind of

losing focus. (Beth)

Sometimes there‟s digression, whereas in English we like the writing to be more

concise, this, I see, is a little too…it digresses into his personal experience which

is interesting but it doesn‟t sort of, keep the focus. It loses the focus a little bit.

(George)

Note. The low frequency in this category could be attributed to the fact that the concept of digression overlaps with other coherence features. This issue is revisited when the ESL academic writing experts' reviews are discussed in the next section.


That is why the FBI and CIA…there‟s a bit of digression in this thinking. A lot of

focus on the Second World War… yeah, and then really digressing completely

from the topic… (George)

Descriptor 15: Transition devices are used effectively.

Despite heated debate on the relationship between cohesive devices and overall

writing proficiency (e.g., Evola, Mamer, & Lentz, 1980; Grabe & Kaplan, 1996), the use

of transition devices is considered critical in written discourse. Transition devices are

words or phrases that bridge a thought from one sentence to another or from one

paragraph to another. Good transitions connect ideas smoothly, with no abrupt jumps or

breaks, helping to create a unified piece of writing. Several types of transition devices are

used to move readers in the writer‟s intended direction, including (a) example, (b)

addition, (c) emphasis, (d) comparison, (e) contrast, and (f) cause and effect. Eight

teachers in this study noted the use of transition devices, providing 56 comments across

all three essay sets. Tim, Ann, and George commented that the correct use of transition

devices signalled subsequent ideas effectively:

But at least they use „thus,‟ so they‟re letting us, telegraphing to us that they‟re

now giving us the thesis. This is what they agree with it and what they‟re going

to prove in this essay is that is true, because they agree with that. (Tim)

As I said, he starts the last finally, where he‟s going to summarize and he‟s tried

to link it back to the points he‟s made. (Ann)

Then the framing of the second paragraph to begin with is great in the second

place, to sum up, so that shows the reader the different steps in the argument,

first, second, and the conclusion. (George)

Beth noted that when appropriate transition devices were not used, it was

difficult for readers to follow the text:

It‟s interesting because… first paragraph, personal reference, and then reference

to others, so that‟s okay… second paragraph is just personal reference… my

feeling is the first and second body paragraphs are almost contrasts but not able

to use that contrast, here I am, maybe that‟s why I‟m saying it‟s too simple to

begin with, although I gave up my dream to be a soccer player, I still believe that

the answer should be in the first place, people should also consider about their

future job. I guess what I‟d try to look for is on the other hand because they‟re

discussing two choices, making the comparison between the two potential

choices but comparison isn‟t coming out, it‟s a lack of grammar, that might just

be opinion based but it helps cohesion to see someone is actually making points

about the first choice, and making points about the second choice and then


maybe making their own conclusion, but that transition is inevident, so it‟s

harder to read. (Beth)

Descriptor 16: Syntactic variety is demonstrated in this essay.

Few existing rating scales exclusively measure syntactic variety in ESL writing.

Although the ETS (2007) writing rubrics and Jacobs et al.'s (1981) analytic scale

describe syntactic variety and varied sentence types as the characteristics of good writing,

they are not assessed as an independent textual quality. This might be because those

qualities are interrelated with other syntactic features and are automatically satisfied if a writer achieves syntactic effectiveness elsewhere in the written discourse. Syntactic variety might therefore be co-constructed through the interplay of other essay characteristics. In this

study, only four teachers provided six comments on the extent to which a writer

demonstrated structural flexibility in the composition. The excerpts below present the

teachers‟ perceptions of syntactic variation:

I‟m not seeing a lot of demonstration of a variety of syntax and grammar. (Beth)

So yeah there‟s some… there‟s a pretty good variety, um, there‟s no passive

structures, but they don‟t need to be necessarily. It‟s not a problem that I would

notice right away. (James)

Descriptor 17: Complex sentences are used effectively.

Although the validity of complexity measures has often been questioned (Polio,

2001)

, grammatical complexity is regarded as a critical measure in SLA studies that

determine the quality of ESL writing. It generally encompasses multiple dimensions of

variation, density, and sophistication, judging the presence of specific grammatical

features such as coordinate clauses, independent clauses, and dependent clauses. Fischer

(1984) referred to grammatical complexity as syntactic complexity in his written

communication rating subscale. The IELTS writing rubric also assessed complexity in its grammatical range and accuracy subscale.

The teachers in this study identified the need to assess a writer's ability to use complex sentence structures effectively, with six teachers providing 53 comments across

all three essay sets. (As Polio correctly pointed out, the ability to produce complex sentences does not necessarily imply high proficiency in writing, since essays composed of overly complex sentences are not always good essays. By the same token, more proficient learners who experiment with newly-acquired, complex linguistic features can make more errors than less proficient learners; Fulcher, 1996b.) Ann and George focused on sentence structure sophistication; George in particular assumed that students' appropriate use of conjunctions and connecting phrases was a hallmark of advanced writing skills:

The first reading I thought, my goodness, but then I looked at it, it had a lot of

complicated structures. They don‟t always come off but certainly they‟re used

appropriately, and they‟re very complicated. He‟s really trying for some

sophisticated language. (Ann)

Yeah, oh yeah, the writing is quite advanced because they have the conjunctions,

the connecting phrases that make the writing flow nicely, and sophisticated

sentence structure, you know, compound sentences. (George)

Tim noted a case in which the writer did not know how to combine ideas spread

across several simple sentences into one single complicated sentence. He recommended

that the writer use a relative clause:

I have a sister. She older than me. She finish her school now. Should make that a

relative clause, and could‟ve made one sentence out of those first three. (Tim)

Beth described a case in which the writer failed to create a complex sentence due

to inability to distinguish a dependent clause from an independent clause:

When you study subjects interesting to you, a dependent clause, inability to

distinguish between dependent and independent clauses, that‟s what I refer to

them as complex sentences. Inability to properly form complex sentences. (Beth)

Descriptor 18: Normal word order is followed except in cases of special

emphasis.

Different languages have different word order systems. For example, the Korean

and Japanese languages follow a SOV (subject-object-verb) system, whereas Arabic and

Hebrew follow a VSO (verb-subject-object) system. Word order in English follows the

SVO (subject-verb-object) rule, which is fixed at the sentence level (Celce-Murcia &

Larsen-Freeman, 1999). English word order is considered relatively easy to teach

compared to other English grammar rules. In this study, six teachers provided 19

comments on the ways in which writers ordered their words. Specifically, Ann and

Esther discussed the basic word order in English:

Word order, it‟s not a serious problem but there‟s several instances. (Ann)

Some of the word order is okay, like the subject verb is in order. There‟s a

subject and a predicate, but not always. Not always. (Esther)


George focused specifically on the writer‟s word order within a phrase:

A little word order here with adjective placement, so one of Mexico‟s top

colleges, sort of focusing on adjective order is important. (George)

Esther commented on the influence of first language on word order, noting that

one writer was definitely translating L1 text to English with no awareness of the English

word order system:

Hah…, I would explore with the student, word order. I think such thing is now

both, both now and past, exist. Yeah. I wonder… I wonder if…, again I would

know more when I knew the student but I wonder if the word order… in their

mother tongue, I‟d want to explore that with them. They‟re translating. I‟d go

back but I‟m wondering…, they‟re certainly translating. (Esther)

Descriptor 19: Sentences are well-formed and complete, and are not missing

necessary components.

When sentences do not contain all of their necessary constituents, they are called

sentence fragments. Although sentence fragments may be punctuated or capitalized like a

sentence, they are technically phrases or clauses. According to Hinkel (2004), separated

adverb clauses or prepositional phrases are the most common types of sentence

fragments found in student writing. All of the teachers in this study noted sentence

fragments, providing 62 comments across all three essay sets. Shelley reported a case in

which a sentence began with a conjunction:

Started with a conjunction so it made an incomplete sentence, needs to change it

to however, again another conjunction, so it‟s technically a sentence fragment,

but the fragment is created from starting a sentence with a conjunction. So I

would explain it that way, tell them that they could use however, and that would

make the sentence accurate, or make this clause part of the previous sentence.

(Shelley)

Another type of sentence fragment that occurred frequently in the essays was a

missing subject or verb. Ann and Sarah pointed out this problem:

One recurrent error seems to be the lack of a subject in phrases. „It‟ is important,

um, my country, „we‟ do not have courses. (Ann)

Well here the student forgot a verb, one reason why I think the ability to

cooperate well with others „is‟ more important today, forgot the word „is‟. Just

wrote others more important today, so that‟s an issue. (Sarah)


Descriptor 20: Independent clauses are joined properly, using a conjunction and

punctuation, with no run-on sentences or comma splices.

At the sentence level, the two most common grammatical errors are run-on

sentences and comma splices. Run-on sentences occur when two or more independent

clauses are joined with no punctuation or conjunction, whereas comma splices occur

when two independent clauses are joined with a comma but lack a coordinating

conjunction. All of the teachers in this study noted run-on sentences and comma splices,

providing 41 comments across all three essay sets. George felt that readers could become

lost in run-on sentences:

What I probably suggest is that there was that bit of a run-on sentence in that

second paragraph, and this is where editing comes in to keep the focus clear,

once the reader starts getting lost in the writing in the form, then the message

disappears. So what I‟d recommend, the writer put a colon here, anywhere, with

semi-colons and colons. I‟ll often see if there‟s a way to make it into two

sentences, make it more concise and efficient, and clear. (George)

Tim commented that too many thoughts in one sentence without appropriate

connectors interfered with reading comprehension:

Run-on sentences. Um, comma splices… they lack the ability to cut things short,

to get to the point. They‟re stringing too many thoughts together so it‟s really

hard to figure out what they‟re saying. (Tim)

That sentence also is a comma splice, trying to join two thoughts together with a

comma, which is not possible. They either need to create a new sentence, two

sentences or join it with a coordinator. (Tim)

Descriptor 21: Major grammatical or linguistic errors impede comprehension.

Hendrickson (1980) defined a global error as “a communicative error that causes

a proficient speaker of a foreign language either to misinterpret an oral or written

message or to consider the message incomprehensible with the textual content of the

error”, and a local error as “a linguistic error that makes a form or structure in a sentence

appear awkward, but, nevertheless, causes a proficient speaker of a foreign language

little or no difficulty in understanding the intended meaning of a sentence, given its

contextual framework” (p. 159). The teachers in this study commented that global errors

tended to obscure meaning, whereas local errors did not interfere with their

comprehension. Eight teachers provided 42 comments on this grammatical feature across


all three essay sets. The excerpts below present the teachers' perceptions of global and

local errors:

Okay. Yeah, very weak grammatically in that it interferes with my

comprehension, the country that elects this kind of cars. (Ann)

Um, okay, so again a few grammatical errors do not infringe on meaning but

causes the reader to have to pause and read again, for example the first sentence

of the fifth paragraph, a good example would be the current Iraqi situation of

political chaos where the presented Iraqi government council is supposed to be

representing, minor but… (Beth)

Unfortunately, the grammar errors are obscuring the comprehension of that.

(Tim)

Descriptor 22: Verb tenses are used appropriately.

The English verb tense system seems insurmountably complex from a cross-

linguistic perspective, and it requires considerable effort for ESL learners to master the

12 tense-aspect combinations. Celce-Murcia and Larsen-Freeman (1999) noted that the

use of tense-aspect-modality (TAM) can be fully grasped “only when we consider their

discourse-pragmatic and interactional features as well as their formal and semantic

features. The challenge of the English TAM system...is on use” (p. 174). The complex

nature of the English verb tense system is also found in Vaughan (1991). She noted that

raters consistently focused on problems with verb tense in ESL student writing, making it

the third most-frequently mentioned evaluation criterion. The teachers in this study were

also concerned about verb tense issues. Nine teachers provided 84 comments across all

three essay sets, making verb tense the third most-frequently mentioned evaluation

criterion among the 39 descriptors. As the teachers' reports demonstrate, some writers

had difficulty using a consistent verb tense to indicate one time frame. James and Judy

described this point well:

Um, then there‟s some grammar things, sometimes a present, past tense, when I

was a child, my dream „is‟ to be a soccer player. (James)

Last paragraph, for example, I was interested in chemistry university, and I finish

the bachelor with very good grades…, finish„ed‟ because it happened in the past,

university took place before. (Judy)


When a writer inappropriately expressed time references in describing
incidents that happened in different time frames, it seriously impeded the readers'
comprehension. Beth and Tim made this point well:

Um, attempt to use time phrases and cause and effect clause but not able to use

them to translate meaning, um, the war was still happen in the USA. So again

time references aren‟t correct and therefore the use of verb tenses are not correct,

and that example is, that means in the past two weeks, the war was still happen,

so in the past two weeks. This isn‟t about the past two weeks, and the war still

happen, again, incorrect use of grammar tense, jumping from grammar tenses,

jumping back and forth, inability to follow storylines, which is what they seem

to be trying to do here. That‟s, Thousand people might be unluckily. That sounds

horrendous. But it is true. And this, I‟m assuming, is a reference to the British

war, so I‟m assuming it‟s a reference to the past but an inability to express the

past, so inability to express ideas in the proper time is a great weakness. (Beth)

Sure, the first sentence, entirely from this thing, I think both the present and past.

We‟re talking about two time frames here. We need a verb that‟s going to agree

with both of them. We need a present and a past tense. (Tim)

Descriptor 23: There is agreement between subject and verb.

Subject-verb number agreement is a relatively easy concept for English learners

to master, although there are some exceptions to the rules. Because the rule is so
straightforward, breaking it rarely impeded reading comprehension. Seven teachers
provided 64 comments across all three essay sets. As the excerpts

below show, most errors occurred when writers attempted to create third-person singular

present verbs:

That is why technology exist„s‟, subject verb agreement, but from a content

perspective, I know what his opinion is and I‟m assuming what‟s going to follow.

(Ann)

So that is a good sense of the importance of organization but there are some

grammatical issues, subject verb agreement, someone totally do„es‟ not pay

attention… If someone „is‟ very interested, so again someone being singular,

seems a bit of a problem, the student has to know that someone is singular and

she/he decide„s‟ so again subject verb agreement, so we have grammatical

accuracy issues, subject verb agreements. (George)

Again in the first paragraph, the interaction between human beings have

increased should be has increased, person hasn‟t recognized…, subject verb

agreement is the word interaction, not the word means, a common error, a noun

followed by a prepositional phrase, which ends in a plural, people sometimes

assume that the word closest to the verb is going to be the one that makes the


agreement but in this case it‟s not because the interaction makes the noun there.

(Tim)

Descriptor 24: Singular and plural nouns are used appropriately.

English nouns have two different forms: singular and plural. Uncountable nouns

take a singular form, whereas countable nouns take both singular and plural forms. As

with other local grammar errors, misuse of singular and plural noun forms did not cause

serious misinterpretations of text. While it could be a recurrent problem in student

writing, teachers were usually able to guess the writer's intended meaning. In this study,

seven teachers provided 40 comments on this topic across all three essay sets. The

teachers' thoughts on this local grammar feature are presented below:

… due to low transportation, American had transportation, Americans, plurals.

(George)

… finished the bachelor degree with very good grade„s‟, plural for grades (Judy)

So the ability to cooperate well with others is generally necessary in everyday

lives, there‟s no need for a plural there. If you‟re going to say that in peoples‟

everyday life, if they have put that as the possessive adjective, then they could‟ve

used lives, but because of the way they‟ve written, it has to be a singular. (Tim)

Started out really well, but by saying I consider in today‟s world the ability to

cooperate well with other… should‟ve been others. (Shelley)

Descriptor 25: Prepositions are used appropriately.

Prepositions can have multiple meanings in different instances, and those

meanings are constructed in different ways (Celce-Murcia & Larsen-Freeman, 1999).

Taylor (1993) referred to this aspect of prepositions as “polysemous”. Their polysemous

quality renders prepositions difficult for even advanced ESL learners to master, although

their function is sometimes quite basic. Prepositions can also be combined with other

lexical units such as nouns, adjectives, and verbs to form a particular meaning. In this

study, nine teachers provided 44 comments across all three essay sets. As Tim noted,

students sometimes did not know the accurate meanings of “in” and “on”:

Problem with preposition. It‟s „in‟ the labor market, not „on‟ the labor market.

Prepositions are often difficult. It makes a huge difference because as soon as

you use the wrong one, you know something is quite off there. (Tim)


In Shelley's report, one writer combined two prepositions inappropriately, so

that neither functioned as intended:

In behind, you‟ve got two prepositions here, and probably neither of them works.

(Shelley)

In some cases, the writer did not know which prepositions should be paired with

particular nouns, adjectives, and verbs. Ann's comments illustrate this point well:

When we graduated high school, any interests „on,‟ it‟s more like, um, but

interest „in,‟ the combinations rather than just prepositions in general, graduated

„from‟ high school. (Ann)

Descriptor 26: Articles are used appropriately.

The definite article “the” and indefinite articles “a” and “an” are part of an

English reference and determination system that is notoriously difficult for students with

non-English backgrounds to master (Celce-Murcia & Larsen-Freeman, 1999). For this

reason, some researchers have argued that English articles are unteachable (Dulay, Burt,

& Krashen, 1982). Seven teachers in this study provided 52 comments on article use

across all three essay sets. Ann and Judy discussed it as follows:

He stated his argument again, problem with articles, both the definite and

indefinite, but he stayed in his premise, he said. Then he goes on to say I will

extend my article in three points. Again, articles. (Ann)

But if I choose a subject related to job or career, I‟ll not interest in job at the

time, of course I‟ll be happy to get the job directly, but in future I‟ll try to change

the area of work or I can develop the area of work or because not interest… um,

there‟s a missing article, if I choose the subject related to „a‟ job or „a‟ career.

(Judy)

Sarah commented that a misused article did not cause a major comprehension

problem:

There are only a few issues with um, articles using „the‟ when it‟s not necessary

but even that mistake is not huge because it doesn‟t impede communication and I

still clearly understand what the writer means. (Sarah)

Descriptor 27: Anaphora (i.e., pronouns) reflects appropriate referents.

Anaphora is a linguistic expression that refers to an antecedent, and an

antecedent is an object to which an anaphora refers in discourse. In English, an anaphora

typically takes pronoun form. Although the reference and pronoun system in English is

quite straightforward, misuse of an anaphora or omission of antecedents can make it


difficult for readers to understand a writer's intent. In this study, all of the teachers noted

the appropriate use of referencing, providing 51 comments across all three essay sets.

Esther pointed out that it was not clear what the pronoun “it” referred to:

Because I regret it, we don‟t know what it is and we don‟t know of course if they

regret answering the question or whether they regret that they didn‟t study

subjects they were interested in, so there‟s no reference. (Esther)

There was an occasional lack of consistency between pronouns. For example,

Ann and Tim commented that it was confusing when writers used multiple different

pronouns to refer to the same referent:

It‟s, um, consistency, like he‟ll talk about the people and then say he, or um, I

can‟t find the other one. There were a couple of other examples… I can‟t find it.

I‟ll just study economics even though he didn‟t have any interest on them. I mean,

that‟s hard to tell. (Ann)

Here we have a problem with this pronoun agreement, because he‟s talking

about studying what we‟re interested in. That‟s a singular, then when he goes

into the next one, he certainly brings the pronoun them, which what does it refer

back to? He doesn‟t give us anything. He is agreeing with that, then he goes on

to finish it, with an it. So we have started with a singular subject. We‟ve

switched to a plural pronoun and then to a singular pronoun which totally

confuses us, because we don‟t know what the them refers to. The it for the study

perhaps but the them throws it off. (Tim)

Descriptor 28: Conditional verb forms are used appropriately.

As with the tense-aspect system, it is difficult for English learners to have a full

syntactic and semantic grasp of conditional sentences. This could be because conditional

sentences consist of two clauses and imply three different kinds of semantic relationships:

(a) factual conditional relationships, (b) future conditional relationships, and (c)

imaginative conditional relationships (Celce-Murcia & Larsen-Freeman, 1999). Although

frequency in this category was quite low (five teachers provided nine comments),25

teachers in this study pointed out frequent conditional verb errors that writers committed.

The teachers' perceptions of conditional verbs are as follows:

And again his verb sequencing there, he needs some sort of conditional, they

would, they could, they should. (Ann)



A lot of modals are being used, um, and you know the conditional is being used.

(George)

Instead of I will not interest, I would… conditionals should be in there, I would

not be interested in that job. (Judy)

They could give up the job, a lot of this could thing, the use of could, um, that

conditional comes out in a lot of… especially in Korean writing and getting to

understand why you would not use could. (Shelley)

25 The low frequency might be because the knowledge of tense-aspect subsumes that of conditional verbs. This issue is revisited when the ESL academic writing experts' reviews are discussed in the next section.

Descriptor 29: Sophisticated or advanced vocabulary is used.

Read (2000) defined word sophistication as “a selection of low-frequency words

that are appropriate to the topic and style of the writing, rather than just general,

everyday vocabulary,” which included “technical terms and jargon as well as the kind of

uncommon words that allow writers to express their meanings in a precise and

sophisticated manner” (p. 200). In general, word sophistication is measured as the

proportion of advanced words appearing in the text; however, there is a problem of

subjectivity with regard to what is actually considered advanced (Laufer & Nation, 1995).

The teachers in this study noted word sophistication in student writing, providing 48

comments across all three essay sets. As Beth's verbal report shows, teachers gave

positive evaluations to essays written using sophisticated words:

I like the attempt at using more sophisticated vocabulary, so, utterly, delved,

dazzling, um, in this regard, um, transparency. (Beth)

When writers used unsophisticated words, the teachers pointed them out

immediately. George and Shelley mentioned that general words such as “things,” “bad,”
and “good” were too vague to convey the intended meaning effectively:

Instead of two things because things is so general, state specifically what you‟re

talking of and use phrases from the prompt. Because it just makes it more clear,

using a word like thing is so vague. (George)

Not sophisticated, probably you will be a bad professional, you will be a good

professional if you… we don‟t know what they mean by that, it‟s too general,

that to me is middle school, what would a bad professional be, what do you

mean by that. Not sophisticated. Needs a lot more depth but I think there‟s an

effort to go above the high school level. (Shelley)

I can hear like an 18 year old saying this stuff. It‟s not extremely sophisticated or

academic. (Judy)
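The proportion-based measure of word sophistication described above can be operationalized in a few lines of code. The sketch below is purely illustrative: the tokenizer, the high-frequency word list, and the example sentence are hypothetical, and the subjectivity noted by Laufer and Nation (1995) lies precisely in choosing that list.

def lexical_sophistication(text, high_frequency_words):
    # Proportion of word tokens falling outside a high-frequency word list.
    # `high_frequency_words` is a hypothetical stand-in for a frequency band
    # (e.g., the most common 2,000 word families).
    tokens = [w.lower().strip(".,;:!?\"'") for w in text.split()]
    tokens = [w for w in tokens if w]
    advanced = [w for w in tokens if w not in high_frequency_words]
    return len(advanced) / len(tokens) if tokens else 0.0

# Toy usage with an invented mini list of "common" words:
common = {"the", "of", "to", "and", "a", "is", "in", "we", "it", "this"}
print(lexical_sophistication("We need transparency and accountability in this regard.", common))  # 0.5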


Descriptor 30: A wide range of vocabulary is used, with minimal repetition.

Word variety refers to the degree to which a series of diverse words are

presented with the skillful use of synonyms, superordinates, and other related expressions,

without repeating the same words in a limited range (Read, 2000). According to

Linnarud (1986) and McClure (1991), the measure of word variety is calculated as the

proportion of different words to the total number of words in the composition. Compared

with word sophistication, fewer teachers in this study focused on word variety: six

teachers provided 16 comments across all three essay sets. As their verbal reports

illustrate, most teachers noticed when writers repeated words:

He‟s not using any substitutions, he‟s not saying „work together‟ or „participate,‟

or anything to replace „cooperate.‟ (Ann)

A lot of repetition, coordination, coordination, coordination, and so often that

bounces off the page quickly. (Beth)

What it looks like to me is, you‟ve got interest, interested, and interesting. What

this person is trying to do, is to use all those words to make his essay interesting,

but has failed because it just goes on and on. Three words the same in one

sentence, what is the point? (Tim)

I think the choice of the word drive is interesting. Talk about being driven…, it‟s

not something most students are being aware of, being able to use that way. I

think this person understands what they‟re doing. But I wish they had used

another word or explained what they meant by that, just because they used the

word drive over and over. It‟s not like okay, because driven is very strong so I‟m

not convinced that they know what they meant, they needed to put some variety

in there to make the idea more clear. (Shelley)
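The word-variety measure attributed above to Linnarud (1986) and McClure (1991) is essentially a type-token ratio: the number of different words divided by the total number of words. A minimal sketch, again with a hypothetical tokenizer and invented example sentences:

def word_variety(text):
    # Proportion of different words (types) to total words (tokens).
    tokens = [w.lower().strip(".,;:!?\"'") for w in text.split()]
    tokens = [w for w in tokens if w]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Repetition lowers the ratio, echoing the teachers' comments above:
print(word_variety("Coordination, coordination, coordination is needed."))   # 0.6
print(word_variety("Cooperation, teamwork, and collaboration are needed."))  # 1.0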

Descriptor 31: The meaning of vocabulary is understood correctly and used in

the appropriate context.

Even when writers use a wide range of sophisticated words in their writing, word

choice that imposes an incorrect semantic meaning can be problematic. As Laufer and

Nation (1995) rightly argued, the issue is not a rich vocabulary but a well-used rich

vocabulary that has a positive impact on the written text. In this study, eight teachers

provided 53 comments on word choice across the three essay sets. In one case, a writer

used a word that was semantically unrelated to the intended meaning:


Hence it makes little sense to study extravagant subjects, I mean it starts off

good but extravagant, it is an inappropriate adjective. Doesn‟t tell me anything?

It‟s totally out of place. Extravagant, we talk about material things. (Tim)

There were also cases in which a writer used a word that was semantically close

to the intended meaning, but did not convey that meaning accurately:

The vocabulary is very good, it‟s got a lot of good phraseology, um, good choice

of verbs… cut and dried question, ability to inspire, ignite parents‟ debate. Um,

my problem is, he seems to be throwing them in without quite knowing what

they mean. So it looks very impressive at first, then you think, „well…, wait a

minute, what does that actually mean?‟ So, I think the vocabulary, um, the good

phrases there, covers up problems with accuracy. (Ann)

This to me sounds like a student who‟s learned lists, okay, one there, one there,

without knowing the exact meaning of the vocabulary. Inspire students is good,

but inspire students‟ concern does not follow. A cut and dried question. No. No

matter knowledge or skills, there are obstacles awaiting us. As I said I think

reading, giving him substitution exercises where he has to put them in would

help. (Beth)

So I will bolster doesn‟t really work in this context, and it made it hard to figure

out what the opinion was. (Shelley)

Descriptor 32: The essay demonstrates facility with collocations, and does not

contain unnatural word-by-word translations.

Although it is notoriously difficult to define collocations and different definitions

abound in the literature (Leśniewska, 2006), they are generally understood as word

combinations that occur together more often than would be expected by chance. The

restrictiveness of collocation rules makes it challenging for ESL writers to use them

appropriately in written discourse. Indeed, Waller (1993, as cited in Leśniewska, 2006)

considered collocations the language feature that stigmatized “a foreign accent in writing.”

In this study, seven teachers provided 25 comments on the use of collocations across all

three essay sets. Ann and Shelley were particularly sensitive to semantically general

“high-utility” verbs (Leśniewska, 2006) such as “make” and “do” and their

accompanying nouns:

Do the opposite, make a career, those are the sort of verb-and-noun collocations.

That are often, you have to know them, it‟s very difficult to apply rule and work

it out, but the language sounds natural as well. (Ann)


I always point them out but I don‟t um… word choice errors, I did the best

choice rather than made the best choice, they need to understand it‟s like a

collocation, you make a choice not do a choice, or verb phrases, I often put them

as verb phrases. (Shelley)

Descriptor 33: Words change their forms where necessary and appropriate.

The correct use of word form was another criterion that determined overall

writing quality. Engber (1995) identified word errors focusing on derivations, verb forms,

phonetic and semantic associations, and spelling. Derivational errors occur when a writer

is not able to discern different word groups such as nouns and verbs, and errors in

phonetic and semantic association occur when a semantically unrelated word is

improperly used by analogy with phonetic similarity (Engber, 1995). Consistent with

Vaughan‟s (1991) findings that morphology errors are among the most frequent, the

teachers in this study all showed great concern with word forms, providing 59 comments

across all three essay sets. In particular, the teachers were most attentive to a writer‟s

knowledge of word groups:

Capitalism, capitalist, he knows his word groups. (Ann)

And make their own choose, here we have sort of a word form issue. (George)

… for instance I choiced my study, when your parents choiced your career,

that‟s actually the wrong form of the word. (Shelley)

Word form, for example, when any company needs to open any manufacture, the

word manufacture, maybe the students mean manufacturing company, some

other form but not that. Seems to me he‟s done that kind of mistake before, or

error, I don‟t know. Finally we can see safety, healthy, world, and happy life…

I‟m not sure if he meant, we can see a safe and healthy world, but I‟m not sure.

(Sarah)

Descriptor 34: Words are spelled correctly.

Despite its discrete and superficial nature, spelling was the most frequently-

mentioned evaluation criterion among the 39 descriptors. The teachers consistently

pointed out spelling problems, providing 104 comments across all three essay sets. As

the teachers' verbal reports illustrate, serious spelling errors often obscured the writer's

intended meaning to a great extent:

Again, the spelling totally threw me off there. (Ann)


Um, the one word I don‟t understand is dein or dyin themself. I think they mean

deny…, that is simply because in modern society, people deny themself…, oh,

sorry, it‟s define. The „f‟ is missing. Then „self‟ should be „selves,‟ to a great

extend is wrong, it should be „extent.‟ (Judy)

This is almost impossible to read because there‟s numerous spelling mistakes.

(Tim)

There‟re some spelling problems like the word example is spelled wrong. (Sarah)

Descriptor 35: Punctuation marks are used correctly.

Punctuation marks such as commas, full stops (periods), apostrophes, and colons

are the linguistic symbols that separate words into phrases, clauses, and sentences in

order to clarify meaning. Contrary to Milanovic et al.'s (1996) findings that punctuation

was of little interest to the raters on the First Certificate in English (FCE) examination,

all of the teachers in this study paid a great deal of attention to punctuation in written text,

for a total of 66 comments across all three essay sets:

Punctuation is a bit iffy as well. You don‟t need commas in places where he‟s

putting them. (Ann)

Yeah, with the help of the internet, period, we can communicate with each other

easily, period, but before the internet, period, people contact with others only by

letters. So, inability to use punctuation properly which then does impact meaning.

In my opinion, quite basic and leads to misunderstanding, improper use of

punctuation… (Beth)

One thing that puts me off right away is that this person has an annoying habit of

putting a comma. It doesn‟t really…, and you know they don‟t put a space

between the end of the sentence and the beginning of the next sentence. (James)

Punctuation is all over the place, either they‟re not sure they must use a comma

or suppose to use a period, or they think it‟s the same thing. That‟s really critical.

Punctuation is critical. Comma is really important piece of punctuation. If you

don‟t know where to use it in place of a period, like in the second paragraph, it

looks to me like, um, after the word success there, that should be a period,

starting a new sentence. But the way it‟s written, the capital goes before, which

beginning example which would not be a new sentence. (Tim)

Unlike other mechanical problems, punctuation use can be somewhat

complicated because of its interrelationship with syntactic structure. As Shelley and Tim

pointed out, the misuse of punctuation marks can cause problems at the sentence level:


I‟d have to go back, the punctuation is fusing sentences, dependent clauses,

independent clauses but I just want to look at a couple of those to see if it‟s just

the punctuation, or if I ignore that punctuation, is the syntax right? (Shelley)

So I would have to go back on this one and I would have to go through these

sentences again, in order to see whether in fact, it is the punctuation that is so

bad. Of course, there are no sentences here because of how it‟s punctuated, but

I‟d want to go back and see if there‟s a clear subject predicate for most of the

sentences. (Tim)

Descriptor 36: Capital letters are used appropriately.

Although it was not a major linguistic feature in the composition, capitalization

was noted by eight teachers who provided 19 comments across all three essay sets. As

James indicated, misuse of capitalization did not seriously distract from the rest of an
essay:

It‟s not much, even the second paragraph where he‟s talking about the master‟s,

it should be a capital. (Ann)

i prefer and i not being capitalized. (George)

the average american, american, capitalization, had left the of rest. (George)

So, I also notice there‟s like, you know the person doesn‟t capitalize I, so there‟s

some very basic conventions that they don‟t follow, um, so they should um, the

teacher giving feedback, because of such a relatively simple convention to

correct, I would say always capitalize your I. I make that a point with a student

because it‟s something that can be achieved rather easily. But it‟s not the most

important thing to be distracted about when you‟re marking. (James)

Descriptor 37: The essay contains appropriate indentation.

Relatively few teachers cited formatting issues, with five teachers providing 10

comments on indentation across all three essay sets. They noted that indentation would

have helped to create a better visual layout of the paragraph structure. Tim and Shelley

made this point well:

I think that‟s the first thing that just should be taught. Indenting every time you

have a thought, that‟s great but… (Tim)

I‟ll call this paragraph one, the first sentence. They haven‟t indented paragraphs

so it‟s hard to tell where paragraphs are. (Shelley)


Descriptor 38: The essay prompt is well-paraphrased, and is not replicated

verbatim.

Three teachers provided five comments on paraphrasing across two essay sets.

As Esther and Tim noted, some writers just replicated the prompt without rephrasing it in

their own words:

They were even able to take the prompt and they didn‟t just replicate the prompt.

(Esther)

I agree with the statement that the ability to cooperate well with others is far

more essential in today's world than it was in the past, of course that‟s taken

directly from your question. (Tim)

I think also everywhere in this society. To be specific…, okay, here we go, again

they‟re repeating what‟s in the prompt. (Tim)

Descriptor 39: Appropriate tone and register are used throughout the essay.

Researchers have suggested that writing is a productive endeavour that is

socially constructed between an individual writer and a particular context or culture.

Grabe and Kaplan (1996) incorporated genre knowledge and audience considerations

into their writing skills taxonomy to reflect the significant role of sociolinguistic

knowledge in writing. Swales (1990) also recognized the importance of genre knowledge

in academic writing. One component that constitutes sociolinguistic competence is the

knowledge of register (Bachman & Palmer, 1996). According to McCarthy (1990) and

Read (2000), register governs vocabulary choice and manifests the social dimension of

vocabulary. From an assessment perspective, Jacobs et al. (1981) and Brown and Bailey

(1984) argued that vocabulary knowledge subsumes knowledge of register, measuring

whether the vocabulary is appropriate to the audience or the tone of text.

The teachers in this study showed great interest in the appropriate use of tone

and register in written text: eight teachers provided 82 comments across all three essay

sets. As Ann pointed out, the use of first person “I” is not appropriate in academic

writing:

Um… too many „I‟s, it‟s, um, every sentence is „I,‟ „I,‟ „I,‟ so again tone and

register, totally inappropriate. (Ann)

In cases in which informal vocabulary was used, teachers commented that use of

colloquialisms or casual words are not appropriate in an academic essay:


In a nutshell, it‟s too colloquial. Instead, to summarize, in summary, to conclude,

um, he could even, you know, to put it simply, any of those, but in a nutshell it‟s

too familiar. (Ann)

She wrote a lot, but writing doesn‟t necessarily mean anything. Yeah, she wrote

a lot and um, she also seems like somebody to me whose speaking ability is

better than writing ability, or academic writing. Because this is quite informal,

using informal…, she‟s writing the way she would speak to a friend. You know,

to put it simply, we need money, that‟s pretty informal. And she starts a next

sentence, but over the years, the business began growing, which is also informal.

You know she‟s telling a little story here, a narrative which isn‟t exactly an

appropriate, um, rhetorical structure for this type of essay. (Sarah)

George focused more on the use of punctuation marks in academic writing

conventions, particularly bullets and exclamation points:

Okay, and I would tell this student, having bullet points like this isn‟t really

appropriate for this register of academic writing, you need to just embed the

points in that last sentence because first of all they‟re brief bullet points, if they

were long it would be different. They‟re very brief, it looks good and also jars

the reader a little bit because they are so brief, it‟s not expected with this type of

writing. (George)

And also informal use… these exclamations and the um, multiple questions, I

don‟t mind a question or two at the beginning to interest a reader but there are

quite a few questions in it and sort of informal use like I bet you do and again

this use of „you‟ in this academic context is a bit informal. (George)

Characteristics of EDD Descriptors

The descriptors showed that teachers paid considerable attention to the extent to

which writers satisfactorily addressed the given topic. Although content fulfillment was

not included in the research scope of traditional SLA studies examining ESL writing,

contemporary L1 and L2 writing theories do agree that it is a central component of good

writing. For example, Grabe and Kaplan (1996) considered topical knowledge or

knowledge of the world to be a parameter that determined writing performance. Similarly,

research on rater perceptions and behaviours has verified content fulfillment as an

important consideration. These empirical and theoretical accounts support the idea that

an important criterion of written text assessment is the extent to which an essay fulfills

content requirements.


Teachers also felt that organizational effectiveness determined the quality of an

essay. This finding was reasonable because the ability to coherently organize ideas has

long had a place in writing instruction and research; Halliday and Hasan (1976)

conceptualized cohesion and coherence as ways in which textual structure is tied together

in extended discourse. Similarly, Canale (1983) and Grabe and Kaplan (1996) suggested

that unified text can be attained through cohesion in form and coherence in meaning.

Most analytic rating scales also highlight its importance in written discourse; Hamp-

Lyons and Henning (1991) included organization as an independent evaluation criterion

in their ESL writing scale, and the IELTS writing rating scale considers coherence and

cohesion an important ESL writing subskill.

Teachers were also concerned about grammatical knowledge. Grammatical

accuracy has been one of the most-researched topics in SLA studies on writing

development and a central theme in ESL writing instruction and research. Achievement

of ESL writing skills has traditionally been defined as mastery of discrete grammar

knowledge and the ability to produce linguistically accurate written text (Kepner, 1991).

The presence of grammatical errors is therefore the primary language-related factor

affecting ESL composition teachers, suggesting that they are excessively concerned with

eradicating grammatical errors in student writing. This study reinforced the finding that recurrent
grammatical errors were teachers' primary concern in student writing assessments.
Specifically, teachers' attention was focused on fine-grained, specific aspects of

grammatical knowledge such as verb tense, article use, and preposition use. This finding

was noteworthy because most ESL writing scales measure learners‟ grammatical

competence at a macro-level, obscuring students' performance on specific grammar

components.

Teachers also showed considerable interest in various aspects of students'

vocabulary use. Their attention to the quality of written vocabulary (sophistication,

variety, choice, and collocation) echoed the idea that a good vocabulary leads to good

writing. The importance of vocabulary in written text is supported by theoretical

frameworks of L1 and L2 writing (e.g., Grabe & Kaplan, 1996) and empirical SLA

studies (e.g., Engber, 1995; Laufer, 1991; Laufer & Nation, 1995). It has also been

recognized by a variety of ESL writing scales: Brown and Bailey (1984) and Jacobs et al.


(1981) emphasized the close association between vocabulary and writing performance in

their ESL academic writing scales, as did the IELTS rating scale, which included lexical

resource as one constituent subscale. Similarly, Mullen (1977) found that vocabulary

appropriateness accounted for 84.4% of the variance in overall writing performance.

These research findings suggest that vocabulary is indeed an indispensable factor in

determining the quality of writing.

Writing mechanics was another area that drew teachers‟ attention; however, as

Polio (2001) rightly pointed out, mechanics has not been a central concern for language

researchers. There has been very little research examining writers‟ mechanical

proficiency in relation to their writing development. Indeed, Polio questioned whether
mechanics should even be considered part of the writing construct, since the various

aspects of mechanics (such as capitalization, spelling, indentation, and punctuation) are

not conceptually related to each other, making it difficult to form a unitary construct. Still,

mechanical knowledge does play a significant role in writing processes. Knowledge of

written code is achieved through the mastery of orthography, spelling, punctuation, and

formatting conventions (Grabe & Kaplan, 1996), and a writer's intended meaning would

be obscured and lost without their appropriate use. The value of mechanics can also be

found in existing writing rating scales. Jacobs et al. (1981) and Brown and Bailey (1984)

considered mechanics a component of their academic writing scales.

In summary, the review of the descriptors suggested that the five writing skills

appear to encompass all aspects of the 39 descriptors: (a) content fulfillment, (b)

organizational effectiveness, (c) grammatical knowledge, (d) vocabulary use, and (e)

mechanics. This skill configuration was consistent with the theoretical discussions and

existing assessment schemes discussed in Chapter 2. The scale created by Jacobs et al.

(1981) was particularly relevant to this classification in that it described the five ESL

writing skills in a comprehensive manner based upon empirical data. The ways in which

the descriptors correspond to the five skills will be discussed in Chapter 6, which

presents the results from the ESL writing experts‟ sorting activity.


Refinement of EDD Descriptors

The 39 descriptors elicited from the teachers' think-aloud verbal reports were

subjected to review and refinement by ESL academic writing experts. Each descriptor

was examined in order to evaluate whether it was clear, non-redundant, useful, and

relevant to ESL academic writing. Three descriptors were identified as problematic: D14:

Ideas reflect the central focus of the essay, without digressing; D28: Conditional verb

forms are used appropriately; and D38: The essay prompt is well-paraphrased, and is not

replicated verbatim. The experts pointed out that D14 overlapped with D11-D13, and that

D28 and D38 addressed relevant, but too-specific, aspects of ESL writing. Indeed, these

descriptors were rarely mentioned in the ESL teachers' think-aloud verbal reports, with

total comments accounting for less than 1% of all verbal protocols. The experts also

suggested combining two descriptors (D16: Syntactic variety is demonstrated in this

essay and D17: Complex sentences are used effectively) to form one new descriptor,

such as “This essay demonstrates syntactic variety, including simple, compound, and

complex sentence structures”. The review and refinement process resulted in the

elimination of three descriptors altogether, and the combination of two other descriptors

into one, for a final total of 35 descriptors (see Table 14).

The clarity of the descriptors was also reviewed. The experts read each

descriptor iteratively and edited it so that it would be easy and clear for teachers to use.

Twenty-two descriptors were edited in this manner, with most editing focused on specific

wordings to minimize ambiguity. The descriptors were then examined for distinctiveness

and comprehensiveness. Each descriptor was confirmed to be independent of the others

and comprehensive enough to cover all aspects of ESL academic writing. No new

descriptors were added to the descriptor pool.

When the experts were asked whether the descriptors were conducive to making

a binary choice (yes or no), most commented that while choices on a four-point Likert

scale (strongly agree, somewhat agree, somewhat disagree, or strongly disagree) were

preferable for descriptors that required a subjective judgment (such as D26 and D27),

they were able to use the binary choice as well if necessary. The binary choice was used

for the EDD checklist for several reasons, but primarily because it is difficult to build a

diagnostic model using polytomous data due to technical limitations. Although recent


development in CDA has yielded a psychometric diagnostic model that can deal with

polytomous data, the model's stability when applied to real (not simulated) data was

unknown. It was also questionable how increasing the parameters in such a model would

affect the model convergence and parameter estimations, given the small sample size in

this study (n=480). Finally, while some descriptors (e.g., D26 and D27) addressed the degree

to which a student mastered a given skill, other descriptors relied on the absolute mastery

or non-mastery of a skill. For example, D01, D09, and D34 were more likely to be

answered with a yes (mastery) or a no (non-mastery) choice instead of a Likert-scaled

choice.


Table 14

Refined 35 EDD Descriptors

Descriptor

1. This essay answers the question.

2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.

3. This essay is concisely written and contains few redundant ideas or linguistic expressions.

4. This essay contains a clear thesis statement.

5. The main arguments of this essay are strong.

6. There are enough supporting ideas and examples in this essay.

7. The supporting ideas and examples in this essay are appropriate and logical.

8. The supporting ideas and examples in this essay are specific and detailed.

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.

10. Each body paragraph has a clear topic sentence tied to supporting sentences.

11. Each paragraph presents one distinct and unified idea.

12. Each paragraph is connected to the rest of the essay.

13. Ideas are developed or expanded well throughout each paragraph.

14. Transition devices are used effectively.

15. This essay demonstrates syntactic variety, including simple, compound, and complex sentence structures.

16. This essay demonstrates an understanding of English word order.

17. This essay contains few sentence fragments.

18. This essay contains few run-on sentences or comma splices.

19. Grammatical or linguistic errors in this essay do not impede comprehension.

20. Verb tenses are used appropriately.

21. There is consistent subject-verb agreement.

22. Singular and plural nouns are used appropriately.

23. Prepositions are used appropriately.

24. Articles are used appropriately.


Table 14 (Continued)

Descriptor

25. Pronouns agree with referents.

26. Sophisticated or advanced vocabulary is used.

27. A wide range of vocabulary is used.

28. Vocabulary choices are appropriate for conveying the intended meaning.

29. This essay demonstrates facility with appropriate collocations.

30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.

31. Words are spelled correctly.

32. Punctuation marks are used appropriately.

33. Capital letters are used appropriately.

34. This essay contains appropriate indentation.

35. Appropriate tone and register are used throughout the essay.


Summary

This chapter has discussed the identification of the descriptors that make up the

EDD checklist. Think-aloud verbal protocols from nine ESL teachers were open-coded,

focusing on recurrent evaluative themes of ESL academic writing subskills and textual

features. Thirty-nine concrete, fine-grained descriptors were empirically identified and

sequentially confirmed using theoretical analysis. The descriptors addressed all aspects

of ESL writing skills, including content fulfillment, organizational effectiveness,

grammatical knowledge, vocabulary use, and mechanics. The descriptors were then

subjected to the review of four ESL academic writing experts. The review and refinement

process eliminated three descriptors and merged two descriptors into one, resulting in a

final total of 35 EDD descriptors. These 35 descriptors appeared in the EDD checklist

accompanied by a yes or a no response option. The next chapter discusses the

preliminary evaluation of the EDD checklist.


CHAPTER 5

PRELIMINARY EVALUATION OF THE EDD CHECKLIST

Introduction

This chapter discusses the preliminary evaluation of the EDD checklist

conducted in Phase 2. Prior to proceeding to the main diagnosis modeling, the checklist's

basic functionality was examined from multiple perspectives. Seven ESL teachers piloted

the EDD checklist to assess 80 TOEFL iBT independent essays, determining its

effectiveness. Both quantitative and qualitative data were collected and analyzed in order

to examine the three validity assumptions:

The scores derived from the EDD checklist are generalizable across different

teachers and essay prompts (Teacher and essay prompt effects).

Performance on the EDD checklist is related to performance on other measures

of ESL academic writing (Correlation between EDD scores and TOEFL scores).

The EDD checklist helps teachers make appropriate diagnostic decisions and has

the potential to positively impact teaching and learning ESL academic writing

(Teacher perceptions and evaluations).

The results of examining these assumptions informed the checklist's usability and

furthered the main study. The empirical evidence needed to justify each validity

assumption is presented below.

Teacher and Essay Prompt Effects

Facet Measures

Prior to estimating facet measures, model convergence was checked using a Joint

Maximum Likelihood Estimation (JMLE) algorithm. The convergence criteria were set

at 0.1 for the maximum size of the marginal score residual, and at 0.01 for the maximum

size of the logit change. These tight criteria were chosen to produce a result with high

precision. Convergence was reached after 43 iterations, resulting in 0.0923 for the largest

marginal score residual and -0.0002 for the largest logit change. These negligibly small

values indicated that the score difference and logit change were insignificant.
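The stopping rule described above, iterating until both the largest marginal score residual and the largest logit change fall below preset thresholds, can be sketched as follows. This is not the FACETS/JMLE implementation; the estimation step is a hypothetical placeholder, and only the convergence logic is illustrated.

def run_until_converged(step, max_residual=0.1, max_logit_change=0.01, max_iter=1000):
    # `step` performs one hypothetical estimation update and returns the
    # largest marginal score residual and the largest logit change.
    for iteration in range(1, max_iter + 1):
        residual, change = step()
        if abs(residual) < max_residual and abs(change) < max_logit_change:
            return iteration, residual, change
    raise RuntimeError("did not converge within max_iter iterations")

# Toy step in which both quantities shrink geometrically:
state = {"residual": 5.0, "change": 0.5}
def toy_step():
    state["residual"] *= 0.8
    state["change"] *= 0.8
    return state["residual"], state["change"]

print(run_until_converged(toy_step))  # converges well before max_iter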


The extent to which the data fit the model was then examined using two

approaches.26

The first method utilized the information derived from the FACETS

summary report. FACETS provides summary statistics that evaluate whether each facet

has been successfully estimated and whether the data fit the model. When the model-data

fit is satisfied, the mean of the standardized residuals (StRes) is close to 0 and the sample

standard deviation (SD) is close to 1 (Linacre, 2009). As Table 15 shows, the two

statistics support the model-data fit: the mean of the standardized residuals is near 0 (i.e.,

-0.01) and the standard deviation is near 1 (i.e., 1.01).

Table 15

FACETS Data Summary

Statistic           Category  Score  Expected score  Residual  Standardized residual (StRes)
M (Count: 7,326)    0.59      0.59   0.59            0.00      -0.01
SD (Population)     0.49      0.49   0.23            0.44      1.01
SD (Sample)         0.49      0.49   0.23            0.44      1.01
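For reference, the standardized residuals summarized in Table 15 follow the usual Rasch-family definition for dichotomous ratings; the formula below is the conventional one from standard many-facet Rasch accounts, not reproduced from the FACETS output itself. For essay n rated on descriptor i by teacher j,

\[
z_{nij} = \frac{x_{nij} - E_{nij}}{\sqrt{E_{nij}\,(1 - E_{nij})}},
\qquad
E_{nij} = \frac{\exp(\theta_n - \delta_i - \lambda_j)}{1 + \exp(\theta_n - \delta_i - \lambda_j)},
\]

where \(\theta_n\) is the student's writing proficiency, \(\delta_i\) the descriptor difficulty, and \(\lambda_j\) the teacher's severity. When the data fit the model, the \(z_{nij}\) have a mean near 0 and a standard deviation near 1, which is the pattern Table 15 shows.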

The second model-data fit evaluation method was related to unexpected

responses with extreme standard residual values. According to Linacre (2007), in order

for the data to fit the model, about 5% or less of the total standard residuals should lie

outside of the range of -2 to +2, and about 1% or less should lie outside of the range of -3

to +3. Of 7,326 valid responses, 257 responses (about 3.5%) were associated with

standard residuals above +2 or below -2 and 57 responses (about 0.78%) were associated

with standard residuals above +3 or below -3. The distribution of unexpected responses

was roughly even across all teachers, ranging from 18 to 49 (see Table 16).

Table 16

Distribution of Unexpected Responses across Teachers

Teacher No. of ratings StRes < -2 or StRes > 2 StRes < -3 or StRes > 3

Angelina 1,047 35 2

Ann 1,050 34 12

Beth 1,041 18 2



Table 16 (Continued)

Teacher No. of ratings StRes < -2 or StRes > 2 StRes < -3 or StRes > 3

Brad 1,050 41 6

Esther 1,043 48 17

Susan 1,046 49 14

Tom 1,049 32 4

Total 7,326 257 57

26 In the Rasch model, the point of interest is whether the data fit the model, not whether the model fits the data.
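The percentages reported above follow directly from the counts in Table 16; a one-line check makes the arithmetic explicit:

total = 7326
print(f"{257 / total:.1%}, {57 / total:.2%}")  # 3.5%, 0.78%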

Once the model-data fit was satisfied, the estimation of each facet measure was

examined. The FACETS variable map displays all facets graphically on a common logit

scale, enabling comparisons within and between facets (see Figure 2). The first column

in the map shows a logit scale applied equally across facets. A higher logit value

indicates a more able examinee, a more difficult task, or a more severe rater, whereas a

lower logit value indicates a less able examinee, a less difficult task, or a less severe rater.

The second column shows writing proficiency measures for the 80 students.

Student proficiency measures ranged from 2.53 logits (Essay 2166) to -2.15 logits (Essay

1006), with 4.68 logit spread: the student who wrote Essay 2166 was the most proficient,

whereas the student who wrote Essay 1006 was the least proficient. The third column

displays the difficulty measures of the essay prompts. As the difficulty measures had

been adjusted to be the same (see Chapter 3), they were placed on the same point of the

logit scale. The fourth column presents the seven teachers' severity measures (see Table
17 for a more detailed discussion of teacher measures): Beth was the most severe in assessing student

essays, while Esther was the most lenient. The logit spread was 1.08, ranging from 0.15

logits (Beth) to -0.93 logits (Esther). Interestingly, three teachers (Angelina, Brad, and

Tom) exhibited almost the same severity measures. Finally, the fifth column presents

difficulty measures for the 35 descriptors, which ranged from 1.41 logits (D26) to -1.82

logits (D35), with 3.23 logit spread. D26 (word sophistication) was the most difficult for

students to master, while D35 (tone and register) was the easiest. A close examination of

the difficulty measures revealed that descriptors related to vocabulary knowledge were

relatively more difficult than others. For example, D29 (collocation) and D27 (word

variety) were the third and eighth most difficult descriptors (see Appendix N for detailed

information about descriptor measures). On the other hand, descriptors associated with


grammatical knowledge were relatively easier. Most grammar-related descriptors (except

for D18 and D19) exhibited difficulty measures below the mean, suggesting relative

easiness. Descriptors measuring content fulfillment also drew attention. These were

considered difficult, as is evident from their position at the top of the column (e.g., D3,

D5, D6, D7, and D8).

The overall pattern of the FACETS variable map suggests that the elements

comprising the teacher facet were least varied compared to those of other facets. The

range of teacher severity measures (1.08 logits) was the narrowest, suggesting that they

exhibited relatively homogeneous rating behaviours. On the other hand, substantial

variability was found in the descriptor difficulty. As the wide spread of the difficulty

measures (3.23 logits) indicates, the descriptors differed greatly in terms of difficulty.

This variation suggests that descriptors measure different facets of writing skills with

different difficulty measures.


+--------------------------------------------------------------------------------------------------------------------------------------------------------+

|Measr|+ESSAY |-PROMPT GROUP |-TEACHER |-DESCRIPTOR |

|-----+------------------------------------------------------------+----------------------------+------------------------------+-------------------------|

| 3 + + + + |

| | | | | |

| | | | | |

| | | | | |

| | 2116 | | | |

| | | | | |

| | | | | |

| | | | | |

| 2 + + + + |

| | 2096 | | | |

| | | | | |

| | | | | |

| | 1070 1111 2119 | | | |

| | 1110 | | | D26 |

| | 1113 | | | |

| | 1104 2131 | | | |

| 1 + 1037 + + + D5 |

| | 1101 1112 | | | D10 D29 D6 |

| | 1134 2025 2120 2148 | | | D14 D8 |

| | 1109 2058 2107 | | | D19 D27 D3 D7 |

| | 1005 1055 1069 1088 1107 1117 2051 2076 2095 2104 | | | D31 |

| | 1114 2079 2097 2099 | | | |

| | 1146 2109 | | | D18 |

| | 1010 1056 1080 | | Beth | D1 D11 D2 D30 D34 |

* 0 * 1013 1020 1038 1053 2050 2070 * COOPERTATION SUBJECT * * D13 D32 D4 *

| | 1023 2032 2074 2077 2080 | | | D17 D24 |

| | 1002 1018 2005 2022 | | | D15 D20 |

| | 1004 1014 1074 2002 2018 2023 | | Angelina Brad Tom | D12 D23 D28 |

| | 1011 1081 2081 | | | |

| | 2067 | | Ann | D9 |

| | 1050 2001 2020 | | Susan | |

| | 1003 1008 | | Esther | |

| -1 + 2003 + + + |

| | 2004 | | | D21 D25 |

| | 1009 2029 | | | D22 D33 |

| | 1015 | | | |

| | 1007 | | | D16 |

| | 2011 | | | |

| | 2013 | | | |

| | 2006 2019 | | | D35 |

| -2 + + + + |

| | 1006 2015 | | | |

| | | | | |

| | | | | |

| | | | | |

| | | | | |

| | | | | |

| | | | | |

| -3 + + + + |

|-----+------------------------------------------------------------+----------------------------+------------------------------+-------------------------|

|Measr|+ESSAY |-PROMPT GROUP |-TEACHER |-DESCRIPTOR |

+--------------------------------------------------------------------------------------------------------------------------------------------------------+

Figure 2. FACETS Variable Map


Teacher Internal Consistency

The extent to which the teachers were internally consistent was examined based

upon teacher fit statistics. Teacher fit statistics indicate the degree to which each teacher

is internally consistent in his or her ratings. Different rules of thumb are applied for

interpreting fit statistics and for setting upper and lower limits because they are more or

less context-dependent and require a targeted use of the test results (Myford & Wolfe,

2004a). When a test of interest is used to make a high-stakes decision, tight quality

control limits (such as mean squares of 0.8 to 1.2) are set; however, if the stakes are low,

looser limits are acceptable. Wright and Linacre (1994) proposed that the mean square

values of 0.6 to 1.4 are reasonable limits for data gathered using a rating scale.

In this study, the lower and upper quality control limits were set at 0.5 and 1.5,

respectively (Lunz & Stahl, 1990), since this study examines the rating behaviours of

teachers in a classroom setting rather than in a high-stakes test setting. An infit mean

square value less than 0.5 indicated overfit or a lack of variability in their scoring, while

an infit mean square value greater than 1.5 indicated significant misfit or a high degree of

inconsistency in the ratings. Table 17 presents several of the statistics associated with the

teacher facet; in particular, the fifth and sixth columns display the infit and outfit mean

squares for each teacher. All infit and outfit mean squares were within the range of 0.5

and 1.5, indicating that none of the teachers exhibited misfitting or overfitting rating

patterns and that all were internally consistent in their ratings.

Table 17

Teacher Measure Statistics

Teacher   Observed Average   Measure (logits)   Model S.E.   Infit MnSq   Outfit MnSq   Corr. PtBis   Exact Obs %   Agree. Exp %

Angelina 0.6 -0.35 0.07 1.02 1.02 0.20 65.9 59.7

Ann 0.6 -0.64 0.07 1.01 1.05 0.22 64.1 60.5

Beth 0.5 0.15 0.07 0.89 0.84 0.29 65.9 57.8

Brad 0.6 -0.37 0.07 1.01 1.01 0.25 63.7 60.9

Esther 0.6 -0.93 0.07 1.04 1.15 0.20 63.9 60.4

Susan 0.7 -0.71 0.07 1.02 1.07 0.23 64.7 61.4

Tom 0.6 -0.39 0.07 1.00 0.97 0.25 64.6 60.1


Table 17 (Continued)

Teacher   Observed Average   Measure (logits)   Model S.E.   Infit MnSq   Outfit MnSq   Corr. PtBis   Exact Obs %   Agree. Exp %

M 0.6 -0.46 0.07 1.00 1.01 0.23

SD 0.1 0.32 0.00 0.05 0.09 0.03

RMSE (Model) = 0.07, Adj. SD = 0.31, Separation = 4.38, Separation (not inter-rater) Reliability = 0.95
Fixed (all same) chi-square = 143.2, d.f. = 6, Significance (probability) = 0.00
Inter-rater agreement opportunities: 9,701; Exact agreements: 6,275 = 64.7%; Expected agreements: 5,832.1 = 60.1%

Note. Infit MnSq = Infit Mean Square, Outfit MnSq = Outfit Mean Square, Corr. PtBis = Point-Biserial Correlation, Exact Obs % = Percentage of Exact Observed Agreement, Agree. Exp % = Percentage of Expected Agreement, RMSE = Root Mean Square Standard Error, Adj. SD = Adjusted Standard Deviation.
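To make the quality-control rule concrete, the short Python sketch below flags teachers whose infit or outfit mean squares fall outside the 0.5 to 1.5 limits adopted in this study. The mean square values are taken from Table 17 above; the code is an illustrative sketch only and is not part of the FACETS analysis itself.

    # Illustrative sketch: screen teachers' infit/outfit mean squares against the
    # 0.5-1.5 quality-control limits used in this study (values from Table 17).
    LOWER, UPPER = 0.5, 1.5

    teacher_fit = {                      # teacher: (infit MnSq, outfit MnSq)
        "Angelina": (1.02, 1.02), "Ann": (1.01, 1.05), "Beth": (0.89, 0.84),
        "Brad": (1.01, 1.01), "Esther": (1.04, 1.15), "Susan": (1.02, 1.07),
        "Tom": (1.00, 0.97),
    }

    def fit_label(infit, outfit, lower=LOWER, upper=UPPER):
        """Label a rater as misfitting, overfitting, or acceptably consistent."""
        if infit > upper or outfit > upper:
            return "misfit (inconsistent ratings)"
        if infit < lower or outfit < lower:
            return "overfit (too little variability)"
        return "acceptable"

    for teacher, (infit, outfit) in teacher_fit.items():
        print(teacher, fit_label(infit, outfit))   # all seven teachers print 'acceptable'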

A more detailed analysis using the rater effect criteria proposed by Wolfe, Chiu,

and Myford (1999) was conducted to further examine teachers' internal consistency.

Wolfe et al. adopted tight quality control indices (mean squares of 0.7 and 1.3 for the

lower and upper limits) and determined rater effect using fit statistics and the proportion

of unexpected ratings for each rater (Zp).27

The combinations of these indices indicate

accurate, random, halo/central, and extreme rating patterns, respectively. According to

Myford and Wolfe (2004b), a random rating pattern occurs when raters use one or more

scales inconsistently compared to other raters, while an extreme rating pattern occurs

when raters assign ratings at the high or low ends of the scale. The halo effect occurs

when raters assign similar ratings to a distinctive trait, and the centrality effect occurs

when raters overuse the middle categories of a rating scale (Myford & Wolfe, 2004b).

Table 18 presents the ways in which the rater effect is determined based upon fit

statistics and Zp indices. The teachers' rating behaviour is summarized in the last column,

indicating that all of the teachers exhibited accurate rating patterns.

27 For a discussion of how to compute a Zp index, see Myford and Wolfe (2000).


Table 18

Teacher Effect

Rater effect Infit MnSq Outfit MnSq Zp No. of teachers

Accurate 0.7 ≤ infit ≤ 1.3 0.7 ≤ outfit ≤ 1.3 Zp ≤ 2.00 7
Random infit > 1.3 outfit > 1.3 Zp > 2.00 0
Halo/Central infit < 0.7 outfit < 0.7 Zp > 2.00 0
Extreme 0.7 ≤ infit ≤ 1.3 outfit > 1.3 Zp > 2.00 0
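The decision rules in Table 18 can be written out as a simple function. The sketch below follows the combinations of infit, outfit, and Zp shown in the table; the final fallback label is added here only for completeness and is not part of Wolfe et al.'s scheme, and the Zp value in the example call is hypothetical.

    # Illustrative sketch of the Table 18 decision rules (Wolfe, Chiu, & Myford, 1999).
    # Zp is the standardized proportion of unexpected ratings for a rater.
    def rater_effect(infit, outfit, zp):
        if 0.7 <= infit <= 1.3 and 0.7 <= outfit <= 1.3 and zp <= 2.0:
            return "accurate"
        if infit > 1.3 and outfit > 1.3 and zp > 2.0:
            return "random"
        if infit < 0.7 and outfit < 0.7 and zp > 2.0:
            return "halo/central"
        if 0.7 <= infit <= 1.3 and outfit > 1.3 and zp > 2.0:
            return "extreme"
        return "unclassified"            # combinations not covered by Table 18

    print(rater_effect(0.89, 0.84, 1.2))   # Beth's fit values with a hypothetical Zp -> 'accurate'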

Teacher Agreement

Two approaches were used in order to examine the degree of agreement between

teacher assessments. The first used a percentage of exact agreement, which indicated the

percentage of times that each teacher provided exactly the same ratings as another

teacher under identical circumstances. The agreement statistics and expected values were

provided by FACETS. As the eighth column of Table 17 shows, the exact observed

agreement of teachers ranged from 63.7% to 65.9% (M = 64.7%).28

Although this range

does not seem to support the idea of substantial agreement among teachers, it is

reasonable considering that the teachers were not trained as professional raters of ESL

writing. A similar agreement pattern is found in other writing assessment research.

Barkaoui (2008) reported that teachers' agreement reached 22.4% when they used a nine-

point holistic rating scale and that their agreement was 23.1% when they used a nine-

point analytic rating scale. When his teacher group was examined further, novice

teachers showed 20.0% agreement, while experienced teachers exhibited 26.3%

agreement. His findings seem to confirm the difficulty of achieving high agreement

among teachers who are not trained as professional assessment raters, echoing this

study's finding.

However, when well-trained certified raters are involved in a high-stakes ESL

writing assessment, a fair amount of agreement can be achieved. Knoch (2007) examined

the functionality of two analytic ESL writing scales: Diagnostic English Language Needs

28 According to Linacre (2009), observed exact agreement is defined as "the proportion of times one observation is exactly the same as one of the other observations for which there are the same circumstances" (p. 237). The observed exact agreement of 64.7% was therefore computed as (6,275/9,701) × 100 = 64.7%. On the other hand, an expected agreement is defined as an "expected percent of exact agreements between raters on ratings under identical conditions, based on Rasch measures" (p. 160).


Assessment (DELNA) and a newly developed diagnostic scale, and reported somewhat

fair, but still unsubstantial, agreement. The two rating scales were developed to assess

student writing skills and consisted of six levels (in the case of DELNA) and four to six

levels (in the case of the new diagnostic scale). When raters used the DELNA rating

scale, their agreement ranged from 33% to 41.7% (M = 37.92, SD = 2.49); when the new

diagnostic scale was used, agreement ranged from 36.1% (for a six-level scale) to 61.9%

(for a four-level scale) (M = 51.15, SD = 7.94). That the raters were well-trained certified

professionals must have contributed to this fair or moderate agreement, but it still

indicates that it is extremely difficult for raters to achieve substantial agreement on

writing assessments, possibly because of the inherently subjective nature of the task.

The second approach to examining inter-teacher reliability was a correlation

between a single rater and the rest of the raters (SR/ROR). SR/ROR correlation indicates

the degree to which one particular rater (i.e., the single rater) rank-orders examinees in a

manner consistent with all other raters. According to Myford and Wolfe (2004a),

SR/ROR correlations greater than 0.7 are considered high for an assessment in which a

multiple-level rating scale is involved, whereas SR/ROR correlations less than 0.3 are

thought to be somewhat low. Still, they caution that the control limit must be relaxed as

the number of scale categories decreases: for example, they report that SR/ROR

correlations as low as 0.2 are common in dichotomous ratings.29

As the seventh column

of Table 17 illustrates, teachers' SR/ROR correlations in this study ranged from 0.20 to

0.29 (M = 0.23, SD = 0.03), suggesting that each teacher rank-ordered students in a

manner similar to that of the other teachers.30

Further analysis was conducted in order to examine the extent to which the

teachers agreed on each individual descriptor. The percentage of teachers' ratings that

agreed on each descriptor per essay was calculated, and the mean and standard deviations

of the agreements on 10 essays were examined. Ratings were derived from the 10 essays

in Batch 03 because these essays were assessed by all of the teachers. As Table 19 shows,

teachers had the highest agreement on D16 (word order; agreement = 90%) and exhibited

the lowest agreement on D13 (idea development; agreement = 61.43%). When the

29 An SR/ROR correlation near or less than 0 indicates low inter-rater reliability.
30 SR/ROR correlations are referred to as point-biserial correlations in FACETS analysis.


descriptors that elicited high agreement (> 85%) were examined, most were related to

discrete grammar knowledge (e.g., D16, D23, D31, and D32). On the other hand, when

the descriptors that elicited low agreement (< 70%) were examined, they were found to

be associated with global content skills (e.g., D01, D05, D06, D07, and D13). These

results are consistent with Milanovic et al.'s (1996) findings suggesting that essay content is the most subjective component to rate, because raters' personal reactions might significantly affect their ratings.
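One plausible way to operationalize this per-descriptor agreement index is as the share of teachers whose binary rating matches the majority rating for a descriptor on a given essay, averaged over the essays in the batch. The Python sketch below illustrates that reading with invented ratings; the actual computation used in the study may have differed in detail.

    # Illustrative sketch: per-descriptor agreement as the majority-rating share per essay,
    # averaged (with SD) over essays. The 0/1 ratings below are invented, not study data.
    from statistics import mean, stdev

    def essay_agreement(ratings):
        """Percentage of teachers whose 0/1 rating matches the majority for one essay."""
        majority = max(ratings.count(0), ratings.count(1))
        return 100 * majority / len(ratings)

    # hypothetical ratings for one descriptor: 3 essays rated by 7 teachers
    essays = [
        [1, 1, 1, 1, 1, 0, 1],
        [0, 0, 1, 0, 0, 0, 0],
        [1, 1, 0, 1, 0, 1, 1],
    ]
    per_essay = [essay_agreement(r) for r in essays]
    print(round(mean(per_essay), 2), round(stdev(per_essay), 2))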

Table 19

Teacher Agreement on Descriptors

Descriptor Agreement (%) SD Descriptor Agreement (%) SD

D01 65.71 13.80 D19 77.14 13.80

D02 80.00 12.05 D20 82.86 17.56

D03 74.29 17.56 D21 80.00 21.51

D04 70.00 18.38 D22 77.14 18.07

D05 68.57 14.75 D23 85.71 15.06

D06 65.71 13.80 D24 78.57 15.43

D07 68.57 14.75 D25 84.29 18.38

D08 74.29 13.13 D26 77.14 13.80

D09 80.00 15.36 D27 81.43 15.13

D10 71.43 15.06 D28 77.14 15.36

D11 70.00 14.21 D29 82.86 14.75

D12 75.71 16.56 D30 78.57 13.88

D13 61.43 6.90 D31 87.14 18.38

D14 77.14 16.77 D32 85.71 15.06

D15 78.57 15.43 D33 84.29 15.72

D16 90.00 15.13 D34 72.86 18.38

D17 77.14 18.07 D35 85.71 9.52

D18 77.14 19.28


Bias Analysis

A bias analysis was carried out to further explore the interaction between the

teachers and the descriptors. The extent to which a teacher was biased for or against a

particular descriptor was standardized to a z-score in a bias analysis. A teacher with a z-

score between -2 and +2 was considered to be using a descriptor without significant bias.

When the z-score was below -2, the teacher was using that particular descriptor in a

significantly lenient manner compared to how he or she used other descriptors. When the

z-score was greater than +2, the teacher was using that descriptor more severely than he

or she did other descriptors.

Table 20 presents the bias terms between the teachers and the descriptors. A

fixed chi-square test indicated an overall significant bias interaction between the teachers and the descriptors, p = .00. When individual interaction effects were examined, only a few such cases were found: Beth was particularly severe with D03 (conciseness; z =

2.90, p = .01) and D08 (specific ideas and examples; z = 2.75, p = .01), while Susan was

particularly lenient with D27 (word variety; z = -2.80, p < .05) compared to other

descriptors. Except for these specific cases, teachers were not positively or negatively

biased toward any particular descriptors.

Table 20

Interactions between Teachers and Descriptors

Teacher   Measr   Des   Measr   Obsvd Score   Exp. Score   z-score   Model S.E.   t   d.f.   p

Ann -0.64 D16 -1.55 30 26.2 -2.63 1.73 -1.52 29 .14

Beth 0.15 D03 0.64 1 10.3 2.90 1.03 2.83 28 .01

Beth 0.15 D08 0.79 1 9.4 2.75 1.03 2.68 28 .01

Susan -0.71 D27 0.64 28 16.4 -2.80 0.76 -3.71 29 .00

M (Count: 245) 17.7 17.7 -0.02 0.46 0.01

SD (Population) 6.2 4.6 0.83 0.14 1.61

SD (Sample) 6.2 4.6 0.83 0.14 1.61

Fixed (all = 0) chi-square: 633.8 d.f.: 245 significance (probability): .00

When the possible interactions between the prompts and the descriptors were

examined, no bias effect was found for or against either prompt.
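The screening rule described above can be expressed directly in code. The sketch below applies the |z| > 2 criterion to the four significant teacher-by-descriptor interactions reported in Table 20; it is illustrative only.

    # Illustrative sketch: flag teacher-by-descriptor bias terms with |z| > 2
    # (z-scores taken from Table 20 above).
    bias_terms = {
        ("Ann", "D16"): -2.63, ("Beth", "D03"): 2.90,
        ("Beth", "D08"): 2.75, ("Susan", "D27"): -2.80,
    }

    def bias_label(z):
        if z > 2:
            return "significantly severe on this descriptor"
        if z < -2:
            return "significantly lenient on this descriptor"
        return "no significant bias"

    for (teacher, descriptor), z in bias_terms.items():
        print(teacher, descriptor, bias_label(z))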


Correlation between EDD Scores and TOEFL Scores

The writing proficiency measures estimated by the MFRM analysis were

correlated with the scores awarded by ETS raters across the 80 essays. The magnitude of

the correlation was substantial; r = .77, p < .01 for the subject prompt and r = .78, p < .01

for the cooperation prompt. The overall correlation across all 80 essays was also

moderately strong (r = .77, p < .01). This result provides some convergent evidence that

the EDD checklist measures the same writing construct measured by the TOEFL iBT

independent writing rating scale; however, further evidence is needed, given that correlational evidence alone is limited in its ability to address construct-related questions.
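In computational terms, this convergent-evidence check reduces to Pearson correlations between the MFRM writing measures and the ETS scores, computed per prompt and then over all 80 essays. The sketch below uses placeholder score vectors rather than the study data and assumes Python 3.10+ for statistics.correlation.

    # Illustrative sketch: Pearson correlations between EDD-based measures and ETS scores,
    # per prompt and overall. The short score lists are placeholders, not the study data.
    from statistics import correlation   # available in Python 3.10+

    edd = {"subject": [0.4, 1.2, -0.3, 0.8], "cooperation": [0.1, -0.9, 1.5, 0.6]}
    ets = {"subject": [3.0, 4.0, 2.5, 3.5], "cooperation": [3.0, 2.0, 4.5, 3.5]}

    for prompt in edd:
        print(prompt, round(correlation(edd[prompt], ets[prompt]), 2))

    all_edd = edd["subject"] + edd["cooperation"]
    all_ets = ets["subject"] + ets["cooperation"]
    print("overall", round(correlation(all_edd, all_ets), 2))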

Teacher Perceptions and Evaluations

Teacher Confidence Levels

The degree to which the teachers were confident about their assessments across

the 35 descriptors on 10 essays (5 essays × 2 prompts) is presented in Tables 21 and 22.

Teachers' confidence levels showed a similar pattern across the two prompts, with the

mean ranging from 76.57% (D06) to 98% (D33) on the subject prompt and from 78.29%

(D08) to 97% (D34) on the cooperation prompt. They were generally confident in

assessing writing subskills related to D22 (singular and plural nouns), D24 (article use),

D31 (spelling), D33 (capitalization), D34 (indentation), and D35 (tone and register).

These descriptors showed confidence levels greater than 90% across the two prompts.

Teachers appeared less confident in assessing content-related writing skills, with

confidence levels lower than 80% on D05 (strong argument), D06 (enough ideas and

examples), and D07 (logical ideas and examples) across the two prompts. This result

suggests that teachers were more confident using descriptors associated with surface-

level grammatical (D22 and D24) and mechanical features (D31, D33, and D34) than

they were using those related to global content areas (D05, D06, and D07), and that the

subjective nature of the content criteria might have affected their confidence levels.

When confidence levels were examined across teachers, Tom and Ann were consistently

the most confident in using the descriptors (mean confidence > 95%), whereas Brad was

consistently less confident (mean confidence < 80%) on the two prompts.


The teacher agreement and confidence level data points were plotted together on

the same graph in order to closely examine their associations (see Figure 3). When

overall trends were examined, teachers notably agreed less on the content-related

descriptors (D04, D05, D06, D07, and D08), as they were less confident in using them.

Teacher agreement thus seems to reflect the teachers' confidence levels to some extent, with one exception: although they expressed their confidence in assessing the writing skills associated with D34 (indentation), the teachers' agreement was not as high as

expected.31

Care should be taken in how the relationship between teacher agreement and

confidence is interpreted.

31 One teacher commented on the follow-up questionnaire that it was not clear how many spaces were considered appropriate indentation. This issue is revisited in the next section.


Table 21

Teacher Confidence (%) on the Subject Prompt

Teacher D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 D16 D17 D18

Angelina 100 70 78 78 82 66 74 66 86 82 72 68 60 84 60 68 62 70

Ann 94 96 100 94 90 94 94 94 100 100 98 100 98 100 100 98 100 100

Beth 92 95 94 90 86 88 88 100 95 95 82 80 68 86 95 72 96 100

Brad 100 65 80 72 60 60 70 70 60 70 80 74 80 85 60 60 60 62

Esther 94 80 74 58 48 52 48 64 92 64 68 66 56 84 66 72 85 85

Susan 100 90 100 70 90 84 84 84 90 100 82 84 100 90 100 90 100 100

Tom 100 100 88 96 94 92 96 94 90 94 100 96 92 100 98 100 97 98

M 97.14 85.14 87.71 79.71 78.57 76.57 79.14 81.71 87.57 86.43 83.14 81.14 79.14 89.86 82.71 80.00 85.71 87.86

Table 21 (Continued)

Teacher Confidence (%) on the Subject Prompt

Teacher D19 D20 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30 D31 D32 D33 D34 D35 M

Angelina 80 70 74 76 68 72 78 78 78 74 54 70 76 78 86 100 100 75.37

Ann 98 100 100 98 100 100 100 98 100 98 100 100 100 100 100 100 99 98.31

Beth 94 95 100 94 85 100 88 90 94 95 86 95 95 89 100 100 89 91.17

Brad 80 80 82 100 80 100 84 100 70 70 60 60 96 66 100 100 100 77.03

Esther 68 94 72 98 74 64 62 100 78 48 82 70 100 78 100 64 90 74.23

Susan 84 100 100 100 100 100 84 70 74 84 90 100 100 100 100 80 100 91.54

Tom 98 100 98 98 100 100 100 100 98 98 94 98 100 100 100 98 100 97.29

M 86.00 91.29 89.43 94.86 86.71 90.86 85.14 90.86 84.57 81.00 80.86 84.71 95.29 87.29 98.00 91.71 96.86 86.42


Table 22

Teacher Confidence (%) on the Cooperation Prompt

Teacher D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 D16 D17 D18

Angelina 92 82 80 84 72 70 82 80 82 94 88 78 78 100 78 80 74 74

Ann 92 98 98 92 88 92 98 98 98 88 90 98 94 100 96 100 99 100

Beth 79 85 88 85 87 89 85 88 73 89 86 87 87 86 69 74 75 80

Brad 100 70 70 70 60 70 62 60 60 75 80 70 80 60 70 70 66 60

Esther 74 74 88 80 68 64 62 54 74 64 66 74 70 80 100 88 86 94

Susan 100 100 94 70 90 90 84 76 90 84 96 100 100 100 100 98 100 100

Tom 86 100 100 98 94 84 86 92 100 100 98 100 100 100 100 96 100 100

M 89.00 87.00 88.29 82.71 79.86 79.86 79.86 78.29 82.43 84.86 86.29 86.71 87.00 89.43 87.57 86.57 85.71 86.86

Table 22 (Continued)

Teacher Confidence (%) on Cooperation Prompt

Teacher D19 D20 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30 D31 D32 D33 D34 D35 M

Angelina 80 78 86 92 64 74 80 76 80 74 60 76 64 80 92 92 92 80.23

Ann 100 100 98 100 100 100 100 98 98 100 96 100 100 100 100 100 100 97.40

Beth 74 78 80 75 80 96 78 90 93 88 100 74 77 80 87 87 77 83.03

Brad 80 80 80 80 80 98 80 70 78 90 70 80 100 80 94 100 100 76.94

Esther 100 66 90 92 90 78 82 100 88 78 88 82 100 76 90 100 66 80.74

Susan 100 92 100 96 100 100 92 84 74 84 100 96 100 90 100 100 100 93.71

Tom 98 98 100 98 100 100 100 100 96 98 100 100 100 100 100 100 100 97.77

M 90.29 84.57 90.57 90.43 87.71 92.29 87.43 88.29 86.71 87.43 87.71 86.86 91.57 86.57 94.71 97.00 90.71 87.12


Figure 3. The scatter plot for teacher agreement and confidence

Teacher Questionnaire Responses and Interviews

The teachers' perceptions when using the EDD checklist were examined. Their

positive and negative reactions as reported on the questionnaire were analyzed

descriptively. The EDD checklist evaluation focused primarily on its (a) clarity, (b)

redundancy, (c) relevance and usefulness, and (d) strengths and weaknesses. When the

teachers were asked about the number of times they read the essays as they marked them, two

teachers said “twice”, four teachers said “three times”, and one teacher said “more than

three times”. This result suggests that the EDD checklist prompted the teachers to read

essays carefully so that they could answer all 35 descriptors.

When it came to overall satisfaction with using the EDD checklist in their essay

assessments, two teachers said that they liked the checklist “a little bit”, one teacher liked

it “quite a lot”, and four teachers liked it “very much”. Specifically, five teachers said

that the EDD descriptors were clearly understood, whereas two teachers said that they

were not. Of these two teachers, one pointed out that the words “strong,” “clear,” and

“few” were too subjective to render a yes or no decision, while the other commented that

the descriptors "sophisticated or advanced vocabulary is used (D26)" and "a wide range of vocabulary is used (D27)" were highly dependent on the writer's educational background and the specific test context, rendering judgment difficult. She also


commented that it was not clear how many spaces were considered appropriate

indentation (D34).

Three teachers reported some redundancy in the EDD descriptors: two noted that

a single descriptor could be created by combining two seemingly similar descriptors: "this

essay is written clearly enough to be read without having to guess what the writer is

trying to say (D02)” and “grammatical or linguistic errors in this essay do not impede

comprehension (D19)”. Another teacher commented that there was considerable overlap

between “making run-on sentences or comma splices (D18)” and “misuse of punctuation

marks (D32)” because run-on sentences and comma splices naturally suggest a lack of

appropriate punctuation. These points were directly related to the multi-divisibility of ESL

writing skills, as shown in the Q-matrix construction. Along the same lines, an essay's

clarity can be partially achieved by writing error-free text, as the knowledge of

punctuation usage can partially prevent writers from creating run-on sentences or comma

splices.

Although all seven teachers agreed that the EDD checklist was useful and

relevant for assessing ESL academic writing ability, two teachers pointed out that the

EDD checklist was not comprehensive enough to capture all circumstances in ESL

academic writing. One teacher suggested that more descriptors related to content

development and argument presentation should be included in the checklist, and another

suggested that the ability to paraphrase or create pre-writing strategies should also be

assessed.

The teachers' evaluations of the EDD checklist were examined from a slightly

different perspective. They were asked to judge the relative importance of the descriptors

in developing students' ESL academic writing. The results indicated that most teachers

felt that the descriptors associated with content development and organization were much

more important than the others (such as punctuation) because the fundamental goal of

academic writing is to make an effective and persuasive argument. This argument echoed

the need for differential weighting on the descriptors, since certain descriptors might not

be as important as others that assess the core construct of ESL academic writing.

Of particular interest were the teachers' comments on the strengths and

weaknesses of the EDD checklist. Their open-ended responses highlighted a variety of


important issues. Overall, the teachers thought that the EDD checklist covered many

important elements of ESL academic writing and appreciated that the checklist enabled

them to view an essay in a comprehensive and detailed manner. One teacher commented

that “itemization of writer skills greatly helped to focus on what to look for during the

assessment.” Ironically, comprehensiveness was also considered a weakness: three

teachers said that the checklist was “too long” and “too time-consuming” to be

implemented in a classroom assessment. Conflicting opinions also existed with regard to

the use of binary choice: two teachers felt that this method was too limited to allow for

consistent decisions. By contrast, another teacher called binary choice the checklist's

genuine strength for its ability to facilitate accurate and fast decisions. Indeed, five

teachers reported on the Likert-scale questionnaire items that the EDD checklist was

conducive to making binary choices, while two did not agree.

In addition to the lengthy process and lack of scale, weighting was another

important issue raised by the teachers. Two rightly argued that certain descriptors must

be weighted more heavily than others to better reflect a student's overall writing

competence; Brad pointed out, for example, that the ability to make strong main

arguments is a more important writing skill than the ability to use capital letters and thus

deserves greater weighting:

I think the descriptors sometimes don't give an accurate reflection. For example,

some essays I graded were poor, but scored well, because the capitalizations

were fine or punctuations were okay. But, they missed the fundamental areas of

academic writing highlighted in descriptors 1-14, for example. (Brad)

Angelina also correctly noted that a writer might be unfairly penalized or

rewarded simply on the grounds that he or she did not employ a specific writing device:

I also found descriptor #29 slightly problematic in that in some cases I found it

hard to determine the test-taker's grasp of collocations and idiomatic expressions

because they rarely or simply did not use them in their answer. (Angelina)

She further questioned whether the EDD checklist considers both the frequency and the

nature of errors. As she rightly argued, consistent elementary spelling mistakes should be

treated differently from a single serious spelling mistake.

Despite these limitations, all of the teachers expressed positive views of the EDD

checklist's diagnostic function and appreciated its positive impact on student learning

and teacher instruction. The EDD checklist was thus determined to function as intended


and was confirmed for use in the main study. No revisions were made to the checklist,

not only because not all teachers raised the same problem, but because they often had

contradictory opinions. Even if the descriptors had been revised, it was unknown how the

revisions would affect the teachers' assessment behaviors or perceptions. In addition,

psychometric problems such as determining the relative weight of each descriptor could

not be approached without a precise procedure based upon empirical evidence. Creating

additional descriptors that took different facets of writing skills or language errors into

account was also a daunting task, given the already high number of descriptors. Instead,

in-depth rater training was held during the main study, using slightly revised assessment

guidelines in order to help teachers fully understand and effectively employ the checklist.

Summary

This chapter has examined three validity assumptions centered on the

preliminary evaluation of the EDD checklist. Each assumption was carefully examined

based upon multiple pieces of empirical evidence. The study's findings provided a

somewhat fuzzy picture of the generalizability of the scores derived from the checklist;

agreement rates among teachers were not substantially high in spite of high intra-teacher

reliability. The high correlation between EDD scores and TOEFL scores provided

convergent evidence for use of the checklist; however, this criterion-related validity

claim should be interpreted carefully because the two rating rubrics were developed for

different test purposes. Although the two sets of scores were highly correlated, divergent

evidence could indicate underlying differences between the two rubrics. Overall teacher

confidence and evaluation further justified the validity claims for the use of the EDD

checklist. Most teachers used it without much difficulty and valued its diagnostic

function. The chain of validity inquiries in this chapter evidenced the overall usability of

the EDD checklist and ensured its suitability for use in the main study. The next chapter

further discusses the primary evaluation of the EDD checklist.


CHAPTER 6

PRIMARY EVALUATION OF THE EDD CHECKLIST

Introduction

This chapter discusses the primary evaluation of the EDD checklist conducted in

Phase 2. The overall results of the pilot study were positive, allowing the checklist to be

used in modeling diagnostic writing skill profiles in the main study. Ten ESL teachers

assessed 480 TOEFL iBT independent essays using the checklist, and then evaluated its

use in the questionnaire and interviews. Both quantitative and qualitative data were

collected and analyzed in order to examine the three validity assumptions:

The EDD checklist provides a useful diagnostic skill profile of ESL academic

writing (Characteristics of the diagnostic ESL academic writing skill profiles).

Performance on the EDD checklist is related to performance on other measures

of ESL academic writing (Correlation between EDD scores and TOEFL scores).

The EDD checklist helps teachers make appropriate diagnostic decisions and has

the potential to positively impact teaching and learning ESL academic writing

(Teacher perceptions and evaluations).

Each of these assumptions addressed a different aspect of the validity argument and

provided valuable information used to justify the score-based interpretation and use of

the EDD checklist. The empirical evidence needed to examine each validity assumption

is discussed below.

Characteristics of the Diagnostic ESL Academic Writing Skill Profiles

Dimensional Structure of the EDD Checklist

The dimensional structure of the EDD checklist was analyzed both substantively

and statistically prior to the examination of its diagnostic capacity. The substantive

analysis was carried out based on the outcome of the descriptor sorting activity

conducted by the ESL experts. The results of the substantive analysis were used to

construct a Q-matrix. The statistical analysis was conducted using a series of conditional

covariance-based nonparametric dimensionality techniques. The results of both

substantive and statistical dimensionality analyses informed the extent to which the test

construct was multidimensional in relation to the assumption of the diagnostic

assessment model.


Substantive Dimensionality Analysis

Four ESL academic writing experts independently sorted the refined EDD

descriptors into dimensionally distinct ESL writing skills using their own skill

configuration. Each expert had a different conceptualization of ESL writing skills and

produced a different skill categorization scheme. Gary divided the descriptors into five

categories: (a) organization, (b) grammar, (c) vocabulary, (d) style, and (e) formatting,

commenting that vocabulary knowledge is closely related to writing style and that all

ESL writing skills are intertwined with each other. Gary also argued for a holistic

interpretation of writing, pointing out that it was difficult to analytically distinguish one

skill from the others, and that the whole is not always the sum of its parts.

Jane's categorization was particularly interesting, as she conceptualized ESL

writing skills from a hierarchical perspective, with a skill identification scheme layered

according to its (a) word, (b) sentence, (c) paragraph, and (d) essay components. From a

slightly different perspective, Anthony categorized descriptors into (a) idea development,

(b) organization, (c) language use, (d) vocabulary, and (e) punctuation, noting that idea

development is associated with the meaning of written text whereas organization is

focused on the form of written text. Anthony also suggested subdividing the language use

category into global and local levels. Alex's categorization scheme was similar to those

of Gary and Anthony, incorporating (a) content, (b) organization, (c) grammar, and (d)

mechanics. Overall, the experts' skill identification results indicated that the

predetermined sorting categories used in this study were comprehensive and compatible

with their empirical skill configurations.

After the experts had constructed their own skill schemes, they were asked to

identify skills-by-descriptors relationships using the predetermined skill categories

including (a) content fulfillment, (b) organizational effectiveness, (c) grammatical

knowledge, (d) vocabulary use, and (e) mechanics, so that a Q-matrix could be

constructed. Before beginning the sorting task, the four writing experts agreed that these

five writing skills described the characteristics of descriptors well, and represented the

construct of ESL academic writing. Table 23 shows the ways in which the experts related

the descriptors to a specific writing skill. It also indicates that while experts assigned a

single skill to most descriptors, multiple skills were assigned in some cases. Different


experts appeared to have different conceptualizations of content fulfillment and

organizational effectiveness (see the skill assignments to D01-D14) and grammatical

knowledge and mechanics (see the skill assignments to D17, D18, D31, and D33).

When the experts' agreement was examined, it was found that they had achieved 100% agreement on 20 of the descriptors. The areas in which the most discrepancy was exhibited were D2, D3, D14, and D35, descriptors which focused on a more holistic assessment of an essay's general quality. There was considerable disagreement on D35,

which assessed the tone and register of an essay, because it could have been mastered by

appropriate use of vocabulary or the consistent interplay of all aspects of ESL writing

skills throughout the essay. It was interesting that Anthony suggested that the grammar

category be subdivided into lexical, sentence, and discourse levels or correctable and

non-correctable error aspects; this suggestion was convincing considering that the

grammar skill included the greatest number of descriptors. However, further analysis was

not conducted in order to keep the grain size of all the skills consistent.

A Q-matrix was finally constructed based on the sorting activity outcomes. Each

skill-by-descriptor correspondence was reviewed, taking all of the experts' opinions into account. As many skills as possible were assigned to a descriptor if different skills were noted by the experts, and relevant ESL writing literature was consulted to make a final

judgment call. The initial Q-matrix entry can be found in the last column of Table 23.

Twenty-one descriptors were matched to one single skill, and the remaining 14 were

matched to multiple skills. Grammatical knowledge contained the greatest number of

descriptors, while vocabulary use and mechanics contained a relatively smaller number

of descriptors. Since students greatly desire feedback on grammatical problems in their

writing (Cohen & Cavalcanti, 1990; Ferris, 1995; Hedgcock & Lefkowitz, 1994; Leki,

1991), the large number of descriptors in this area is reasonable; however, the relatively

small number of descriptors in vocabulary use and mechanics was somewhat problematic

because it can cause instability of parameter estimates. The initial Q-matrix subjected to

diagnosis modeling can be found in Appendix O.
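In computational terms, the Q-matrix is simply a binary descriptors-by-skills matrix. The sketch below shows how the first three rows of the final column of Table 23 translate into such a matrix; the remaining 32 rows would be encoded in the same way.

    # Illustrative sketch: encode Q-matrix entries (Table 23, last column) as 0/1 rows.
    SKILLS = ["CON", "ORG", "GRM", "VOC", "MCH"]

    q_entries = {                 # descriptor -> skills assigned in the initial Q-matrix
        "D01": ["CON"],
        "D02": ["CON", "ORG"],
        "D03": ["CON", "ORG", "VOC"],
    }

    def q_row(assigned, skills=SKILLS):
        """Turn a list of assigned skills into a binary Q-matrix row."""
        return [1 if skill in assigned else 0 for skill in skills]

    q_matrix = {d: q_row(a) for d, a in q_entries.items()}
    print(q_matrix["D03"])        # [1, 1, 0, 1, 0]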


Table 23

Experts' Descriptor Classification

Descriptor Gary Jane Anthony Alex Q-matrix Entry

D01 CON CON CON CON CON

D02 ORG CON, ORG* CON, GRM, ORG ORG CON, ORG
D03 VOC CON, VOC, ORG CON GRM, ORG, CON CON, ORG, VOC

D04 CON CON, ORG ORG ORG,CON CON, ORG

D05 CON CON CON,ORG CON CON, ORG

D06 CON CON CON CON CON

D07 ORG CON CON ORG CON, ORG

D08 CON CON CON CON CON

D09 ORG ORG ORG ORG ORG

D10 ORG ORG ORG ORG ORG

D11 ORG ORG ORG,CON ORG ORG, CON

D12 ORG ORG ORG ORG ORG

D13 ORG ORG CON CON ORG, CON

D14 ORG ORG ORG ORG, GRM, VOC ORG, GRM, VOC

D15 GRM GRM GRM GRM GRM

D16 GRM GRM GRM GRM GRM

D17 GRM GRM,MCH GRM GRM GRM, MCH

D18 GRM GRM,MCH GRM GRM GRM, MCH

D19 GRM GRM GRM GRM GRM

D20 GRM GRM GRM GRM GRM

D21 GRM GRM GRM GRM GRM

D22 GRM GRM GRM GRM GRM

D23 GRM GRM GRM GRM GRM

D24 GRM GRM GRM GRM GRM

D25 GRM GRM GRM GRM GRM

D26 VOC VOC VOC VOC VOC

D27 VOC VOC VOC VOC VOC

D28 VOC VOC VOC VOC VOC

D29 GRM VOC VOC GRM VOC, GRM


Table 23 (Continued)

Descriptor Gary Jane Anthony Alex Q-matrix Entry

D30 VOC GRM GRM GRM GRM, VOC

D31 MCH MCH GRM MCH MCH, GRM

D32 MCH MCH MCH MCH MCH

D33 MCH MCH GRM MCH MCH, GRM

D34 MCH MCH MCH MCH MCH

D35 VOC CON, GRM, VOC VOC, ORG GRM, MCH VOC, GRM, CON, ORG, MCH

Note. When multiple skills are assigned to a descriptor, a primary skill appears before a secondary skill.

For example, the notation of CON, ORG indicates that CON is a primary skill and ORG is a secondary

skill. CON=content fulfillment, ORG=organizational effectiveness, GRM=grammatical knowledge,

VOC=vocabulary use, and MCH=mechanics.

Statistical Dimensionality Analysis

An exploratory DIMTEST analysis resulted in the rejection of the null

hypothesis of unidimensionality with an extremely small p-value, T = 7.28, p < .001.

Twelve descriptors were selected as an initial AT set by the program, including six CON

descriptors (D01, D03, D04, D05, D07, and D08) and six ORG descriptors (D09, D10,

D11, D12, D13, and D14). The subsequent exploratory DIMTEST analysis failed to

reject the null hypothesis, suggesting that CON and ORG skills are dimensionally

distinct from GRM, VOC, and MCH skills.

DETECT was then performed in an exploratory manner in order to estimate the

number of dimensions present in the data and the magnitude of the multidimensionality.

As Table 24 shows, the descriptors were separated into four clusters maximizing the

DETECT index. Consistent with the results of the exploratory DIMTEST analysis, the

CON and ORG descriptors (D01-D14) constituted the first cluster. The DETECT value

was noticeably large (DETECT index = 1.25), indicating strong evidence of

multidimensionality. In addition, the IDN and r indices were close to 1 (IDN index =

0.82 and r index = 0.79), indicating that the approximate simple structure held true for

the data.


Table 24

Descriptor Clusters Identified by DETECT

Cluster Descriptor

1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

2 15, 17, 26, 27

3 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30, 31, 32, 33

4 34, 35

An exploratory CCPROX/HCA analysis was also performed to visually examine

the most interpretable cluster solution in the data. The primary skill of each descriptor

was analyzed in order to represent the skills-by-descriptors relationship. Figure 4

displays part of the CCPROX/HCA output from Levels 18 to 34. Each column illustrates

one level of cluster analysis, with descriptors within the different clusters separated by

asterisks (***). Visual inspection suggested that the five-cluster solution is likely to be

the most interpretable. From Level 22, the CON descriptors began to form one large

cluster without being disjointed by other skill descriptors. The ORG and VOC

descriptors were also found to form two distinct clusters from the early stage of the HCA

solution. Although GRM and MCH descriptors showed some fuzzy areas within their

clusters, they also appeared to possess separate dimensions.


Figure 4. CCPROX/HCA results


The dimensional hypothesis developed by using exploratory methods was further

tested by using a confirmatory DIMTEST, which examined whether the data held

multidimensionality manifesting the five identified writing skills (CON, ORG, GRM,

VOC, and MCH). In Table 25, the five DIMTEST runs show five rejections of the null

hypothesis of unidimensionality, indicating that the five writing skills are statistically distinct dimensions, p < .01. The dimensionality test statistics, T, further

indicated that CON, ORG, and GRM had a greater magnitude of multidimensionality

than VOC and MCH.

Table 25

Confirmatory DIMTEST Results

Writing skill No. of descriptors T p

CON 8 6.0456 0.00

ORG 6 8.5905 0.00

GRM 12 6.4971 0.00

VOC 5 3.3222 0.00

MCH 4 5.3902 0.00

A set of exploratory and confirmatory dimensionality analyses determined that

ESL academic writing ability comprises five distinct skills, CON, ORG, GRM, VOC,

and MCH, and proved that the underlying dimensional structure of CON and ORG was

distinctively different from that of GRM, VOC, and MCH. This result is consistent with

ESL academic writing theories that characterize writing ability as a constellation of

multiple skills. The dimensionality results further confirmed that the initial Q-matrix

specifying the relationship between the five writing skills and the descriptors was

reasonable, making it possible to begin estimating a diagnostic model.

Diagnostic Function of the EDD Checklist

The diagnostic model of the EDD checklist was examined from a variety of

perspectives, beginning with an examination of model convergence, then proceeding to

parameter estimation and model fit. Diagnostic skills mastery profiles were then

constructed in order to examine its diagnostic capacity.


Evaluation of Model Convergence

A Markov Chain length of 20,000 was used with a burn-in length of 10,000.

Figures 5 and 6 show the three different types of plots that were visually inspected to

determine whether steady state had been achieved in the Markov Chain. Two model

parameters, pMCH (the pk parameter estimate for MCH, i.e., the proportion of masters for MCH) and r*2,2 (the r* parameter estimate for ORG on D02), were used as examples because

these two parameters showed the most unstable and jumpy chains after the burn-in phase.

The first graph in each of Figures 5 and 6 is a density plot, in which one dominant mode is indicative of model convergence. The second graph is a time-series plot, which signals non-convergence when the chain is jumpy or monotonic. The third graph is an autocorrelation plot, in which slow convergence is indicated by autocorrelations

greater than 0.2 for lags smaller than 200.

A visual inspection of the plots for pMCH in Figure 5 indicates that although a

slightly unstable and jumpy pattern was observed in the time-series plot, the overall skill

estimation was considered to have converged: a unimodal distribution was found in the density plot, and the autocorrelations were low after the burn-in phase. Figure 6 illustrates another possible case of slow convergence, in r*2,2. Despite the slightly jumpy pattern observed in

the time-series plot, the overall descriptor estimation appeared to have converged as

indicated by the other two plots.

Figure 5. Density, time-series, and autocorrelation plots for pMCH

Figure 6. Density, time-series, and autocorrelation plots for r*2,2

Each of the model parameter estimates was examined in this way in order to determine the convergence. Although a few parameters exhibited evidence of slow convergence, as was the case with pMCH and r*2,2, the overall pattern of the Markov Chain

plots suggests that convergence had occurred for most of the parameter estimates.
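The autocorrelation check described above can be reproduced with a few lines of code. The sketch below computes lag-k autocorrelations for a simulated post-burn-in chain and flags slow convergence when any autocorrelation exceeds 0.2 at a lag below 200; it stands in for, rather than reproduces, the MCMC output inspected in this study.

    # Illustrative sketch: lag-k autocorrelation of a post-burn-in chain, with slow
    # convergence flagged when autocorrelation > 0.2 for some lag below 200.
    # The chain here is simulated; it stands in for the study's actual MCMC draws.
    import random

    def autocorrelation(chain, lag):
        n = len(chain)
        m = sum(chain) / n
        var = sum((x - m) ** 2 for x in chain)
        cov = sum((chain[i] - m) * (chain[i + lag] - m) for i in range(n - lag))
        return cov / var

    random.seed(0)
    chain = [random.gauss(0.6, 0.05) for _ in range(10000)]   # simulated p_MCH draws
    slow = any(autocorrelation(chain, lag) > 0.2 for lag in (10, 50, 100, 199))
    print("slow convergence" if slow else "no evidence of slow convergence")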

Evaluation of Parameter Estimation

As model convergence had been achieved, the descriptor parameter estimates

were evaluated substantively and statistically to determine the diagnostic quality of each

descriptor relative to its required skills. Table 26 presents a list of initial parameter estimates, π* and r*, for the 35 descriptors. When the π* parameter was inspected, it was found that D34 had a π* value less than 0.6, indicating that students were unlikely to correctly execute MCH to appropriately indent the first sentence of each paragraph in their writing even if they had mastered the skill. Although the MCH skill's weak association with the ability to indent was suspected of contributing to the low π* value, the reassignment of the skill was not conducted because of the lack of substantive evidence supporting such a revision.

Table 26

Initial Descriptor Parameter Estimates

Descriptor π* r* (CON) r* (ORG) r* (GRM) r* (VOC) r* (MCH)

D01 0.87 0.68

D02 0.94 0.47 0.81

D03 0.94 0.46 0.69 0.81

D04 0.88 0.74 0.61

D05 0.98 0.10 0.80

D06 0.77 0.46

D07 0.95 0.29 0.86

D08 0.86 0.28

D09 0.91 0.52

D10 0.80 0.06

D11 0.91 0.85 0.30

D12 0.91 0.45

D13 0.83 0.43 0.35

D14 0.78 0.37 0.82 0.88

D15 0.89 0.62

D16 0.99 0.65

D17 0.90 0.78 0.66


Table 26 (Continued)

Descriptor π* r* (CON) r* (ORG) r* (GRM) r* (VOC) r* (MCH)

D18 0.87 0.77 0.48

D19 0.84 0.29

D20 0.86 0.46

D21 0.93 0.74

D22 0.96 0.73

D23 0.92 0.56

D24 0.87 0.51

D25 0.91 0.76

D26 0.77 0.12

D27 0.97 0.30

D28 0.83 0.67

D29 0.87 0.27 0.28

D30 0.87 0.41 0.97

D31 0.81 0.60 0.48

D32 0.88 0.27

D33 0.97 0.89 0.59

D34 0.54 0.82 0.67

D35 0.94 0.95 0.94 0.96 0.96 0.80

Of great interest were descriptor parameters with a high r* value. Six parameters exhibited an r* value greater than 0.9, making them candidates for

elimination from the initial Q-matrix entries. These parameters were revisited in order to

inspect the model convergence, descriptors-by-skills relationship (ratio), and importance

of each skill to a particular descriptor, after which they were dropped from the Q-matrix

entries in a step-wise manner. Most parameters were removed from D35 because the four

skills, CON, ORG, GRM, and VOC, were found to be non-informative in accurately

estimating the ability to use appropriate tone and register. The insignificant contribution

of VOC was somewhat unexpected because vocabulary knowledge has long been

considered to have an association with tone and register from a theoretical point of view.

The finalized descriptor parameter estimates are presented in Table 27. Most of the π* values were close to 1 (except for D34), supporting the robustness of the skills diagnosis modeling. The low r* values were also indicative of high diagnostic power, suggesting that these parameters contribute much information for distinguishing masters from non-masters of a particular skill.


Table 27

The Final Descriptor Parameter Estimates

Descriptor π* r* (CON) r* (ORG) r* (GRM) r* (VOC) r* (MCH)

D01 0.87 0.68

D02 0.93 0.48 0.80

D03 0.94 0.47 0.69 0.80

D04 0.88 0.72 0.64

D05 0.97 0.10 0.81

D06 0.77 0.46

D07 0.95 0.29 0.88

D08 0.86 0.28

D09 0.91 0.51

D10 0.79 0.05

D11 0.90 0.84 0.29

D12 0.90 0.45

D13 0.82 0.43 0.34

D14 0.78 0.36 0.81 0.88

D15 0.89 0.62

D16 0.99 0.65

D17 0.91 0.80 0.68

D18 0.88 0.80 0.53

D19 0.83 0.30

D20 0.86 0.46

D21 0.93 0.74

D22 0.96 0.72

D23 0.92 0.54

D24 0.87 0.51

D25 0.91 0.76

D26 0.76 0.12

D27 0.97 0.30

D28 0.82 0.69

D29 0.88 0.24 0.28

D30 0.87 0.39

D31 0.82 0.66 0.51

D32 0.90 0.42

D33 0.96 0.59

D34 0.54 0.84 0.74

D35 0.90 0.76

Note. Q-matrix entries that were dropped due to non-significance are omitted from the table.


Once the descriptor parameter estimates were evaluated, the skill parameter

estimates were inspected. Figure 7 presents the proportion of masters (pk) across the five

skills in the examinee population. MCH had the highest proportion of masters

(pMCH=0.62), whereas VOC had the lowest proportion of masters (pVOC=0.46).

Considering that skills with low pk correspond to skills that are expected to be difficult

and skills with high pk correspond to skills that are expected to be easy, the result was

interpreted to determine the difficulty hierarchy of the five writing skills. Therefore,

VOC (pVOC=0.46) was considered the most difficult skill, followed by CON (pCON=0.50),

GRM (pGRM=0.53), ORG (pORG=0.58), and MCH (pMCH=0.62). This hierarchy pattern was

consistent with research findings in ESL writing indicating that while ESL learners may

have to make a substantial effort to expand their vocabulary, they acquire mechanical

writing conventions relatively easily (Leki & Carson, 1994; Raimes, 1985; Silva, 1992).

It was also reasonable that content fulfillment was the second most difficult skill, given

that presenting ideas in a logical piece of writing is a cognitively demanding task.

Figure 7. Proportion of skill masters (pk)

Evaluation of Model Fit

As the parameter estimates were determined to be satisfactory, the model fit was

examined using posterior predictive model checking methods. Figure 8 compares the fit

between observed and predicted score distributions. While the predicted score

distributions approximated the observed score distributions, misfit was found at the

lowest and highest distributions, indicating that the model overestimated the low-level


students and underestimated high-level students. Nonetheless, the misfit was considered

negligible because the overestimated low-level students were still classified as non-

masters on all skills and the underestimated high-level students were still classified as

masters on all skills. The Mean Absolute Difference (MAD) between predicted and

observed item proportion-correct scores was also 0.0027, a negligibly small value

supporting the claim of good fit.

Figure 8. Observed and predicted score distributions
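The MAD index reported above is straightforward to compute: it is the mean of the absolute differences between the observed and model-predicted proportion-correct values across the 35 descriptors. The sketch below uses short illustrative vectors rather than the study data.

    # Illustrative sketch: Mean Absolute Difference (MAD) between observed and predicted
    # proportion-correct values. The two short vectors are placeholders, not study data.
    observed  = [0.73, 0.65, 0.74, 0.83, 0.46]
    predicted = [0.732, 0.648, 0.741, 0.829, 0.463]

    mad = sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
    print(round(mad, 4))   # small values (the study reports 0.0027) indicate good fit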

The overall goodness of the model fit was also evaluated by examining whether

a monotonic relationship existed between the number of mastered skills and observed

total scores. A monotonic relationship was assumed to be an indication of good fit.

Figure 9 presents the relationship between the two variables for the 480 students. Each

data point represents a cluster of students, and a master was determined when a student's posterior probability of mastery (ppm) for a skill was greater than 0.6. The

linear relationship between the two variables supported the claim of good fit as

evidenced by the large magnitude of the positive association, Pearson product-moment correlation coefficient r = .915, p < .001.


Figure 9. The relationship between the number of mastered skills and the total scores

Evaluation of Diagnostic Quality

Performance difference between masters and non-masters

As model convergence and fit were achieved, diagnostic capacity was examined

by comparing the proportion-correct scores of masters and non-masters across the 35

descriptors. A drastic performance difference between masters and non-masters was

assumed to be strong evidence of the descriptors' good diagnostic capacity. Figure 10

shows that the descriptor masters have performed decidedly better than descriptor non-

masters. The proportion-correct score differences between masters and non-masters

ranged from 0.14 to 0.82 across the 35 descriptors with the mean of 0.49, suggesting that

the descriptors distinguished masters from non-masters well.

Figure 10. Performance difference between descriptor masters and non-masters


Although the overall diagnostic capacity of the descriptors was satisfactory, an

in-depth analysis was conducted on individual descriptors that were suspected of having

poor diagnostic power. Descriptors with proportion-correct score differences between masters and non-masters of less than 0.4 were flagged. Table 28 lists these descriptors with their proportion-correct scores (p-values), π* estimates, and r* estimates. Approximately 34% of the descriptors exhibited poor diagnostic power. The descriptive analysis indicated that these descriptors were relatively easy compared to the others, with proportion-correct scores greater than the mean of 0.64 (see the second column for the p-values for the total group). Their relatively high r* values were problematic, suggesting low discriminating power in distinguishing masters from non-masters. D34 in particular exhibited the poorest diagnostic power, with a proportion-correct score difference between masters and non-masters of only 0.14, an extremely low π* value (0.54), and high r* values (0.84 for ORG and 0.74 for MCH).

Table 28

Descriptors with Poor Diagnostic Power

Descriptor   p (Total)   p (Masters)   p (Non-masters)   π*     r*
D01          0.73        0.87          0.59              0.87   0.68 (CON)
D04          0.65        0.88          0.50              0.88   0.72 (CON), 0.64 (ORG)
D15          0.74        0.90          0.54              0.89   0.62 (GRM)
D16          0.83        1.00          0.62              0.99   0.65 (GRM)
D17          0.73        0.92          0.60              0.91   0.80 (GRM), 0.68 (MCH)
D18          0.67        0.90          0.50              0.88   0.80 (GRM), 0.53 (MCH)
D21          0.82        0.94          0.68              0.93   0.74 (GRM)
D22          0.84        0.96          0.69              0.96   0.72 (GRM)
D25          0.81        0.92          0.69              0.91   0.76 (GRM)
D28          0.68        0.85          0.56              0.82   0.69 (VOC)
D34          0.46        0.52          0.38              0.54   0.84 (ORG), 0.74 (MCH)
D35          0.82        0.92          0.65              0.90   0.76 (MCH)
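The flagging rule is easy to express in code. The sketch below uses randomly generated stand-ins for the binary checklist ratings and for the master/non-master status of each student on the skill a descriptor measures; only the 0.4 cut-off and the comparison logic reflect the procedure described above.

import numpy as np

rng = np.random.default_rng(2)
n_students, n_descriptors = 480, 35

# Illustrative stand-ins: binary checklist ratings (1 = "yes") and a
# master / non-master flag per student for the skill each descriptor loads on.
ratings = rng.integers(0, 2, size=(n_students, n_descriptors))
is_master = rng.integers(0, 2, size=(n_students, n_descriptors)).astype(bool)

flagged = []
for d in range(n_descriptors):
    p_master = ratings[is_master[:, d], d].mean()
    p_nonmaster = ratings[~is_master[:, d], d].mean()
    if p_master - p_nonmaster < 0.4:   # weak separation between the two groups
        flagged.append((f"D{d + 1:02d}", round(p_master - p_nonmaster, 2)))

print("Descriptors with poor diagnostic power:", flagged)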


Accuracy of skill mastery classification

a. Number of skill masters

As the overall diagnostic capacity appeared to be satisfactory, writing skill

profiles were constructed by classifying students into master, non-master, or

undetermined groups. Students with a posterior probability of mastery (ppm) greater than

0.6 for a skill were classified as masters of that skill. Those with a ppm lower than 0.4

were classified as non-masters, and those with a ppm between 0.4 and 0.6 were

undetermined (i.e., neither masters nor non-masters). A mastery classification that mirrored the proportion of skill masters (pk) was taken as evidence of the accuracy of the diagnostic model. Figure 11 presents the skill mastery classifications for the 480 students.

The greatest number of students (n=290) mastered MCH, whereas the smallest number of

students (n=203) mastered VOC. Along the same lines, the smallest number of students

(n=156) did not master MCH, whereas the greatest number of students (n=244) did not

master VOC. This result echoes the findings of the proportion of skill masters (pk)

discussed in Figure 7: the highest probability of mastery was found in MCH (pMCH=0.62),

and the lowest probability of mastery was found in VOC (pVOC=0.46).
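A minimal sketch of the three-way classification rule, with randomly generated ppm values standing in for the model estimates:

import numpy as np

rng = np.random.default_rng(3)
skills = ["CON", "ORG", "GRM", "VOC", "MCH"]

# Illustrative stand-in for the posterior probability of mastery (ppm)
# of each of the 480 students on each of the five skills.
ppm = rng.uniform(0, 1, size=(480, len(skills)))

def classify(p):
    """Apply the study's cut-offs: >0.6 master, <0.4 non-master, else undetermined."""
    if p > 0.6:
        return "master"
    if p < 0.4:
        return "non-master"
    return "undetermined"

for j, skill in enumerate(skills):
    labels = [classify(p) for p in ppm[:, j]]
    counts = {c: labels.count(c) for c in ("master", "non-master", "undetermined")}
    print(skill, counts)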

Figure 11. Classification of skill mastery (masters: CON = 231, ORG = 272, GRM = 248, VOC = 203, MCH = 290; non-masters: CON = 238, ORG = 194, GRM = 216, VOC = 244, MCH = 156; undetermined: CON = 11, ORG = 14, GRM = 16, VOC = 33, MCH = 34)

b. Skills probability distribution

The skills mastery classification was further examined on simulated examinee

item response data (n=100,000 simulees). If the estimated model had high diagnostic


function, it was assumed to generate various types of skills mastery profiles, reducing the

possibility of flat skill profiles. The built-in Arpeggio program, Simarpeggulator,

estimated a population probability distribution on the space of all possible 0 (non-

mastery) and 1 (mastery) skill mastery level profile vectors. As the number of skills was

K = 5, the joint population skills distribution consisted of 32 possible mastery profiles

with 0 and 1 vectors: (00000), (00001), (00010), (00100),…, (11111). Figure 12

summarizes the distribution of students across different numbers of mastered skills,

illustrating that roughly similar proportions of students (ranging from 11.15% to 21.40%) were distributed across the skill-mastery categories. The difference between the zero-

skill mastery profile (in which the smallest proportion of students were assigned) and

three-skill mastery profile (in which the largest proportion of students were assigned)

was only 10.25%. It was also notable that the flat skill profiles did not dominate other

skill profiles. As the graph clearly illustrates, students who fell into the two flat categories (00000 and 11111) accounted for only 11.15% and 19.29% of the total, respectively, indicating the high

discriminant function of the estimated skills diagnostic model.
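The profile space and the aggregation by number of mastered skills can be sketched as follows; the profile probabilities here are randomly generated stand-ins for the population distribution estimated by Simarpeggulator.

from collections import Counter
from itertools import product

import numpy as np

rng = np.random.default_rng(4)

# All 2^5 = 32 possible mastery profiles for K = 5 skills, e.g. (0,0,0,0,0) ... (1,1,1,1,1).
profiles = list(product([0, 1], repeat=5))

# Illustrative stand-in for the population profile probabilities.
probs = rng.dirichlet(np.ones(len(profiles)))

# Aggregate the profile probabilities by the number of mastered skills and
# inspect the two "flat" profiles (all non-mastery, all mastery).
by_count = Counter()
for profile, p in zip(profiles, probs):
    by_count[sum(profile)] += p

print({k: round(v * 100, 2) for k, v in sorted(by_count.items())})
print("flat profiles (%):",
      round(probs[profiles.index((0,) * 5)] * 100, 2),
      round(probs[profiles.index((1,) * 5)] * 100, 2))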

Figure 12. Distribution of the number of mastered skills (0 skills: 11.15%; 1: 16.14%; 2: 16.05%; 3: 21.40%; 4: 15.98%; 5: 19.29%)

c. Most common skill mastery patterns

Figure 13 presents the most common skill mastery pattern within each number-of-mastered-skills category. When students mastered only one skill, the profile was most often "00001" rather than the other one-skill profiles ("10000", "01000", "00100", or "00010").


When students mastered two skills, "01001" was the most prevalent skill

mastery pattern. By inspecting the most common skills mastery patterns, it was expected

that the skill difficulty estimated by the diagnostic model would be further supported or

rejected. If the skill difficulty identified in the previous analyses did not hold, the

diagnostic accuracy of the model would have been suspect. Figure 13 shows that students

tended to master easy skills, such as MCH (00001) and ORG (01001), before they

mastered more difficult skills. Of the five skills, VOC was typically the last skill that students mastered, as indicated by the four-skill pattern "11101", in which only VOC remains unmastered. This skill development was

consistent with the skill difficulty (pk) discussed in Figure 7, confirming that VOC is the

most difficult and MCH is the easiest skill. It is also interesting that the six skill mastery

patterns shown in Figure 13 are among the seven most frequent profiles, further

confirming the hierarchy of skill difficulty.

Figure 13. The most common skill mastery pattern in each number-of-mastered-skills category (00000: 11.15%; 00001: 6.21%; 01001: 5.38%; 11001: 3.50%; 11101: 4.40%; 11111: 19.29%)

Consistency of skill mastery classification

The diagnostic quality of the estimated model was also evaluated by focusing on the

consistency of the skill classification. The built-in Arpeggio program, Tabulator, used

simulated examinee item response data (n=100,000 simulees) to calculate (a) the

proportion of times that each student was classified correctly on the test according to the

known true skill state (correct classification rate: CCR), (b) the proportion of times each


student was classified the same on the two parallel tests (test-retest consistency: TRC),

and (c) classification agreement adjusted for chance. Table 29 presents several

reliability indices of the skill classification using simulated examinee item response data

(n=100,000 simulees). The overall CCR and TRC values were high (M = 0.94 for overall

CCR and M = 0.89 for overall TRC), supporting the consistency of the skill classification.

In particular, CON showed the highest reliability indices and MCH the lowest. Cohen's kappa statistics echoed these results, with substantially high chance-adjusted agreement across the five skills.

Table 29

Consistency Indices of Skill Classification

Skill   Overall CCR   CCR for masters   CCR for non-masters   Cohen's kappa   Overall TRC   TRC for masters   TRC for non-masters
CON     0.97          0.97              0.97                  0.94            0.94          0.94              0.95
ORG     0.96          0.96              0.95                  0.91            0.92          0.92              0.90
GRM     0.96          0.97              0.95                  0.92            0.92          0.94              0.91
VOC     0.93          0.93              0.94                  0.87            0.88          0.87              0.88
MCH     0.88          0.93              0.81                  0.75            0.80          0.87              0.69
M       0.94          0.95              0.93                  0.88            0.89          0.91              0.87
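For one skill, the three consistency indices can be sketched as follows, with randomly generated true and estimated mastery states standing in for the Tabulator output; Cohen's kappa is computed directly from the binary agreement rates.

import numpy as np

rng = np.random.default_rng(5)
n_simulees = 100_000

# Illustrative stand-ins for a single skill: the true mastery state of each
# simulee and the states estimated from two parallel simulated administrations,
# each agreeing with the truth about 95% of the time.
true_state = rng.integers(0, 2, n_simulees)
est_test1 = np.where(rng.random(n_simulees) < 0.95, true_state, 1 - true_state)
est_test2 = np.where(rng.random(n_simulees) < 0.95, true_state, 1 - true_state)

ccr = (est_test1 == true_state).mean()   # correct classification rate
trc = (est_test1 == est_test2).mean()    # test-retest consistency

# Cohen's kappa: observed agreement with truth, corrected for chance agreement.
p_obs = ccr
p_true, p_est = true_state.mean(), est_test1.mean()
p_chance = p_true * p_est + (1 - p_true) * (1 - p_est)
kappa = (p_obs - p_chance) / (1 - p_chance)

print(f"CCR = {ccr:.2f}, TRC = {trc:.2f}, kappa = {kappa:.2f}")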

Tabulator also calculated the proportion of examinees whose estimated mastery

classification was correct. In other words, it examined the probability that an examinee is estimated to be a master when the true state is mastery, and a non-master when the true state is non-mastery, for each skill. Table 30 demonstrates that 96.6% of the simulees had no errors or only one error in their estimated skill profiles, indicating that

having more than one incorrect skill mastery classification is very unlikely. The high

correct estimation rates thus further confirmed that the diagnostic skill profiles generated

by the model are reliable.

Table 30

Proportion of Incorrect Patterns Classified by the Number of Skills

No. of incorrectly classified skills   0      1      2     3     4     5
Proportion of simulees (%)             74.0   22.6   3.1   0.3   0.0   0.0
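A minimal sketch of how the per-simulee error counts behind Table 30 might be tabulated, again with randomly generated stand-ins for the true and estimated skill states:

import numpy as np

rng = np.random.default_rng(6)
n_simulees, n_skills = 100_000, 5

# Illustrative stand-ins: true and estimated mastery states (0/1) for each
# simulee on each of the five skills, with roughly 95% per-skill accuracy.
true_states = rng.integers(0, 2, size=(n_simulees, n_skills))
noise = rng.random((n_simulees, n_skills)) < 0.05
estimated = np.where(noise, 1 - true_states, true_states)

# Count, for each simulee, how many of the five skills were misclassified,
# then tabulate the proportion of simulees by number of errors.
n_errors = (estimated != true_states).sum(axis=1)
proportions = np.bincount(n_errors, minlength=n_skills + 1) / n_simulees
for k, prop in enumerate(proportions):
    print(f"{k} misclassified skills: {prop * 100:.1f}%")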


Skill mastery profiles across different essay prompts

a. Proportion of skill masters

The diagnostic quality of the estimated model was further examined by focusing

on the extent to which the skill mastery profiles are constructed differently across

different essay prompts. A diagnostically robust model was assumed to generate stable

skill mastery profiles without being affected by the method effect. Figure 14 shows the

proportion of masters across the five writing skills for the subject and cooperation

prompts. The graph clearly illustrates that the two prompts attained a similar proportion

of masters for all the skills but MCH. Although the cooperation prompt showed a slightly higher proportion of masters for CON, GRM, and VOC, the differences appeared to be negligible; however, the difference in the mastery proportion for MCH was substantial (13.33%), indicating that this skill functioned less consistently across prompts.

Figure 14. Proportion of masters for the subject and cooperation prompts

b. Most common skill mastery patterns

Fine-grained skill mastery profiles were thought to provide more specific

information about underlying performance differences across the two prompts. Figures

15 and 16 present the most common skills mastery patterns in the different numbers of

skills that students mastered. The high mastery probability of MCH on the cooperation

prompt suggested in Figure 14 was clearly manifested in the specific skills mastery

patterns of “00001”, “01001”, “11001”, “10111”, and “11111” in Figure 16, indicating


that MCH is the most basic skill and tends to be mastered before other skills on that prompt.

Figure 15. The most common skill mastery patterns for the subject prompt (00000: 13.75%; 01000: 5.83%; 01001: 5.42%; 00111: 3.75%; 11101: 5.00%; 11111: 14.58%)

Figure 16. The most common skill mastery patterns for the cooperation prompt (00000: 13.33%; 00001: 8.75%; 01001: 5.00%; 11001: 5.00%; 10111: 5.00%; 11111: 22.92%)

c. Number of mastered skills

The extent to which students are likely to master the same number of skills

across different prompts was also expected to provide valuable insights into the

diagnostic capacity of the model. If a significant discrepancy were found across the

prompts, the robustness of the diagnostic model would be suspect. Figure 17 compares

the proportion of masters across the different numbers of skills that students mastered on


the two prompts. Although almost the same proportion of students mastered zero, one,

three, or four skills, a notable difference was found in the proportion of students who

mastered two or five skills. The mastery probability for five skills was considerably higher for the cooperation prompt, whereas the mastery probability for two skills was higher for the subject prompt. This performance difference suggests that the diagnostic function of

the model must be carefully reexamined.

Figure 17. Number of mastered skills for the subject and cooperation prompts

Skill mastery profiles across different proficiency levels

a. Overall proportion of skill masters

Student diagnostic skill profiles were examined in order to focus on how skills

mastery profiles differ across different writing proficiency levels. A diagnostically well-

constructed model was assumed to produce skill profiles that had distinctively different

characteristics across different proficiency levels. The 480 students were divided into

three proficiency groups: beginner, intermediate, and advanced. The beginner group was

composed of those whose TOEFL independent writing scores ranged from 1 to 2.5 (n=103), the intermediate group's scores ranged from 3 to 3.5 (n=205), and the advanced group's scores ranged from 4 to 5 (n=172). Figure 18 shows the proportion of skill

masters across different proficiency levels, and indicates that the estimated diagnostic

model differentiates substantially among students at different writing proficiency levels.
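A minimal sketch of this grouping and the per-group mastery proportions, with randomly generated stand-ins for the TOEFL scores and mastery flags (only the score bands reflect the actual grouping used in the study):

import numpy as np

rng = np.random.default_rng(7)
skills = ["CON", "ORG", "GRM", "VOC", "MCH"]
n_students = 480

# Illustrative stand-ins: TOEFL iBT independent writing scores (1-5 in
# half-point steps) and master flags (ppm > 0.6) per student and skill.
toefl = rng.choice(np.arange(1.0, 5.5, 0.5), size=n_students)
is_master = rng.random((n_students, len(skills))) < 0.5

# Grouping used in the study: 1-2.5 beginner, 3-3.5 intermediate, 4-5 advanced.
groups = {
    "Beginner": (toefl <= 2.5),
    "Intermediate": (toefl >= 3.0) & (toefl <= 3.5),
    "Advanced": (toefl >= 4.0),
}

for name, mask in groups.items():
    pct = is_master[mask].mean(axis=0) * 100
    print(name, {s: f"{p:.1f}%" for s, p in zip(skills, pct)})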


The skill mastery probabilities for the intermediate and advanced groups were decidedly

higher than those for the beginner group across the five skills.

Figure 18. Proportion of masters across different proficiency levels

Notably, the beginner and intermediate groups showed a very similar skills

mastery pattern that was distinctively different from that of the advanced group. These

two groups had a higher proportion of masters for ORG (34.95% for the beginner group

and 54.63% for the intermediate group) and MCH (29.13% for the beginner group and

60.00% for the intermediate group), and a lower proportion of masters for GRM (14.56%

for the beginner group and 40.00% for the intermediate group) and VOC (11.65% for the

beginner group and 32.20% for the intermediate group). Conversely, a substantially

higher proportion of students in the advanced group mastered GRM (87.79%) and VOC

(72.67%). These results were consistent with the skill difficulty discussed in the previous

analyses. Considering that VOC was the most difficult skill, it was reasonable that more

proficient students showed a higher probability of mastery for this skill.

b. Proportion of skill masters across different essay prompts

It was also worthwhile to examine whether the three groups maintained

distinctive skill mastery patterns without being affected by the prompt effect. The 240 students in each of the subject and cooperation prompt groups were further divided into beginner,

intermediate, and advanced groups. The subject prompt group consisted of 54 beginner,


105 intermediate, and 81 advanced students, and the cooperation prompt group consisted

of 49 beginner, 100 intermediate, and 91 advanced students. Figures 19 and 20 compare

the skill mastery patterns for the three groups across the two essay prompts. The overall

skill mastery patterns for the beginner and advanced groups did not differ across the two

prompts: although a slightly higher proportion of students in the cooperation group

mastered MCH (32.65% for the beginner group and 81.32% for the advanced group) than

did students in the subject group (25.93% for the beginner group and 77.78% for the

advanced group), the general skill mastery patterns did not differ significantly. However,

a drastically different skill mastery pattern was observed in the intermediate group,

where the mastery rate for MCH increased remarkably (49.52% for the subject prompt

and 71.00% for the cooperation prompt) and the mastery rate for ORG slightly decreased (59.05% for the subject prompt and 50.00% for the cooperation prompt). The intermediate group's high proportion of MCH mastery on the

cooperation prompt might have contributed to the overall high mastery probability for

this skill, as shown in Figure 14.

Figure 19. Proportion of masters across different proficiency levels for the subject

prompt


Figure 20. Proportion of masters across different proficiency levels for the cooperation

prompt

c. Number of mastered skills

The association between the number of mastered skills and writing proficiency

levels was also examined, with a positive correlation between the two variables assumed to be indicative of a good diagnostic model. As Figure 21 shows, the beginner group exhibited a steadily decreasing proportion of masters as the number of mastered skills increased. Although there was a slight reversal for mastery of four (8.74%) and five (10.68%)

skills, a negative association between the proportion of masters and the number of

mastered skills was apparent. The distribution of masters in the intermediate group

showed a typical bell-curve shape in which most students mastered two or three skills.

The advanced group was somewhat similar to the intermediate group in that most

students mastered two or three skills, but the advanced students showed a markedly

higher proportion of masters of four or five skills than those in the intermediate group.

The general association between the number of mastered skills and writing proficiency

levels thus supports the diagnostic power of the estimated model.


Figure 21. Number of mastered skills across different proficiency levels

Case analysis

A case analysis was conducted in order to further examine the quality of the

estimated skill profiles. Six cases were selected whose skill profiles were drastically

different in spite of similar observed scores. Table 31 presents background information

and skill profiles for these cases. The selected cases consisted of four male and two

female students who spoke a variety of native languages. They were awarded similar

observed scores ranging from 21 to 25, but mastered different numbers of skills.

Table 31

Case Profiles

Case ID   Age   Gender   Native language   Observed score   ETS score   No. of mastered skills   Skill profile   Undetermined skill
2207      16    Female   Korean            21               3           0                        00000           CON, ORG
1092      23    Male     Japanese          21               3           1                        00001           ORG
1086      33    Female   Turkish           21               3           2                        01001           None
1133      18    Male     Korean            21               3.5         3                        01101           None
2178      34    Male     Japanese          22               2           4                        01111           None
2139      38    Male     Spanish           25               4           5                        11111           None

Descriptive analysis indicated that while there was a moderately positive

association between observed scores and ETS scores, it was difficult to identify any


relationship with the skill profiles estimated in the diagnostic model. For example, the

observed score difference between Case 2207 and Case 2139 was only 4 points, but Case

2207 mastered no skills (with CON and ORG undetermined), while Case 2139 mastered

all five skills. Similarly, Case 1092 and Case 1133 had the same observed score of 21,

but their skill profiles drastically differed with regard to “00001” and “01101”.

This discrepancy raises more questions than it answers. Despite the student‟s

attempt to master CON and ORG, Case 2207 was still considered a non-master of all

skills. Conversely, Cases 1092 and 1133 were considered masters of at least one skill. If

a score report describing the discrepancy between total observed score and skill profile

was given to students, it is questionable whether it would be useful for student learning,

since students could be confused or have different interpretations of their writing skills

proficiency.

However, it is also possible that this discrepancy can be interpreted as

highlighting the need for diagnostic skill profiles. The case analysis clearly demonstrated

that students with the same observed scores did not necessarily have the same skill

profiles. Indeed, it generated many different skill profiles highlighting the various

strengths and weaknesses of students‟ ESL academic writing. If a single observed score

was provided to students, it could not really inform them about their writing strengths

and weaknesses because it masks fine-grained, specific diagnostic information. Care

should therefore be taken when a diagnostic score report is created and given to students

who exhibit strikingly different observed scores and estimated skill profiles.

Correlation between EDD Scores and TOEFL Scores

The observed scores awarded by ESL teachers using the EDD checklist were

correlated with the original TOEFL iBT independent writing scores awarded by ETS

raters for the 480 essays. The correlations between the two sets of scores were moderate,

indicating r = .61, p < .01 for the subject prompt and r = .70, p < .01 for the cooperation

prompt. The overall correlation for the 480 essays was also moderate, with r = .66, p

< .01. As was the case with the results discussed in Chapter 5, this moderate correlation

might indicate that, to some extent, the EDD checklist measures the same writing

construct that the TOEFL iBT independent writing rating scale measures. However, it is


also possible that the two measures tap different areas of the writing construct because the magnitude of the correlation was not substantially strong. Further evidence is needed to

support or reject the idea that the two measures yield convergent results.

Teacher Perceptions and Evaluations

Teacher Questionnaire Responses

Teachers' responses to the questionnaire were analyzed descriptively, focusing on their evaluations of the EDD checklist. Reactions were generally positive,

with no extremes. When asked about their overall satisfaction with using the EDD

checklist in essay assessment, one teacher reported that she liked the checklist “a little

bit”, three teachers liked it “quite a lot”, four liked it “very much” and two liked it

“extremely”. With regard to the descriptors, two teachers reported that they were “quite

clear”, six said they were “very clear” and two said they were “extremely clear”.

When redundancy was examined, eight teachers felt that the descriptors were

“not redundant” and two felt they were “a little bit redundant”. The teachers thought

highly of the usefulness of the descriptors: two considered the checklist "quite useful", six thought it was "very useful", and two thought it was "extremely useful". The checklist's

comprehensiveness and relevance to ESL academic writing were also perceived

positively. Only two teachers reported that the checklist was “a little bit comprehensive”

or “quite comprehensive” in capturing all instances of ESL academic writing, and the

remaining eight teachers reported it to be either “very comprehensive” or “extremely

comprehensive”. A similar pattern was observed with regard to the perceived relevance

of the descriptors for ESL academic writing: eight teachers reported that the EDD

descriptors were “very relevant”, and two said they were “extremely relevant”.

Teachers' reactions to the binary system used in the checklist were somewhat

heterogeneous. Four teachers reported that the EDD checklist was “a little bit conducive”

or “quite conducive” to making a binary choice, while the remaining six reported it was

"very conducive" or "extremely conducive". When asked about the number of times they

read the given essays when marking them, five teachers said “twice”, three said “three

times", and two said "more than three times". When asked about the most and least important

descriptors in developing ESL academic writing, most of the teachers felt that the


descriptors related to content fulfillment (D01-D08) and organizational effectiveness

(D09-D14) were most important. A substantial consensus was not reached among the

teachers with regard to the least important descriptor, so that result could not be reported.

Teacher Interviews

The teachers' reactions to the EDD checklist were explored in greater detail through analysis of their interview data collected during the pilot and main studies. This reading of the transcripts identified a variety of evaluation themes, including the EDD checklist's strengths and weaknesses and its diagnostic usefulness for classroom instruction and assessment. It also showed how teachers' perceptions of the EDD checklist changed

over time.

What Are the Strengths of the EDD Checklist?

The checklist's comprehensiveness was considered an obvious strength. As the following interview excerpts illustrate, several teachers acknowledged that the checklist

covered multiple aspects of ESL writing:

Researcher: What do you think about the strength of the EDD checklist?

Angelina: I would say it's very comprehensive; it takes a lot of different aspects into account when it comes to student writing and assessing student writing. I think that is a strong point.

Tom: Absolutely, there is no question about that in my mind that the amount of information, feedback, and specifics of the feedback are very positive things for a feedback model. Absolutely, I have nothing to say about the way you have delineated things.

The checklist's fine-grained, specific writing subskills also appeared to successfully guide the teachers' assessment. Both Brad and Esther commented that the

breakdown of writing skills helped them know what to look for while assessing essays:

Brad: I like it because it gives you a regular guideline that when you get an essay you need to look out for these things. It is useful for that; otherwise I may forget to check sections.

Esther: I thought that this is becoming so obvious because each one has a glaring thing that is coming out when you look at the list. You can see a lot of things are ok, but this one is missing the question, etc. What the essential element is does begin to emerge more clearly having done a number of them. I think that when using them in a repeated sense, the problem area does start to pop.


Esther commented that once she had internalized the checklist, the evaluation

criteria naturally emerged while she was marking essays, making the rating process go

smoothly (particularly with regard to basic descriptors such as capitalization):

Researcher: What do you think about the rating scale? Do you like it?

Esther: The yes/no scale? I do. The more I used it, the more I liked it. You had said in your directions to internalize it. Which, of course, you don't do after the first reading. You really only start to internalize after using it a few times. So after using it a few times, the taxonomy that you used, those items seemed to jump out of the paper more easily. Yes, some of the descriptors I really liked and found so easy to use. I actually really liked some of the basic descriptors, such as "are there capitals?" It's so easy. You know if they are there or if they are not there. It is internalized really quickly. Even things like the prepositions, even if the writing was great, but the prepositions were off.

A similar view was found in Erin's interview. She also found that the descriptors assessing mechanical linguistic knowledge were easy to use, confirming the findings regarding overall teacher confidence.

Researcher: Was the EDD checklist easy to use?

Erin: Sometimes it was. 'Spelling,' 'punctuation,' 'capital letters' are easy to be confident as a clear yes or no. 'Indentation,' etc., 'articles,' some of this grammatical stuff also was quite easy. But, again, there were some areas that are a bit challenging. Yes, like "the paragraph is connected to the rest of the essay." Statements like that are hard – it depends how you see "connected." Are you reading it through logic of ideas, or is it vocabulary connection or transitional features? Stuff like that. Yes, but I thought it was quite good.

In addition, the teachers reported that the EDD checklist made them more internally consistent raters. Brad commented that the checklist reduced the randomness in

his ratings.

Brad: What I like about this, I think my marking previously is much more erratic. It depends a lot on my personal feelings that day – if I'm in a good mood, my students will probably do better. If I'm in a bad mood, my students will probably do worse and I'll notice the errors. This at least provides some consistency and dampens down the effects.


Greg also said that the unique nature of the checklist helped him to maintain consistency:

Researcher: How about your consistency?

Greg: I thought it was very consistent. I thought there were only maybe 3 times out of 24 essays, 35 descriptors when I could say, "Hmm, I'm not sure." In that case, I wrote a little note so that you could think about it. I thought that with this it was very easy to be consistent.

In a heated debate, the teachers exhibited drastically different ideas with regard

to the effectiveness of the binary choice system. Angelina expressed her concern about

the lack of a continuum on which writing performance could be measured. She was

dubious about what a yes or no could indicate about writing competence. As she rightly

pointed out, yes does not imply absolute mastery of certain writing skills, and no does not

imply absolute non-mastery:

Researcher: Was the yes or no system easy to use?

Angelina: I think what made this so difficult is because there is no continuum. There is no medium. What does this mean exactly? Is this student competent in writing academically? Or not? There is no continuum. That is why I found that difficult. I think I would be. I think if there was a scale it would work. I think so. It was difficult because it was either a yes or a no.

Brad took a similar view, raising the issue of the lack of a scale. He reported that having to make a binary choice increased his psychological load because dichotomizing language competence into yes or no carried a huge weight. Brad suggested that a scale would be psychologically more relaxing and would make it easier to make judgments quickly; however, he also admitted that the binary option actually forced him to reread an essay and deliberate more on its quality, which he felt was much fairer than reading an essay only once, as he would if assessing on a scale. He even speculated that his criticism of the yes or no system could be based on his own lack of confidence. Indeed, Brad had

the lowest confidence level of the seven teachers who participated in the pilot study:

Researcher: What made you so not sure about your decision?

Brad: I like to think there is a little bit of flexibility. Then, the student can see what they did, that they weren't all no or yes. I think because on some of them, if you have that 1-2-3-4 psychologically with a teacher, then they feel a bit more relaxing. Like if it's, okay, I'm not quite sure, but I think this is a 2 or 3 instead of making a definite 2 or 3. For a teacher, it wasn't such a convenient system. I found the whole yes and no, I didn't particularly enjoy that. I think it would've been better to have a little bit more of a grading, even 1-5 or 1-4.

Researcher: So, you still think that you might have felt more comfortable with a 4-point or 6-point rating scale?

Brad: Yes, definitely. I think for teacher and me as well it makes it easier. If you just say yes or no, it's a huge weight. For me, it's easier to put a 4 instead of 5, or a 2 instead of a no. Maybe it had more due to about my confidence style as a teacher. I don't know.

Researcher: Then, if you had used a scale, do you think it would have taken less time?

Brad: I think I would spend less time with the scale. You would still have yes or no, but you can break it up slightly. Yes and no is really a pass or fail. But, with the scale a no could be a 1 or 2 and when I put a 2, I would feel less like I was really saying no. I guess it's psychological really. But, it's quicker to think around a 1, 2, and 4. But here, sometimes I would deliberate for a long time. There are times that it's borderline and I reread it and speculate if it's yes or no. I think I would probably do it faster with a scale. But, maybe because it's yes or no I'm going back more often to read the essay. So, maybe it's fairer. If I was doing it with the scale I might read it one time only.

Researcher: Then, how many levels do you think would be appropriate to create a scale?

Brad: 5-point is not good because I could put a 3 too many times. If it is a 4-point scale it sort of makes me come down on either side. I know it's a 2 or 3 and I'm more relaxed to put a 2 or a 3. A more unsure teacher will still have it sort of work out as a yes or no. But, if you had let me put a 3 there, I would have been in a lot of danger. I would have put a lot of 3's.

Kara's stance on the binary system was slightly different. She took a middle-of-the-road position, noting that the yes or no option was a fine system although it required a little practice. She also noted that raters tend to sit on the fence:

Researcher: What do you think about the binary choice, yes or no? Was it easy to assess the essays with this system?

Kara: I think you need to get your head into the right way of thinking about it, right? I mean there was a couple times where I would think about it and go back and change my mind. I would mark it a yes and then later when I got to another one and marked it a no, I would go back and change the other one. You realize that if it's a no here, it's a no there, but you need to be consistent. But, once you get it straight in your mind, it's not personal; it's not against the person. Either they have it or they don't. Honestly, I think it takes a little practice, but once you get into that mindset of yes or no, it's okay. Even with this it was a little challenging because you can easily fall into the 'not quite' category a lot. Then after you look at it and realize… You start to get a feel for it and you rethink it and start to think, "Maybe it wasn't as clear as it could have been, or as strong as it could have been, or maybe they didn't use transitions as much as I thought they did."

By contrast, Esther, Mark, and Greg all felt that the binary choice system was

both reliable and convenient. Contrary to what Brad believed, Esther thought that the

binary system actually increased her rating speed. She also pointed out that even if she

had been given a rating scale, she would have needed to distinguish between the middle

categories, a 2 or a 3:

Researcher: What do you think about the binary choice? Was it easy?

Esther: The yes or no aspect? It was quicker. I like that it was quicker.

Researcher: Is it user-friendly? Would you say it's friendly because it doesn't take much time?

Esther: At first I thought that it would be really hard. I know when we initially talked I say, yes or no, but when I looked at the list I thought, maybe there should be a Likert. And if the Likert were simple enough, like a 1-4, it would still be user-friendly. But, you would still be (humming and hawing) between a 2 or a 3. I didn't mind the yes or no. To be honest, I guess I didn't. I thought I would. But, once I started marking it was okay. In addition, a yes or no increased my speed. I hate to say it because it's so much easier to say yes or no. I loved that it was a yes or no because it makes it quick and I have tons of them to do. It's a quick thing. For instance, over 50%, you're good. However, I think a 4-point Likert might help. If there was a way of shrinking it and adding a scale, it might end up the same. Some of them you would be able to put a 4, but it is tough. Between a 2 and a 3 is still very tough. How would you distinguish between a 2 or a 3? You can easily distinguish between a 1 and a 4. That would be the challenge.

Tom and Greg provided deeper insight into the underlying mechanism of the

binary system. Both reported that they were able to answer a yes or a no question

confidently because the checklist had already broken down writing ability into specific

and distinct subskills. Tom went on to say that examining one aspect of writing at a time

helped him to focus, thereby enabling him to answer the descriptors more consistently:

Researcher: Or if you were given a 4-point rating scale, would you be more consistent?

Tom: No, because the questions are specific enough. If I had to deal with the 'language skill' lesson and reduce it to only 3 descriptors, then I would probably start to say I can't go yes or no, as there are too many variables. Once you delineate the variables like this, it becomes easy to say yes or no. I think the specificity is great because it tells people where to exactly focus their studies.

Along the same lines, Greg claimed that a holistic scale would have led him to

make vague assumptions about a writer's ability instead of focusing on the essay itself.

He further commented that because the EDD checklist broke down writing ability into 35

concrete descriptors, he did not struggle with uncertainty while rating the essays:

Greg: Specifically, I'm talking about the confidence rating. In every

situation, I can honestly say that my confidence rating is 100%.

Honestly. I have seen so many essays that I feel very confident in

my evaluations. Furthermore, it is your 35 points that make sure

the confidence levels are so high, everything is easily evaluated

individually.

Consider a different situation: If I had two essays that were about

the same overall, say something like...

Essay 1

Good length, good form, some good transitions

Basic vocabulary, accurate spelling and punctuation

But short uninteresting sentences, no flow

Basically a very formulaic essay, maybe even a little repetitive

Essay 2

A little short, with many spelling mistakes, mediocre form

Excellent advanced vocab, accurate collocations, generally easily

readable

Well reasoned and supported with great examples, (in other

words), a very thoughtful essay that probably indicates real

mastery and comfort with the language.

These two essays might, overall, rate about the same. Let‟s say

they both came in about a 4 or a 5 out of 6. Well, here a confidence

measurement might be important. If I give Essay 1 a 5, I might be

hesitant about that and say my confidence level was 50%, because

maybe it should have only rated a 4. The content quality was not as

good as the structural quality.

Similarly, with Essay 2, I might give it a 4 if I‟m grading harshly,

but I would not be confident about that because the overall quality

of the essay might have deserved a better score, even if there were

technical problems. If I gave it a 5, then I‟m making assumptions

about the person‟s ability, and not grading the essay itself. So I

wouldn‟t feel confident about that either. But when you break it

down into 35 parts and say, “This essay shows knowledge of


English sentence structure.” Yes or No? Well, it‟s easy to see if

they understand the general format, even when there are occasional

mistakes. Yes or No, in my mind, isn‟t complicated by maybes. My

confidence will always be extremely high.

So.... is it going to be a problem that my confidence levels are so

high? Would you like me to reconsider these?

What Are the Weaknesses of the EDD Checklist?

Most teachers considered the checklist's length to be a serious problem. Brad commented that the checklist was a time-consuming and ineffective way to assess an

essay:

Researcher: How did you like the checklist?

Brad: I was, I like, in terms of like, I think I found it… it's kind of a time consuming way to look for an essay. So yeah, for that reason, I found it a little bit ineffective.

Esther also remarked on the length of the checklist, though she admitted that she

was not precise in making yes or no judgments for certain descriptors. If issues with an

evaluation criterion did not emerge automatically, she assumed that a student had met

that criterion:

Esther: In general, I would say, "I like the checklist," but there are areas

that are a bit confusing for me. At first I thought it was too long.

It‟s a lot. I found myself being lazy with some of them because I

thought, “Well, it didn‟t jump out at me, so it is fine.” You might

be losing out in those areas. I‟m fully fessing up to it since you are

testing it, but I‟m letting you know. I thought that was there, but I

didn‟t go back because it was yes most of the time. I didn‟t have to

be that precise, so I did find myself (not remembering). If any

pronouns didn‟t match up with their pronouns, it did jump out.

Tom, on the other hand, thought that the lengthy process was an opportunity to

read an essay more thoroughly. This view is congruent with what teachers reported on

the questionnaire: while the checklist is time-consuming, they appreciated its

comprehensiveness:

Tom: I found the whole process of 35 questions to be very easy. It's a

lengthy process, that‟s the real problem. But, the fact that it‟s

lengthy means that we are taking the time to look at it in a detailed

way.


Another problem with the EDD checklist was associated with subjectivity. As

with other existing rating scales, the checklist was not free from the perils of subjective

judgment. Angelina questioned the meanings of the words “sophisticated” and “few”,

pointing out that a short essay is more likely to have few errors than a long essay, and

thus it is extremely difficult to define “few” without taking essay length into account:

Researcher: How is your general evaluation?

Angelina: When you start marking it, you see how some parts are not manageable. Certain words, what is considered "sophisticated" vocabulary? I think that becomes the question and therefore that leaves for interpretation and some people would interpret it differently than I would. So, that's what I thought. With "few," it implies frequency. But, how much is "few"? That might be where the confusion and difficulty lies in trying to judge. What is "few," what is yes, what is no?

Researcher: Actually, one teacher tried to count all the errors.

Angelina: I was actually thinking of that! I think that is a very systematic kind of approach. I think when you do it mathematically, it almost becomes very reliable. "It is a mathematical equation and this is how I use it." I was thinking about doing that, but there were so many other factors involved.

Researcher: Right! The problem with that is it really depends on the length of an essay. What if a student wrote a long essay?

Angelina: Exactly! So, it all goes back to the length. In that case, they will be penalized unfairly. It doesn't accurately reflect what the student has done. I felt that, too.

Esther's concern was somewhat different. She claimed that the adjective "clear"

was too vague to determine the quality of a thesis statement. However, despite this

limitation, she acknowledged that the checklist did not include too many subjective

indicators compared to other existing rating scales:

Esther: For some of these descriptors, the subjective things I found tough.

Like, the thesis statement, almost everything I marked said, “Yes,

there was a clear thesis statement.” But, “Was it a good thesis

statement?” “Absolutely not.” But, “Do I know whether they agree

or disagree?” “Yes.” But, that isn‟t a thesis. It‟s an answer to a

question, but I put yes because I wasn‟t sure. … But, subjective

indicators weren‟t too many in there compared to other scales.

There weren‟t too many in yours that I was confused with. The

ones that I would have been confused with were like, the thesis.

But, probably because “clear” and “good” are different for me.

But, some of them were okay, such as “sophisticated” and

“advanced” was subjective. But, I did know what you meant by


that.

The issue of fairness was raised several times during the interviews. Both Brad

and Angelina expressed their concerns about whether student writing ability can be fairly

measured without taking essay length into consideration. As Angelina noted earlier, Brad

also felt that longer essays are generally judged more harshly because they are more

likely to contain mistakes. He also commented that a well-written essay was sometimes

scored worse than a poorly-written one because an advanced writer's risk-taking

strategies resulted in a loss of points stemming from additional mistakes. Along those

same lines, Angelina said that a short essay could be harder to judge because it might not

exhibit enough evidence to meet the evaluation criteria:

Researcher: What are other weaknesses of this checklist?

Brad: One of the weaknesses, (we discussed on the way here), I think there needs to be something about length of the piece. Is it a suitable length? Some of the short ones may not have the mistakes, but the longer ones are judged more harshly as they have more chance to making a mistake. But, maybe they are more of a correct length. Also, as some essays were much better than other essays, but came out worse because they made more mistakes. However, they are trying harder and trying to use interesting, more expressive language, but in doing so they lose marks on verb tense.

Angelina: I think, for example, if they don't employ whatever descriptor is there. For example, length, if it's too short to make a judgment on. I know that for #14, 'transitions,' if they didn't use it, you asked us not to give them a mark. But, if they used it once, is it appropriate or inappropriate? Have they employed 'transition devices'?

Researcher: Yes, I know what you mean.

Angelina: I thought it was easier and I felt more comfortable, but still difficult when you can't see evidence of the descriptor.

An in-depth discussion about fairness occurred in Angelina's second interview,

when she correctly argued that the EDD checklist was biased in favour of essays that do

not display risk-taking strategies:

Researcher: Do you think the EDD checklist fairly assess student writing ability?

Angelina: For the binary choice, I thought, no. I thought it was difficult. I thought the test-taker was unfairly penalized or rewarded because they didn't employ a certain descriptor. If they didn't use a writing device, 'collocations,' 'transition devices,' etc. Even verb tenses were difficult. For example, it says the "verb tenses were used appropriately." If it was all in the present tense, then yes, sure, but if the student was using a variety of difference tenses, like, "When I was a child…" and used a flashback. You know, some anecdotal story about the past and they made an error… But I wasn't sure. The issue was that this student didn't make any errors, but only used present tense, whereas this student made errors, but used a variety of tenses and made errors. Obviously this person gets a no, but this person gets rewarded by sticking to just basic, present tense. Even the pronouns in reference, some are just not used. I thought some test-takers were penalized, whereas others weren't because they just didn't employ them. That's why I found it difficult to just say yes or no.

Angelina went on to suggest that differential weighting be placed on the

descriptors, commenting that the descriptors differ in relative importance, so that even if two essays have the same number of descriptors marked correct, those two students

may have drastically different writing ability. She also felt that a poor essay could be

awarded a high score if simple, mechanical descriptors (such as punctuation and spelling)

were correct while a good essay might be awarded a low score by getting those

descriptors wrong. This point is directly related to the need for diagnostic skill profiles.

As discussed in the case analysis, student writing skill profiles can be drastically different

despite similar observed scores. This indicates that a single observed score could provide

an inaccurate estimate of a student's writing proficiency because it masks specific information about that student's strengths and weaknesses:

Angelina: I thought the first few questions were probably the most important.

Of course, grammar and spelling are important. Grammar is all

about trying to make a persuasive argument. However, if the

student made spelling errors and it didn‟t obscure what they were

trying to say – I don‟t think that it is that important. I do think

there is a gradient in terms of these descriptors. I think that

definitely the organization, intro, body, conclusion, supporting

ideas and examples are important. Spelling and punctuation are not

as important. That is what I think. It is interesting because I felt

that some students were getting the same number of yes and no

answers. But, I felt it sort of unfair. Just because they can spell

correctly….I thought, “Oh my.” I think certain descriptors should

be weighted more heavily than others to better distinguish the

writer‟s overall writing confidence.


Another limitation of the EDD checklist was reported by Greg and Brad, who commented that some aspects of evaluation could not be captured by the EDD checklist. Greg called this the overall impression, noting that simply meeting all the

criteria in the checklist did not necessarily result in a good essay. This point echoes the

holistic claim that a score for the whole is not equal to the sum of separate scores for the

parts (Goulden, 1992). Greg also commented that the most effective feedback method is

to focus on a student's single biggest problem instead of focusing on all of them:

Greg: I prefer this to the very general TOEFL rating scale, but I still feel

the need for an overall rating score. As I said in our first meeting,

they can have yes, yes, yes, yes. They do things very well, but it is

still a very basic, simple, boring essay. So, on this paper, it looks

like a strong essay, but if you say the overall score is 4 out of 6 and

they have all yes, then it should be perfect. But, maybe not? There

is still something that this very detailed analysis can‟t get. That is

the overall impression. I really wanted to see one more line – the

overall impression, or overall score. Students will then know that it

is their biggest problem. Most of these students have so many

problems. Of course, they are going to try to learn and fix

everything. But, students ask me all the time how they could make

their essay stronger. It is a matter of the „biggest‟ problems being

different.

Like Greg, Brad advocated the need to evaluate the “general feel” of an essay,

arguing that there must be evaluation criteria, no matter how arbitrary, by which teachers

can express their general impressions of an essay. He also suggested that the EDD

checklist should incorporate this kind of human input, although how such input would be operationalized and statistically justified remains to be worked out. Greg's and Brad's comments are congruent with White's

(1985) holistic view that writing is a unified and central human activity rather than

segments split into detached activities:

Brad: I think it also has to be a general feel. It's a little bit arbitrary maybe, but how a teacher feels the piece is on its own; as separate from this. Generally, a C or B or A. Then sort of see if the numbers correlate to the grade. But, I think that the human aspect needs to be taken in as well. How somebody is reading it. At the moment we are breaking it down to the nitty-gritty and that's kind of good. It does reflect a lot about the writing style, but I also think it's good to have that human input saying, "Generally this feels like a C, or generally this feels like a B".

Researcher: For example, how can you put a general humanistic factor in a score?

Brad: I don't know. I think a lot of it comes through just reading it. If you read a lot of sample essays put together by some sort of organization ranking them as 'A essays,' 'B essays'. So, when reading it, you can figure out the 'C Bracket,' 'D Bracket'. A lot of that comes from your sense from reading it. From my first impression this will be a C, but let me check the breakdown. Maybe they work in harmony. It's not statistically justified; a lot of it is just instinct.

Researcher: Yes, I agree. A few teachers also mentioned that it kind of misses first impressions of writing.

Brad: Yes, we can be too analytical and try to break down every point. But, at the same time I do think this can be valuable information for the students to have as well. But, it would be nice to have that "Generally how do you feel about it?" But, I have no idea how you put that into your scale.

How Do Teachers Evaluate the Diagnostic Function of the EDD Checklist?

When the ways in which teachers perceived the diagnostic function of the EDD

checklist were examined, Greg reported that the EDD descriptors were still too coarse-grained to provide detailed feedback, and suggested that more fine-grained

descriptors should be identified:

Greg For example, with punctuation, there are so many kinds of mistakes

that students would make. I read this and I like your checklist, but I

always try to keep in mind that we are writing this to give students

feedback. It is for the student. So, I always think if we can give

more information, it will help them more. For example, if we just

say, “Uses punctuation well? No”, they will ask, “Why, what is the

problem?” Many students use a comma instead of „and,‟ tick that.

You could tell them to be careful as sometimes they do it right,

sometimes they don‟t. Sometimes people use quotation marks, but

it‟s more advanced. Sometimes, in the essay, the punctuation is

perfect, but they need more. Right? Because they missed some

places.

Greg further cautioned that the amount of feedback provided to students must be

carefully determined; for example, not all feedback should necessarily be provided at

once, and different feedback treatment should be available for students at different

proficiency levels:

Greg Different kinds of feedback are important for different students. So,

if it‟s a very mediocre essay, it has so many problems. So, what do


you do? Obviously, you want to be honest. In the class, they don‟t

need to know everything. I tell them to focus on structure, commas,

spelling. However, I don‟t teach tone explicitly. I give lots of

examples as to what is good, but unless it‟s something that really

stands out, I don‟t fix that in the beginning. Also, it takes a long

time to teach vocabulary. If there is a very serious problem, I‟ll

write a note. Information about vocabulary is too much. But, for

the very sophisticated essays, then they will be strong with some of

this, so let‟s fix your punctuation or vocabulary problems. If you

feel the need to use idioms, then let‟s make sure you use it right.

So, not all of this is needed at the same time. When I am talking

about an essay, I usually pick the two biggest problems. I hand

them back the essay, and ask them to make edits and give it back to

me.

This is something for teachers to keep in mind and I think that most

teachers will do this automatically. But, sometimes there are just

too many no answers. This won‟t help the student, no. But, if you

use this and you can see the level of the students and choose

particular problems, then it will be very helpful.

Similarly, Kara cautioned that correcting all errors could frustrate and

demoralize students, even if such correction was ultimately necessary to improve their

writing skills:

Researcher    How would you use the information that gathered from the EDD checklist?

Kara    … I think this is a good balance from the student's point of view. I'm sure they would be horrified to see all of that. Perhaps you give it to them one chunk at a time, but I think ideally it has all the points that they need to think about. Probably I don't give them enough of this kind of thing. Probably not. This just frustrates the students and demoralizes them. I don't correct everything and I figure there is a limit to what they can take in. I will try to focus on one or two of the biggest problems rather than trying to correct them all. Thinking of it this way is going to get you to where you want to go.

Greg and Kara‟s points are well supported by existing ESL writing research. For

example, Hughey, Wormuth, Hartfiel, and Jacobs (1983) argued that “since an attempt to

teach all of them [the intricacies of ESL writing], along with the other important

processes of writing would overwhelm and discourage writers, ESL teachers need to

emphasize the structures that most affect ESL writers‟ abilities to communicate

effectively in written English" (p. 121). However, the opposite position must also be acknowledged: Ferris (2003) and Hedgcock and Lefkowitz (1994) found that learners of ESL writing appreciate teacher feedback on all aspects of their writing, including content, organization, grammar, and mechanics.

Ann provided some insightful comments suggesting that useful diagnostic

feedback should acknowledge a learner‟s effort, improvement, and progression. She

emphasized the importance of teacher comments and encouragement:

Researcher    Do you think that the EDD checklist provides useful diagnostic information about the strengths and weaknesses of students' ESL academic writing?

Ann    Yes, but not totally – sometimes things are not as simple as a yes or no. Also, acknowledgement is needed for effort, improvement, progress, validation etc. Teacher comments are possibly more important. Encouragement is a huge motivator.

How Do Teachers Perceive the Diagnostic Function of the EDD Checklist for Classroom

Instruction and Assessment?

Both positive and negative reactions were reported by teachers with regard to the

usefulness of the checklist as a diagnostic classroom assessment tool. Tom thought that

having this sort of diagnostic assessment tool would be a great benefit for teachers at

both the class and individual levels. He even suggested that a class should be designed to focus not only on the areas in which students experience difficulty, but also on the areas in which they require individual attention. He also pointed out that the ability to provide specific

diagnostic information would be particularly useful for motivated students:

Researcher    Do you think the EDD checklist provides useful diagnostic information?

Tom    Yes, I would think so. Just thinking back to my writing classes, I found one of the things that were difficult to teach was specific exercises for individuals, where one person would need a lot of help on spelling, another would need help on punctuation. A German student would litter their essay with commas and it's difficult to break them of this habit due to the similarity of structure. I really had to work with each student individually on that. That took a lot of one-on-one style teaching for this. Teacher to student as opposed to teacher to students. I found that this sort of teaching was almost impossible when it came to these types of details.

    But, if I had a list for every student like this, then I could somehow put the information (the data) in a computer and could see a bright green line down the page by each student's name. That way I could see that almost all of the students had difficulty with this particular topic and I could plan a study for 15 minutes in the class, "just on the use of commas." After that we could activate the exercise into something more free following a regular ESL lesson plan, right? I think this would definitely be useful because the information in here would be evidence that each student would benefit from that kind of a study lesson.

Researcher    That's a good point! Glad to hear that.

Tom    A specific study lesson to focus on before doing the general activity. Having some kind of reference to provide a common denominator would be wonderful.

Researcher    Absolutely!

Tom    For motivated students it's fantastic. People are always looking for feedback, particularly adults. This is what we are talking about – adults. They want yes or no; this is the way we work. Yes or no? If it's no, go work on it.

A slightly different view was provided by Greg, who cautioned that such a large amount of diagnostic information could be ineffective for less proficient students, and that diagnostic information should be used with care because student performance does not remain static:

Researcher    What do you think about using this checklist for classroom-based instruction?

Greg    It would be okay, it would be okay. It's not my style, that's why I hesitate. I much prefer to write comments. This is very good feedback, it's very organized. It will help the students, it will. But, for me, the comments here, "Work on it!" This is the best thing; this is the good record.

Researcher    Do you think that providing this much information to every student might be overwhelming?

Greg    I think so, especially for low-level students. On a daily or weekly basis, it might be too much. But, sometimes on a Saturday, once a month, my school has a practice TOEFL exam. In this case, if they were really trying to concentrate on writing a proper essay, then I would focus on giving them a lot of very detailed feedback.

Researcher    So, on a monthly basis?

Greg    I think so. I think so for all of it. But, on a more frequent basis, it's up to the teacher. It's up to the teacher, the level, what you discuss in class. If you are concentrating on form, maybe you want to give this part (point out grammar descriptors).

Researcher    So, you wouldn't use this for a daily basis or weekly basis instruction.

Greg    I don't think so because the same student can write a very good essay one day and the next day a terrible essay. If I say they are disorganized, weak thesis, bad example, it's just one time and they probably know.

The need for longitudinal feedback was also echoed in the interview with Kara.

assessment, and expressed her concern about the potential for misrepresentation of the

assessment results in a classroom context:

Researcher    Do you think the diagnostic information generated by the EDD checklist would be useful for classroom instruction?

Kara    I think it may be too limited. I think it is in improvement often when we mark we don't look at all of these points. It is very useful in this perspective. I would be reluctant to give a student this not wanting them to think that based on one thing they can do these things. I would think that is an unfair assumption based on this data.

Researcher    This is an assessment right?

Kara    You're right.

Researcher    You think this might be unfair in a classroom context.

Kara    Yes, in a standardized context, yes, then the framework is a bit different. In a classroom context, which I'm most familiar with, I prefer more of a gentle (not as bold) wording.

Researcher    This is kind of final.

Esther and Brad also commented that a diagnostic report should be carefully designed to take into consideration the fact that students do not make certain mistakes all the time:

Esther    Since this is meant to be a feedback form, it's hard to know whether someone is doing it all the time.

Brad    I think it gives me a little bit of a safety net. That way I'm not saying "I know this". It's a bit to cover myself. Sometimes I think on a lot of these essays it's not as though they are making the same mistake all the time, right? They maybe make the odd mistake and then they will do it correctly. So, to suddenly tick no that spelling is bad or they don't use their verb tenses correctly, a part of me thinks that this will discourage the student a little bit, seeing that written there.

How Do the Teachers‟ Perceptions of the EDD Checklist Change Over Time?

Teachers reported that they were more confident the second time they used the

checklist. In her second interview, Angelina commented that she fully understood that


each descriptor was independent of the others, which lessened the psychological pressure

she felt when providing a no response to students:

Researcher    How did you like the EDD checklist this time?

Angelina    I did still find it difficult, but less so maybe than the first time around. I tried to step back a little bit. I think initially I was really worried that "it's not fair" and that was weighing heavily on me, but I think the second time around it was a little bit easier. I sort of have an understanding of where the focus is for each one, so I don't feel as guilty when I put no. I know that just because they didn't satisfy this point, they can still satisfy this one, as they are now independent from one another. There was so much overlap before it wasn't fair because if I put yes in one spot and they would get all yes, or if I put no, then it was no all the way down. This time I felt less pressure because maybe it was a no, but the next one is independent so I'll put a yes there.

Similarly, Sarah was more confident in her ratings in the second assessment

round as she became more familiar with the checklist:

Sarah The first one I did I was unsure. I think I was a bit weaker in the

percentages. By the second one I got more convinced. I was more

familiar with the tool and was more confident that “yes, I‟m seeing

that.” I did feel that it made me focus on specific features of the

writing that I maybe didn‟t focus on exclusively before looking at

it from a holistic perspective.

Overall, teacher responses in the interview suggested that the EDD checklist had

both strengths and weaknesses with regard to assessing student writing performance.

While teachers approved of the comprehensiveness and specificity of the descriptors,

they also pointed out certain problems with subjectivity and fairness. The length of the

checklist and lack of human input were also seen as weaknesses. These weaknesses could

be counter-arguments for the use of the EDD checklist. Teachers had mixed opinions

about the use of the binary choice system: some criticized the lack of a continuum on

which writing performance could be measured, while others acknowledged that

answering yes or no to specific, fine-grained descriptors helped them to be more

internally consistent and focused as raters. While teachers acknowledged that the use of

the checklist could have a positive impact on classroom instruction, they cautioned that

an appropriate amount of diagnostic feedback should be given to students when the

checklist is used because an overwhelming amount of negative feedback (as indicated by

an excessive number of no responses) could frustrate and demoralize them. Teachers also


highlighted the need for longitudinal feedback in order to accurately track the progress of student writing performance over time.

Summary

This chapter has discussed three validity assumptions centered on the primary

evaluation of the EDD checklist. Of particular importance was the extent to which

writing skill profiles generated using the EDD checklist provided useful and sufficient

diagnostic information about students‟ strengths and weaknesses in ESL academic

writing. The study‟s overall findings suggested that the estimated diagnostic model is

stable and reliable; although approximately 34% of the descriptors exhibited poor

diagnostic power, the estimated diagnostic model had high discriminant function, with

students in the flat categories accounting for only a portion of the total. The moderate to

slightly high correlation between EDD scores and TOEFL scores also provided

convergent evidence for use of the EDD checklist; however, as discussed in Chapter 5,

this criterion-related validity claim should be interpreted carefully because the two rating

rubrics were developed for different test purposes. Overall teacher evaluation further

justified the validity claims for the use of the checklist. While teachers cautioned that an

appropriate amount of diagnostic feedback should be given to students, they also

acknowledged that the use of the checklist could have a positive impact on classroom

instruction. The next chapter synthesizes the research findings derived from a variety of

assumptions to create a validity narrative.


CHAPTER 7

SYNTHESIS

Introduction

This study sought to make multiple validity inferences in order to argue that

scores derived from the EDD checklist can be used to diagnose the domain of writing

skills required in an ESL academic context. Each inference in the interpretive argument

prompted a particular investigation of the checklist‟s development and evaluation

procedures. Underlying inferences were investigated by judging the following

assumptions addressing different aspects of validity claims and requiring different types

of evidence:

The empirically-derived diagnostic descriptors that make up the EDD checklist

are relevant to the construct of ESL academic writing.

The scores derived from the EDD checklist are generalizable across different

teachers and essay prompts.

Performance on the EDD checklist is related to performance on other measures

of ESL academic writing.

The EDD checklist provides a useful diagnostic skill profile for ESL academic

writing.

The EDD checklist helps teachers make appropriate diagnostic decisions and has

the potential to positively impact teaching and learning ESL academic writing.

This chapter synthesizes the inferences in order to create a validity narrative that

captures the evolving evaluations and interpretations of the checklist‟s use. The five

validity assumptions that formed the central research questions are revisited and critically

reevaluated in relation to the overarching validity argument and its potential consequences. The empirical data and theoretical analyses that served as the backing for inferences in the interpretive argument are also discussed in light of evidentiary reasoning. Finally, implications for future research on the checklist's applications in ESL academic writing are discussed.

Validity Assumptions Revisited

Validity Assumption 1: The empirically-derived diagnostic descriptors that make up the

EDD checklist are relevant to the construct of ESL academic writing.


The central focus of skills diagnostic assessment is the extent to which the skills

being assessed reflect knowledge, processes, and strategies consistent with the test

construct in the target domain. It was thus critical to empirically identify assessment

criteria operationalizing the ESL writing skills required in an academic context.

Theoretical analysis was then used to justify and confirm these assessment criteria.

Considering that the construct of ESL writing is multi-faceted and complicated, it was

also important to identify fine-grained and separable assessment criteria so that it would

be possible to implement skills diagnosis modeling. If the ESL writing construct could be

reliably and validly deconstructed and operationalized, then valid inferences could be

made about students‟ ESL writing ability.

To this end, multiple empirical sources were sought from diverse perspectives.

Not only were instances of writing performance observed in real student writing samples, but assessment criteria were also elicited from teachers' think-aloud verbalizations on

ESL essays. As discussed in Chapter 4, these verbal accounts provided rich descriptions

of ESL academic writing subskills and textual features, resulting in 39 descriptors. These

descriptors were empirically-derived, concrete, fine-grained, and consistent with

theoretical accounts of ESL academic writing, addressing all aspects of writing skills (i.e.,

content fulfillment, organizational effectiveness, grammatical knowledge, vocabulary use,

and mechanics). The substantive review and refinement process performed by the ESL

academic writing experts further confirmed the soundness of the descriptors, resulting in

the final 35 descriptors that make up the EDD checklist. That the greatest number of descriptors was associated with grammatical knowledge was also reasonable, considering that students greatly desire feedback on grammatical problems in their writing (Cohen &

Cavalcanti, 1990; Ferris, 1995; Hedgcock & Lefkowitz, 1994; Leki, 1991).

A series of exploratory and confirmatory statistical analyses were used to further

characterize the latent dimensional structure of ESL academic writing. Various facets of

writing ability were conceptualized and organized, suggesting that writing competence

does not lie on a single unitary continuum. These findings were consistent with

theoretical accounts of ESL writing, defining writing ability as a constellation of multiple

subskills. As Biber (1988) noted:


Linguistic variation in any language is too complex to be analyzed in terms of

any single dimension. The simple fact that such a large number of distinctions

have been proposed by researchers indicates that no single dimension is adequate

in itself. In addition to the distinctions… such as restricted versus elaborated and

formal versus informal, linguistic features vary across age, sex, social class,

occupation, social role, politeness, purpose, topic etc. From a theoretical point of

view, we thus have every expectation that the description of linguistic variation

in a given language will be multidimensional. (p. 22)

This multidimensional view of ESL writing was also consistent with discussions in ESL

writing literature. As noted in Chapter 2, despite different orientations, theoretical

accounts, discourse analysis, and rater perceptions and rating scales provided compelling

bases upon which to define and assess ESL writing abilities.

Of particular interest were the ways in which the EDD checklist differed from

other assessment methods. Although it is similar to most analytic rating scales, the

checklist was able to overcome a number of the limitations of those other scales. For

example, while the checklist conceptualizes ESL academic writing competence in the

same way as most other analytic rating scales (by focusing on such major assessment

criteria as content fulfillment, organizational effectiveness, grammatical knowledge,

vocabulary use, and mechanics), the fine-grained descriptors in the checklist assess

specific writing features that maximize the diagnostic feedback from which students can

benefit. In order to assess grammatical knowledge in an essay, for instance, the checklist

can provide a precise description of the global and local grammatical aspects associated

with syntactic structure, errors of agreement, tense, number, articles, pronouns,

prepositions, and so on.

The evidence gathered throughout the EDD checklist‟s development procedure

suggests that the checklist accurately represents the multidimensional construct of ESL

academic writing. The teachers‟ think-aloud verbal data were a valuable empirical source

that substantiated the construct being measured and provided concrete rationales and

evidence justifying the selected assessment criteria. The theoretical analysis further

confirmed that the checklist, although empirically derived, was not atheoretical. This approach

was particularly well-aligned with the concepts of diagnostic assessment because it

enabled teachers to be active generators of assessment criteria and interpreters of

assessment outcomes rather than passive recipients. In a diagnostic assessment framework,

an ongoing dialogue with assessment users and developers can help to create a consensus


about the elements to be evaluated, and can help to keep diverse educational clientele

better informed about the assessment outcomes.

Validity Assumption 2: The scores derived from the EDD checklist are generalizable

across different teachers and essay prompts.

The second assumption examined the potential impact of the various sources of

variability associated with sampling conditions of observation. Teacher and essay prompt

facets were the primary sources of variability suspected to prevent accurate inferences

about student ESL academic writing ability. If the student writing scores obtained from a

sample of teachers on a sample of essay prompts cannot be generalized beyond that

specific set of teachers and essay prompts, it will undermine the interpretive argument.

Three approaches were used to explore this suspected variability: (a) teacher internal

consistency, (b) teacher agreement, and (c) descriptor-teacher/essay prompt interaction.

In a Many-faceted Rasch Model (MFRM) analysis, the teacher fit statistics

indicated that all of the teachers exhibited accurate rating patterns when using the EDD

checklist and none exhibited random, halo/central, or extreme rating effects. A bias

analysis further suggested that most teachers were neither positively nor negatively

biased toward any particular descriptors, nor were the essay prompts biased for or against

any descriptors. These results suggest that teachers are able to use the EDD checklist in

an internally consistent manner and that the EDD descriptors function consistently across

different teachers and essay prompts.
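For reference, a dichotomous many-facet Rasch formulation with student, descriptor, teacher, and prompt facets can be sketched as follows; this is a generic illustration of the model family rather than a reproduction of the exact specification reported in Chapter 5:

\[
\log\!\left(\frac{P_{ndjp}}{1-P_{ndjp}}\right) = B_{n} - D_{d} - C_{j} - A_{p}
\]

where \(P_{ndjp}\) is the probability that student \(n\) receives a yes on descriptor \(d\) from teacher \(j\) for essay prompt \(p\), \(B_{n}\) is the student's writing ability, \(D_{d}\) the descriptor difficulty, \(C_{j}\) the teacher's severity, and \(A_{p}\) the prompt difficulty. Interaction (bias) terms added to this additive structure are what the bias analysis described above examines.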

However, a mixed result was found when agreement rates among teachers were investigated. While the correlation between a single rater and the rest of the raters (SR/ROR) indicated that each teacher might have rank-ordered students in a manner similar to that of the other teachers, the exact agreement statistics showed that each teacher provided exactly the same ratings as another teacher under identical circumstances only a low to moderate percentage of the time. In addition, when teacher agreement rates were

examined at the descriptor level, teachers showed high agreement (> 85%) on descriptors

assessing discrete grammar knowledge, but low agreement (< 70%) on descriptors

assessing global content skills. These results indicate that it might be difficult to claim

that a particular teacher‟s assessment of student writing performance is generalizable


beyond that specific teacher. However, as discussed in Chapter 5, the reported reliability

indices must be interpreted carefully because the teachers were not well-trained certified

professional writing assessment raters, and dichotomous ratings (rather than polytomous

ratings) were used in the assessment. The research findings of Barkaoui (2008) and

Knoch (2007) also support the idea that the subjective nature of the task renders it

difficult to achieve high inter-rater agreement in ESL writing assessment.
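To make the agreement statistics concrete, the following minimal sketch shows how exact agreement between two teachers' dichotomous ratings can be computed, first per descriptor and then overall; the descriptor labels and ratings are hypothetical, not data from this study.

```python
# Hypothetical yes/no (1/0) ratings from two teachers on the same five essays,
# organized per descriptor: {descriptor: (teacher_A ratings, teacher_B ratings)}
ratings = {
    "discrete grammar descriptor": ([1, 1, 1, 0, 1], [1, 1, 1, 0, 1]),
    "global content descriptor":   ([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]),
}

def exact_agreement(a, b):
    """Proportion of essays on which the two teachers gave identical ratings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Per-descriptor agreement (cf. > 85% for discrete grammar descriptors and
# < 70% for global content descriptors, as reported above)
for descriptor, (teacher_a, teacher_b) in ratings.items():
    print(descriptor, round(exact_agreement(teacher_a, teacher_b), 2))

# Pooled agreement across all descriptors and essays
all_a = [x for a, _ in ratings.values() for x in a]
all_b = [y for _, b in ratings.values() for y in b]
print("overall", round(exact_agreement(all_a, all_b), 2))
```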

Overall, the findings for the second assumption present a somewhat mixed picture

of the random errors associated with teachers and essay prompts. Unlike traditional

fixed-response assessments (such as multiple-choice tests), the presence of raters and

tasks in performance assessment adds a new dimension of interaction, making it even

more crucial to monitor reliability and validity. A greater number of raters and tasks in an

assessment would be desirable in order to improve consistency from one performance

sample to another, but this is not always possible due to limited resources. The problem

becomes more serious when one considers that the EDD checklist was developed to be

used for diagnostic assessment purposes in a small-scale classroom, where relatively few

resources are allocated. One way of resolving this problem would be to standardize essay

prompts by providing clear specifications. Another way would be to train teachers on a

continuous basis, since effective training would help teachers to use the checklist

consistently and reliably. Care must be taken, however, because high inter-teacher

reliability could counter the contextual validity gained from using the EDD checklist.

The checklist was developed to be used in classroom assessment, which is typically

provided by just one teacher. High inter-teacher reliability would not be crucial in such

cases, and could even threaten the valid use of the checklist.

Validity Assumption 3: Performance on the EDD checklist is related to performance on

other measures of ESL academic writing.

The third assumption is related to concurrent or criterion-related validity and

examined the extent to which the scores awarded using the EDD checklist correlated

with other measures of ESL academic writing. This assumption did not necessarily seek

convergent evidence among different measures of ESL academic writing because a single

measure should not automatically be the norm against which others are compared. The


selected measure was the TOEFL independent rating scale, and the correlation between

the two measures was r = .77 in the pilot study and r = .66 in the main study. This

moderate to slightly strong association indicated that a student who received a high score

in EDD assessment would likely receive a high score in TOEFL assessment. It also

suggested that, to some extent, the EDD checklist measures the same ESL academic

writing construct that the TOEFL rating scale measures.
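The coefficients reported here are Pearson correlations between total EDD checklist scores and TOEFL essay scores. A minimal sketch of the computation is given below; the score arrays are hypothetical stand-ins, not the study's data.

```python
import numpy as np

# Hypothetical totals: EDD checklist scores (0-35) and TOEFL essay scores (0-6)
edd_totals   = np.array([28, 15, 31, 22, 9, 25, 18, 33])
toefl_scores = np.array([5, 3, 5, 4, 2, 4, 3, 6])

# Pearson product-moment correlation between the two sets of scores
r = np.corrcoef(edd_totals, toefl_scores)[0, 1]
print(f"r = {r:.2f}")  # values in the .66-.77 range read as moderate to fairly strong
```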

However, different interpretations are also possible. The fact that the magnitude

of the correlation was not very strong suggests that the two measures approached the

ESL academic writing construct in different ways. As White (1985) rightly argued, the

holistic and analytic rating systems upon which the TOEFL rating scale and EDD

checklist are based rely on fundamentally different philosophies. White considered the

act of writing to be a whole human activity that cannot be broken into separate segments.

Goulden (1992) defined the holistic assessment approach in a similar way; that is, a score

for the whole is not equal to the sum of separate scores for the parts. The moderate

association between the two measures could therefore reflect the fundamental differences

underlying the holistic and analytic rating systems. Indeed, as discussed in Chapter 6,

meeting all of the assessment criteria in the EDD checklist does not necessarily result in

a good essay because the checklist does not explicitly measure such elements as overall

impression.

The purposes for which the two measures were developed must also be taken

into account. The EDD checklist diagnoses students‟ ESL writing ability in an academic

context and guides and monitors their writing progress, while the TOEFL rating scale

places students into appropriate ESL writing proficiency levels in order to facilitate

school admission decisions. Therefore, divergent evidence might be more informative if

it were used to highlight these different assessment purposes. If this holds true, it is reasonable to expect that the two assessments required different ESL writing abilities and tapped into different aspects of the ESL writing construct.

Another point that demands attention is the different size of the correlation

coefficients. When the TOEFL scores were correlated with the writing proficiency

measures estimated by an MFRM analysis, a greater association was found than with the

observed scores. This might be because the two sets of scores (i.e., estimated and


observed) were derived from fundamentally different measurement theories, namely classical test theory and item response theory. The MFRM analysis estimated latent

writing ability free from the severity of a particular teacher and free from the difficulty of

an essay prompt and a descriptor, so that estimated writing measures might have more

accurately reflected students‟ true writing ability than the observed scores might have.

On the other hand, the total observed score did not take such assessment conditions into

account, resulting in possibly biased scores. The absence of such score adjustment might therefore explain the lower correlation with the observed scores. The different number

of teachers involved in the assessment must also be considered; while estimated scores

were computed using ratings from two teachers, observed scores were derived using

ratings from a single teacher. This disparity might have affected the size of the correlation

coefficient.

Overall, the third assumption was somewhat difficult to judge due to its

methodological limitations. The most accurate association between different measures would be found if the same teachers generated two sets of scores by participating in both assessments. This study was not able to meet this criterion; teachers participated in

only one assessment in which they used only the EDD checklist to assess essays, and

their scores were compared with the original TOEFL scores awarded by ETS raters. This

methodological limitation might have caused confounding results that ultimately

threatened the score interpretations. If time and resources are available, an experimental

study is recommended in order to better understand the ways in which writing

performance assessments made using the EDD checklist relate to those made using other

measures. Comparing EDD scores with teachers‟ classroom assessments might also be

interesting.

Validity Assumption 4: The EDD checklist provides a useful diagnostic skill profile for

ESL academic writing.

The central principle of diagnostic assessment is that it formatively assesses fine-

grained knowledge processes and structures in a test domain, thus providing detailed

information about students' understanding of the test materials. The fourth assumption addressed this principle, examining the extent to which the writing skill profiles generated


using the EDD checklist provided useful and sufficient diagnostic information about

students‟ strengths and weaknesses in ESL academic writing. The Reduced

Reparameterized Unified Model ([Reduced RUM], Hartz, Roussos, & Stout, 2002) was

used to model students‟ writing performance, and its outcomes were evaluated in order to

justify the assumption.
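For readers unfamiliar with the model, the descriptor (item) response function of the Reduced RUM, in the form commonly cited from Hartz et al. (2002) and with the residual ability term of the full RUM omitted, can be sketched as

\[
P(X_{id}=1 \mid \boldsymbol{\alpha}_{i}) \;=\; \pi^{*}_{d}\prod_{k=1}^{K}\left(r^{*}_{dk}\right)^{q_{dk}\,(1-\alpha_{ik})}
\]

where \(\alpha_{ik}\) indicates whether student \(i\) has mastered skill \(k\), \(q_{dk}\) is the Q-matrix entry indicating whether descriptor \(d\) requires skill \(k\), \(\pi^{*}_{d}\) is the probability that a student who has mastered all required skills satisfies descriptor \(d\), and \(r^{*}_{dk}\) is the penalty incurred when a required skill has not been mastered. High \(\pi^{*}\) values (close to 1) and low \(r^{*}\) values (well below 0.9) therefore signal descriptors with strong diagnostic power.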

Various skills diagnosis measures supported the stability and accuracy of the

estimated diagnostic model. When model parameters were estimated using a Markov

Chain Monte Carlo (MCMC) algorithm, the overall pattern of the Markov Chain plots

indicated that convergence had occurred for most of the parameter estimates. The

descriptor parameters also supported the robustness and informativeness of the estimated model, with most of the π* values close to 1 and most of the r* values smaller than 0.9; only six descriptor parameters were eliminated from the initial Q-matrix entries. Overall

goodness of model fit was also satisfactory, with predicted score distributions

approximating observed score distributions.

The hierarchy of skill difficulty provided the most insightful findings, as the four

different analytic methods used all echoed the same result. The first approach was

associated with the proportion of skill masters (pk), suggesting that vocabulary use was

the most difficult skill, followed by content fulfillment, grammatical knowledge,

organizational effectiveness, and mechanics. This result was confirmed using skill

mastery classifications that showed that the greatest number of students mastered

mechanics, while the smallest number of students mastered vocabulary use. The most

common skill mastery pattern in each number of categories also provided additional

evidence; mechanics was the first skill that students mastered, as indicated by the

mastery pattern “00001”, while vocabulary use was the last skill that students mastered,

as indicated by “11101”. Finally, the skill mastery pattern across writing proficiency

levels showed that a substantial number of students in the advanced group had mastered

more difficult skills such as grammatical knowledge and vocabulary use, while the

majority of those in the beginner group mastered easier skills such as organizational

effectiveness and mechanics. These psychometric findings were consistent with ESL

writing research indicating that vocabulary use and content are the essential elements

characterizing high-level essays (Milanovic et al., 1996).
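The mastery patterns quoted above are binary vectors over the five skills; the ordering assumed in the sketch below (content fulfillment, organizational effectiveness, grammatical knowledge, vocabulary use, mechanics) is inferred from the patterns "00001" and "11101" themselves, and the student profiles are invented for illustration.

```python
# Five skills in the order implied by the quoted patterns, so that
# "00001" = only mechanics mastered and "11101" = all but vocabulary use
SKILLS = ["content fulfillment", "organizational effectiveness",
          "grammatical knowledge", "vocabulary use", "mechanics"]

def decode(pattern):
    """Turn a mastery pattern such as '11101' into a skill -> mastered mapping."""
    return {skill: flag == "1" for skill, flag in zip(SKILLS, pattern)}

print(decode("11101"))

# Hypothetical estimated profiles for a handful of students (not study data)
profiles = ["00001", "00001", "11101", "10101", "11111", "00101"]

# Proportion of masters per skill, i.e. the p_k statistic discussed above
for i, skill in enumerate(SKILLS):
    p_k = sum(p[i] == "1" for p in profiles) / len(profiles)
    print(f"{skill:30s} p_k = {p_k:.2f}")
```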


The skills probability distribution also supported the diagnostic quality of the

student writing skill profiles. The estimated model generated various types of skills

mastery profiles; students who fell into the flat categories accounted for only some

portion of the total, indicating the model‟s high discriminant function. The consistency of

the skill classification was another valuable source that supported the quality of the

estimated model. Although mechanics showed slightly lower reliability indices compared

to other skills, overall consistency was high, further confirming that the estimated model

generated reliable diagnostic skill profiles.

Despite these encouraging results, evidence was also found that could undermine

the validity claim. Approximately 34% of the descriptors exhibited poor diagnostic

power, failing to effectively discriminate masters from non-masters. These descriptors

were relatively easy compared to the others, with proportion-correct scores greater than the mean across all descriptors. In particular, D34 (indentation) appeared the most problematic. It had an extremely low π* value, suggesting that students who had

mastered mechanics were not likely to correctly execute that skill in order to

appropriately indent the first sentence of each paragraph in their writing. This finding is

consistent with Polio‟s (2001) speculation that mechanics consists of various

heterogeneous components (such as indentation, capitalization, spelling, and

punctuation) that are difficult to combine into a unitary construct.
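One simple way to see why a descriptor such as D34 carries little diagnostic power is to compare how often estimated masters and non-masters of the relevant skill satisfy it: a large positive gap signals strong discrimination, while a small gap, or masters frequently failing the descriptor, signals weak or problematic behaviour. The sketch below uses invented outcomes, not the study's estimates.

```python
# Hypothetical dichotomous outcomes on one descriptor, split by estimated
# mastery of the skill it measures (1 = descriptor satisfied, 0 = not)
masters_outcomes     = [1, 0, 1, 0, 0, 1, 0]   # students classified as masters
non_masters_outcomes = [0, 0, 1, 0, 0, 0, 1]   # students classified as non-masters

def proportion(outcomes):
    return sum(outcomes) / len(outcomes)

p_master     = proportion(masters_outcomes)
p_non_master = proportion(non_masters_outcomes)

# A diagnostically strong descriptor shows P(yes | master) close to 1 and a
# large gap; here both are low, mirroring the weak descriptors described above
print(f"P(yes | master)     = {p_master:.2f}")
print(f"P(yes | non-master) = {p_non_master:.2f}")
print(f"discrimination gap  = {p_master - p_non_master:.2f}")
```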

The instability of mechanics was also observed in further analysis. When the

skill mastery profiles were examined across essay prompts, a similar mastery proportion

was found for all skills but mechanics. Specifically, students in the intermediate group

exhibited a drastically different skill mastery pattern for mechanics across the two essay

prompts. These results were somewhat unexpected, since the bias analysis examining the

interaction between the descriptors and the essay prompts did not find any evidence that

the descriptors functioned differently across the two different essay prompts. One

possible explanation is that the size of the bias was negligible at the descriptor level, but

when the descriptors were added up to form a skill, it became substantial. More

psychometrically rigorous analyses could better address the interaction effect.

The results of the case analysis provided more questions than answers. The six

selected cases had drastically different skill profiles despite their similar observed scores.


If students were provided with both a total observed score and estimated skill profile,

they could be confused or could interpret their writing skill proficiency in conflicting ways.

However, it is also possible that this discrepancy could be interpreted as highlighting the

need for diagnostic skill profiles. Unlike a single observed score, which masks precise

and detailed information, diagnostic skill profiles specifically point to the areas in which

students show strengths and weaknesses. Care should therefore be taken when a

diagnostic score report is created and given to students.

The diagnostic usefulness of ESL academic writing skill profiles was examined

based upon both positive and negative evidence needed to justify the interpretive

argument. With a few exceptions, the results of the psychometric analyses suggested that

the estimated diagnostic model was robust, providing useful and sufficient diagnostic

information about student ESL academic writing performance. However, quantitative evidence alone is limited in its capacity to support the validity claim. The next validity

assumption touches upon qualitative findings, focusing on the potential impact and

consequences of using the checklist in ESL academic writing classes.

Validity Assumption 5: The EDD checklist helps teachers make appropriate diagnostic

decisions and has the potential to positively impact teaching and learning ESL academic

writing.

The fifth and final assumption concerned the extent to which the EDD checklist

helped teachers make appropriate and confident diagnostic decisions, and the extent to which teachers perceived the checklist's diagnostic usefulness positively. If teachers reported that the

checklist helped them to make appropriate and confident diagnostic decisions and had the

potential to positively impact diagnosing ESL academic writing skills and improving

their instructional practices, it would support the validity claim. Teacher perceptions and

judgments about the use of the EDD checklist were explored primarily using

questionnaire and interview data.

Of the many comments collected about the checklist and its use, a few are worth

noting. Teachers generally considered the checklist to be an effective diagnostic

assessment tool. As discussed in Chapter 6, they appreciated the checklist‟s

comprehensiveness and acknowledged that it covered multiple aspects of ESL academic


writing. They also commented that the breakdown of writing skills helped them know

what to look for when assessing essays. Similarly positive evaluations were made in the

questionnaire responses, with most teachers reporting that the checklist was clear and

understandable and that they liked using it. Reported teacher confidence in using the

checklist was also high.

However, teachers also raised some potential issues, specifically whether the

checklist could help them to make appropriate diagnostic decisions. Of particular concern

was that student writing ability could not be fairly measured without taking essay length

into consideration. According to their observations, longer essays tended to be judged

more harshly because writers of longer essays had more opportunity to make mistakes.

Well-written essays were also sometimes scored worse than poorly-written ones because

the risk-taking strategies of more advanced writers led to additional mistakes and a resulting loss of points. In other cases, shorter essays were harder to judge because they

did not provide enough evidence to meet the evaluation criteria. These problems were all

related to the characteristics of an analytic rating system; the act of writing might remain

more than the sum of its parts and the analytic approach might not be able to

appropriately capture students‟ genuine writing ability, resulting in biased scores. From a

different perspective, these problems are also directly related to the need for diagnostic

skill profiles. As discussed in the case analysis, student writing skill profiles can be

drastically different despite similar observed scores. This indicates that a single observed

score could provide an inaccurate estimate of a student‟s writing proficiency because it

masks specific information about that student‟s strengths and weaknesses.

The potential impact of EDD assessment also drew particular attention. Teachers

generally felt that using the checklist could have a positive impact on their classroom

instruction. One teacher noted that teachers could greatly benefit from this kind of

diagnostic assessment tool because it would help them to identify not only the areas in

which students are experiencing difficulty, but also the areas in which they require

individual attention. Teachers also felt that using the checklist could have some negative

impact; a few cautioned that too much diagnostic feedback (such as marking all grammar

errors) could demotivate and disempower students, and that the amount and nature of the

feedback offered should be carefully determined as a result. Teachers also suggested that


not all feedback should be provided at once, since students at different proficiency levels

would require different treatment. Indeed, some teachers claimed that specific diagnostic

information would be particularly useful for motivated students, but less effective for less

proficient students.

Despite the teachers‟ reluctance to provide detailed and thorough feedback to

students, research suggests that students want to receive substantial feedback from their

teachers. Surveys on student feedback preferences have found that students are inclined

to receive, attend to, and address feedback on all aspects of their writing (Cohen &

Cavalcanti, 1990; Ferris, 1995, 2003; Ferris & Roberts, 2001; Hedgcock & Lefkowitz,

1994; Hyland, 1998; Lee, 2004; Leki, 1991; Zhang, 1995). Meanwhile, research findings

on the effect of teacher feedback are somewhat contradictory; while some researchers

have urged teachers to provide one type of feedback at a time (e.g., Hughey et al., 1983),

others have noted that ESL writing students can deal with multiple types of feedback on

the same draft (e.g., Boiarsky, 1984; Fathman & Whalley, 1990). Although the findings

are inconclusive, a consensus appears to have been reached in the ESL writing literature:

students appreciate clear, concrete, and specific feedback (Ferris, 1995; Straub, 1997). If

the students‟ needs for diagnostic feedback are taken seriously, the EDD checklist could

be used to provide such feedback.

The intended and unintended consequences of EDD assessment are another point

worth noting. Teachers cautioned that student writing ability should not be determined by

a single assessment because students do not make the same mistakes all the time and their

performance does not remain static. It was also pointed out that students might

misinterpret the specific diagnostic information derived from a single assessment. If the

diagnostic feedback provided to students is outdated and does not capture student writing

progress appropriately, it could unintentionally deliver the wrong message. However, if

teachers provide timely, longitudinal feedback on multiple drafts, taking theories of development in ESL writing into consideration, these unintended negative consequences would be reduced.


Implications

Theoretical Implications

Usefulness of an Empirical Approach to Scale Development

The results of this study support the idea that an empirical approach is useful

when developing an assessment scheme (Brindley, 1998; Fulcher, 1987, 1993, 1996b,

1997; Upshur & Turner, 1995, 1999). As many researchers have pointed out, the most

serious problem with intuition-based or a priori rating scales is that it is not always clear

how the scale descriptors were created (or assembled) and calibrated (e.g., Brindley,

1998; Chalhoub-Deville, 1997; de Jong, 1988; Lantolf & Frawley, 1985; North, 1993;

Pienemann, Johnson, & Brindley, 1988; Upshur & Turner, 1995). In light of these

problems, this study aimed to demonstrate the benefits of an assessment scheme

developed using an empirical approach. Not only did teachers‟ think-aloud verbal

protocols provide rich verbal descriptions of the criteria to be assessed, but a series of conditional covariance-based nonparametric dimensionality analyses was also utilized to empirically identify their dimensional structure. A theoretical analysis further

confirmed the assessment criteria. These findings demonstrate the effectiveness of the

empirical approach to assessment scheme development, and underscore the importance

of its use for diagnostic purposes, as the identification of specific assessment elements is

the most important procedure in implementing diagnostic assessment.

Integration of Feedback Research in L2 Writing and the Diagnostic Approach in

Educational Assessment

This study also filled the gap between feedback research in L2 writing and the

diagnostic approach in educational assessment. Although they have the same overarching

goal, the focus of research in these two areas lies in different directions. Most feedback

research in L2 writing examines the effect of different types of feedback using a

qualitative method or case studies, while diagnostic educational assessment is focused

primarily on developing and implementing a psychometric diagnostic model using large-

scale test data. Recent technological advancements integrating diagnostic feedback into L2 writing also have certain limitations; automated feedback programs, such as the E-Rater and Criterion® developed by the Educational Testing Service (ETS), are limited in their assessment of writing constructs, since they are focused on more narrowly defined artifacts of ESL writing skills (Hyland & Hyland, 2006).

This study attempted to expand the scope of feedback research in L2 writing by

introducing a new measurement technique, cognitive diagnostic assessment (CDA). The

CDA technique used in this study, the Reduced RUM, provided a robust diagnostic

model that generated useful and sufficient diagnostic skill profiles for student ESL

academic writing performance. The findings from this study suggest that a psychometric

diagnostic model can be applied to feedback research in L2 writing in order to

formatively assess fine-grained ESL writing processes and structures in a test domain,

thereby opening a much-needed avenue for additional research in this area.

Classification of L2 Writing Scales

This study further reconceptualized current L2 writing scale classifications.

Despite an increasing need for diagnostic assessment, very few scales (e.g., Knoch‟s

[2007] diagnostic ESL academic writing scale) have been developed to offer such

assessment in L2 academic writing. In the L2 writing assessment literature, rating scales

are classified primarily as holistic, analytic, or primary trait scales (based on scoring

methods) or as user-oriented, assessor-oriented, or constructor-oriented (based on

assessment purpose), with little consideration to their formative or summative nature. In

response, this study developed and validated a diagnostic ESL writing assessment

scheme, contributing to the current L2 writing scale literature.

Usefulness of Argument-based Approaches to Validity

By building and supporting arguments for the score-based interpretation and use

of the EDD checklist in ESL academic writing, this study demonstrated the usefulness of

an argument-based approach to validity. First proposed by Kane (1992), this approach suggests that

test-score interpretation is associated with a chain of interpretive arguments, and that the

plausibility of those arguments determines the validity of test-score interpretations. In

this study, the central research questions were formulated based upon the logical process

of the argument-based approach to validity, guiding a set of comprehensive procedures for

the development of the checklist and justifying its score-based interpretations and uses.


Although this study did not explicitly propose rebuttals (i.e., counter-arguments) against the use of the EDD checklist, this evidentiary reasoning process made it possible to address

various aspects of validity inferences and to examine assumptions pertaining to different

types of evidence. It also demonstrated that a coherent and unified set of procedures

guides test developers and helps assessment users to formulate and justify their

interpretations and assessment decisions (Bachman, 2005; Kane, 2001). This argument-

based approach to validation provides an overarching framework that could offer greater

insight into ESL research problems.

Practical Implications

Development of a Diagnostic Score Report Card

A well-developed diagnostic assessment scheme can make major contributions

to instructional practice and can have direct implications for student learning. It will be

useful not only for ESL teachers to identify the areas in which ESL students most need

improvement and track their progress, but also for ESL students themselves to monitor

and guide their learning processes. The diagnostic approach will also be of value to

curriculum developers, who are charged with designing effective ESL curricula in order

to maximize educational benefits.

One way of providing such a benefit is through the development of a diagnostic

score report card. As discussed earlier, effective diagnostic feedback is characterized as

something that is concrete, descriptive, and fairly direct, and that addresses all aspects of the

performance to be assessed, so that students can interpret the results and take appropriate

future action (Alderson, 2007; Black & Wiliam, 1998; Ferris, 1995, 2003; Shohamy,

1992; Spolsky, 1990; Straub, 1997). At the same time, the purpose of diagnostic

feedback is to inform diverse teaching and learning stake-holders (Nichols, 1994;

Nichols, Chipman, & Brennan, 1995; Leighton & Gierl, 2007; Pellegrino & Chudowsky,

2003). As Shohamy (1992) argued,

The main reason that tests can be useful is that they can provide administrators,

teachers, and students with valuable information about and insight into teaching

and learning. This information can then be utilized to improve learning and

teaching. For example, information obtained from tests can provide evidence of

students‟ ability over a whole range of skills and subskills, achievement and

proficiency, and on a continuous basis. (p. 514)


If student performance can be tracked over time, taking effort, improvement, and

progression into account, the positive impact on both teaching and learning would be

enormous. Along the same lines, if different types of diagnostic information can be

delivered to different types of stake-holders, the benefits of diagnostic feedback would be maximized.

A hypothetical score report card was developed to provide a student named

Junko with diagnostic information about her ESL writing performance (see Figure 22).

She was assumed to have written an essay on one of the two prompts used in this study.

An adaptation of Jang's (2005, 2009a) DiagnOsis, this hypothetical version consisted of

four parts: (a) overall writing ability, (b) writing score, (c) writing skills profile, and (d)

writing skills that need to be improved, with each part written in simple enough language

for Junko to understand the report easily. The first part of the report card, Your Overall

Writing Ability, describes Junko's overall performance, pointing out the writing skills in

which she is most and least proficient. Notably, overall holistic writing proficiency levels

such as Level 1, Level 2, Level 3, or Level 4 are not reported, since the skills diagnosis

approach does not render a single composite score or level that can mask a student's

strengths and weaknesses.
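As a rough sketch of how the four parts might be assembled, the data structure below represents the report card in Python; the field names are hypothetical, the subscores follow Figure 22, and the probability values are merely illustrative.

    # Hypothetical layout of the four-part diagnostic report card (illustrative values).
    report_card = {
        "student": "Junko",
        "overall_writing_ability": "Strong grammar and vocabulary; weaker content, "
                                   "organization, and writing conventions.",
        "writing_score": {                # points earned per descriptor difficulty band
            "easy":      {"earned": 6, "out_of": 7},
            "medium":    {"earned": 7, "out_of": 16},
            "difficult": {"earned": 7, "out_of": 12},
        },
        "writing_skills_profile": {       # posterior probability of mastery per skill
            "CON": 0.35, "ORG": 0.20, "GRM": 0.85, "VOC": 0.80, "MCH": 0.55,
        },
        "skills_to_improve": ["ORG", "CON", "MCH"],
    }

Note that no overall level or total score appears in the structure, consistent with the decision not to report them.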

The second part of the report card, Your Writing Score, presents the number of

points earned by Junko across the 35 descriptors, classified into easy, medium, and

difficult categories based upon the difficulty measures estimated by the MFRM analysis.

The cut-off for each category was determined by visually inspecting the borderlines at

which the three descriptor clusters were distinctly separated. The easy category included 7

descriptors with difficulty measures ranging from -1.82 to -0.64 logits; the medium

category included 16 descriptors with difficulty measures ranging from 1.03 to 1.19

logits; and the difficult category included 12 descriptors with difficulty measures ranging

from 0.92 to 1.09 logits. The subscores that Junko earned across the three categories are

also reported. As with overall skill proficiency, however, the observed total score is not

reported because it could misrepresent Junko's true writing ability. As the case analysis

demonstrated earlier, Junko's writing competence would differ fundamentally from that

of someone with the same total score who used descriptors with different difficulty

measures.
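As a minimal sketch of this step, the following Python fragment bands descriptors by their MFRM difficulty measures and tallies a subscore per band; the difficulty values, scores, and exact cut-offs are illustrative assumptions standing in for the visually inspected borderlines described above.

    # Band descriptors by MFRM difficulty and tally subscores per band
    # (difficulty measures, scores, and cut-offs below are illustrative only).

    def band(difficulty, easy_cut=-0.64, hard_cut=0.92):
        if difficulty <= easy_cut:
            return "easy"
        if difficulty >= hard_cut:
            return "difficult"
        return "medium"

    difficulty = {1: -1.20, 2: 0.10, 3: 0.95, 4: -0.80, 5: 1.05}   # logits
    score      = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0}                    # 1 = got a point

    earned = {"easy": 0, "medium": 0, "difficult": 0}
    total  = {"easy": 0, "medium": 0, "difficult": 0}
    for d, diff in difficulty.items():
        b = band(diff)
        earned[b] += score[d]
        total[b]  += 1

    for b in earned:
        print(f"{b}: {earned[b]}/{total[b]}")   # e.g., easy: 2/2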


The third part of the report card, Your Writing Skills Profile, provides a detailed

description of Junko's performance across the five writing skills, with a bar graph

summarizing her mastery level for each skill. Instructions on how to read the graph are

provided next to it. The graph further classifies Junko's writing performance into mastery,

undetermined, and non-mastery states. A detailed description of her performance is then

provided for each skill on the next two pages. It is noteworthy that proficiency levels are

attached to each skill based upon Junko's posterior probability of mastery (ppm) for the

five writing skills. A carefully-designed standard-setting procedure might help to

accurately determine her skill proficiency levels for the five writing skills. In each skill

category, the skill definition is presented along with the characteristics of a competent

writer in that skill. Specific descriptors from which Junko earned points were also

presented.
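The mastery classification underlying the bar graph can be sketched as a simple thresholding of the posterior probability of mastery; the 0.4 and 0.6 cut-points follow the interpretation guide shown in Figure 22, while the ppm values themselves are illustrative.

    # Classify each writing skill from its posterior probability of mastery (ppm),
    # using the 0.4 / 0.6 bands described in the report card (ppm values illustrative).

    def mastery_state(ppm):
        if ppm < 0.4:
            return "non-mastery"
        if ppm <= 0.6:
            return "undetermined"
        return "mastery"

    ppm_by_skill = {"CON": 0.35, "ORG": 0.20, "GRM": 0.85, "VOC": 0.80, "MCH": 0.55}
    for skill, ppm in ppm_by_skill.items():
        print(f"{skill}: ppm = {ppm:.2f} -> {mastery_state(ppm)}")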

The fourth and final part of the report card describes Writing Skills that Need to

be Improved, and provides examples of ways in which they can be improved. If Junko's

ESL writing teacher then expands this part with detailed guidelines, she could take

appropriate future action.

Care must be taken in using the diagnostic assessment report card. First and

foremost, the report card should be used to inform Junko, not other stake-holders, and not

for other purposes. As Alderson (1991) noted, assessment purposes and audiences are the

critical factors that must be considered in any assessment context. A different type of

score report should thus be developed to provide other stake-holders with diagnostic

information about Junko's writing performance. It should also be noted that the EDD

checklist was developed for use in an academic context, and should not be used

without considering the context in which an assessment takes place. Second, the ways in

which Junko's writing skill mastery is classified and interpreted need to be carefully

determined. The Reduced RUM provides a limited mastery standing, including only

mastery, non-mastery, and undetermined states. A more finely-classified mastery

standing might be more informative for use in describing student ESL academic writing

ability. Finally, technological advances could enable this report card to be incorporated

into computer-assisted assessment, allowing writing samples to be automatically scored

and students to receive immediate individualized diagnostic feedback on their writing.


Diagnostic ESL Writing Profile (Student Name: Junko Sawaki)

Your Overall Writing Ability
You demonstrated excellent grammar and vocabulary knowledge in your essay. You were also able to apply a variety of English grammar rules effectively and use the appropriate words in the given context. However, your writing skills were relatively weak in terms of constructing good content, organizing the structure of your essay, and following English writing conventions. In particular, you were least successful at presenting a clear topic sentence, presenting a unified idea in each paragraph, and expanding your ideas well throughout each paragraph.

Your Writing Score
Each of the 35 descriptors (D1-D35) is listed with its difficulty (D = difficult, M = medium, E = easy) and a mark showing whether you got a point, did not get a point, or the descriptor did not apply. Difficult descriptors: 3, 5, 6, 7, 8, 10, 14, 19, 26, 27, 29, 31. Medium descriptors: 1, 2, 4, 11, 12, 13, 15, 17, 18, 20, 23, 24, 28, 30, 32, 34. Easy descriptors: 9, 16, 21, 22, 25, 33, 35.
You got points for 7/12 difficult descriptors, 7/16 medium descriptors, and 6/7 easy descriptors.

Your Writing Skills Profile
Five writing skills were assessed based on the essay that you wrote. These are Content Fulfillment (CON), Organizational Effectiveness (ORG), Grammatical Knowledge (GRM), Vocabulary Use (VOC), and Mechanics (MCH). A bar graph shows the probability of mastery (from 0.0 to 1.0) for each of the five skills.
How to interpret the graph
o The graph illustrates the degree to which you have mastered each of the five writing skills.
o If the bar does not reach 0.4 of the probability area, you might need to improve the skill.
o If the bar lies between 0.4 and 0.6 of the probability area, it is difficult to determine your level of mastery.
o If the bar stretches beyond 0.6 of the probability area, you may have mastered that particular skill.

CONTENT FULFILLMENT ★★★☆*
Content fulfillment assesses the degree to which a writer satisfactorily addresses a given topic. A writer who shows strength in this area generally demonstrates an excellent understanding of the topic by presenting clear and substantial arguments supported by specific examples.
You might need more work in:
1** Understanding a given question and answering accordingly.
2 Writing a clear essay that can be read without causing any comprehension problems for readers.
4 Presenting a clear thesis statement.
6 Providing enough supporting ideas and examples.
8 Providing specific and detailed supporting ideas and examples.
You might be able to:
3 Write concisely and present few redundant ideas or linguistic expressions.
5 Make strong arguments.
7 Provide appropriate and logical supporting ideas and examples.

ORGANIZATIONAL EFFECTIVENESS ★☆☆☆
Organizational effectiveness assesses the way in which a writer organizes and develops his or her ideas. A writer who is competent in this area generally demonstrates the ability to construct and develop a paragraph effectively and to connect textual elements well, both within and between paragraphs, using appropriate cohesive and transitional devices.
You might need more work in:
10 Presenting a clear topic sentence that ties to supporting sentences in each body paragraph.
11 Presenting one distinct and unified idea in each paragraph.
13 Developing or expanding ideas well throughout each paragraph.
You might be able to:
9 Organize your ideas into paragraphs and include an introductory paragraph, a body, and a concluding paragraph.
12 Connect each paragraph to the rest of the essay.
14 Use linking words effectively.

GRAMMATICAL KNOWLEDGE ★★★★
Grammatical knowledge assesses the extent to which a writer demonstrates consistent ability to properly apply the rules of English grammar. A well-written essay adheres to English grammar rules with full flexibility and accuracy, and displays a variety of syntactic structures and few linguistic errors.
You might need more work in:
17 Making complete sentences.
18 Connecting independent clauses correctly.
19 Using grammatical or linguistic features correctly in order not to impede comprehension.
20 Using verb tenses appropriately.
25 Making pronouns agree with their referents.
You might be able to:
15 Use a variety of sentence structures.
16 Demonstrate an understanding of English word order.
21 Demonstrate consistent subject-verb agreement.
22 Use singular and plural nouns appropriately.
23 Use prepositions appropriately.
24 Use articles appropriately.

VOCABULARY USE ★★★★
Vocabulary use assesses the extent to which a writer demonstrates great depth and breadth of vocabulary knowledge. A writer who is strong in this area generally uses a broad range of sophisticated words, knows how to combine words, and displays accurate knowledge of word form and usage.
You might need more work in:
26 Using sophisticated or advanced vocabulary.
You might be able to:
27 Use a wide range of vocabulary.
28 Choose appropriate vocabulary to convey the intended meaning.
29 Combine and use words appropriately.
30 Use appropriate word forms (noun, verb, adjective, adverb, etc.).

MECHANICS ★★★☆
Mechanics assesses the extent to which a writer follows the conventions of English academic writing. A writer who is strong in this area generally demonstrates correct use of spelling, punctuation, capitalization, and indentation.
You might need more work in:
32 Using punctuation marks appropriately.
You might be able to:
31 Spell words correctly.
33 Use capital letters appropriately.
34 Indent each paragraph appropriately.
35 Use appropriate tone and register throughout the essay.

Writing Skills that Need to be Improved
Learn more about how to effectively organize an essay structure. Before writing an essay, you might need to think about what the thesis of your essay is and what the topic sentences for each paragraph will be. A good essay generally presents a clear thesis statement in the introduction and topic sentences at the beginning of the body paragraphs. Also, try to present one distinct idea in each paragraph. When more than one idea is presented in a single paragraph, it weakens your arguments and causes comprehension difficulties for readers. In addition, whenever you present your idea, try to expand it fully throughout each paragraph. If you develop your writing skills in these areas, you will be a more competent writer!

* The number of black stars indicates the level of skill proficiency (e.g., ★★★☆ = Level 3).
** The superscript number indicates the number of a descriptor.

Figure 22. An example of the diagnostic ESL writing profile


Suggestions for Future Research

This study has addressed issues arising at the intersection of three areas of

research: (a) ESL academic writing, (b) diagnostic assessment, and (c)

empirical methods in scale development. Despite increasing interest in and need for a

diagnostic approach, few diagnostic assessment schemes have been developed that align

with the empirical and theoretical sources of ESL academic writing. In response, a new

diagnostic assessment scheme for ESL academic writing, called the Empirically-derived

Descriptor-based Diagnostic (EDD) checklist, was developed and its score-based

interpretations and uses were validated. The checklist's validation process opened

possibilities for future research in the areas discussed below.

Had data been collected from students, the EDD checklist could have

incorporated their cognitive processes. Although the checklist was constructed using

think-aloud verbal data from teachers, which focused on what they considered important,

it was unable to fully reflect the actual writing knowledge, processes, and strategies

exhibited by students in their writing. If these writing processes could have been

observed using students' introspective or retrospective verbal protocols, the writing

abilities to be assessed might have been better understood, and a more valid assessment

tool would have been created. Research calling for greater incorporation of students'

perspectives is scarce in current scale development literature, and further research is

warranted in this area.

Another area for further investigation is the application of CDA models to

polytomous data. Although the teachers were fully aware that writing competence cannot

be dichotomized, they were asked to make binary choices while using the checklist. This

lack of a continuum on which writing performance can be measured increased their

psychological load and resulted in low inter-teacher reliability. Although teachers' rating

data could have been gathered using a scale and then artificially dichotomized (e.g.,

“strongly disagree” and “somewhat disagree” = no, “strongly agree” and “somewhat

agree” = yes), this method was not considered because it could distort teachers' decisions

and manipulate the data. The checklist lacked a rating scale primarily because most

current CDA models do not deal with polytomous data, and while a few (e.g., RUM

[Hartz et al., 2002], General Diagnostic Model [von Davier, 2005]) have begun to take


polytomous data into account, their robustness has not been intensively examined with

real item response data. The sample size needed to handle polytomous data was another

concern, since those models require a larger sample in order to estimate the greater

number of parameters. If these problems can be resolved, the EDD checklist could be

revised to include a scale that enables teachers to make multi-level decisions about

student writing performance.
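For illustration only, the dichotomization considered and rejected above would amount to the following collapse of a four-point agreement scale; the sample ratings are hypothetical.

    # Collapse a four-point agreement scale into the binary format most current CDA
    # models require -- the approach considered but rejected in this study.
    collapse = {
        "strongly disagree": 0,   # -> no
        "somewhat disagree": 0,   # -> no
        "somewhat agree":    1,   # -> yes
        "strongly agree":    1,   # -> yes
    }

    teacher_ratings = ["somewhat agree", "somewhat disagree", "strongly agree"]
    print([collapse[r] for r in teacher_ratings])   # [1, 0, 1]; gradations are lost

The mapping makes visible what is lost: two teachers who differ only in the strength of their agreement become indistinguishable once the scale is collapsed.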

Along the same lines, further consideration is needed as to whether theories of

ESL writing development truly correspond to the underlying assumptions of the CDA

model used in this study. The Reduced RUM uses a discrete representation of knowledge

structure, dichotomizing skill mastery probability into mastery and non-mastery.

However, it is doubtful that such dichotomization is conceptually possible in ESL writing

assessment. For example, if a student is judged to be a master of grammatical knowledge,

does it mean that he or she has absolutely mastered that skill, and does not need to work

on that area further? It would be interesting to explore the potential of latent trait CDA

models (e.g., MIRT-C [Reckase & McKinley, 1991], MIRT-NC [Sympson, 1977]), so

that student knowledge structures can be scaled according to a continuous ability

continuum. Similarly, interaction among writing skills must be further examined.

Although the Reduced RUM assumed a conjunctive interaction among writing skills, this assumption

was not substantively examined. If students can get a point on a descriptor without

executing all of the skills required for that descriptor, compensatory CDA models may be

equally suitable for estimating student writing ability. Future research could investigate

which CDA model best represents theories of ESL writing.
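The contrast at issue can be sketched as two different rules for combining the skills a descriptor requires; the required skills and mastery pattern below are hypothetical, and neither rule is offered here as the model actually fitted in this study beyond the conjunctive assumption already described.

    # Conjunctive rule (as assumed by the Reduced RUM): a point is likely only when
    # every required skill is mastered. Compensatory rule: mastered skills can
    # offset non-mastered ones. Values below are illustrative.

    required = ["CON", "ORG"]    # skills a hypothetical descriptor requires
    alpha = {"CON": 1, "ORG": 0, "GRM": 1, "VOC": 1, "MCH": 0}   # mastery pattern

    conjunctive_success = all(alpha[s] == 1 for s in required)
    compensatory_credit = sum(alpha[s] for s in required) / len(required)

    print(conjunctive_success)   # False -- ORG is not mastered
    print(compensatory_credit)   # 0.5   -- mastery of CON partially compensates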

Of particular interest are potential applications for CDA in the area of integrative

assessment. Alderson (2005) speculated that a diagnostic approach is not encouraged in

direct writing assessment because it is tailored to assess discrete low-level language

abilities rather than integrative higher-order skills. Despite this assumption, this study used

a CDA model to diagnose student ESL writing competence. The EDD checklist broke

down global writing ability into specific measurable elements, with performance

assessed based upon the extent to which these elements were mastered. However,

whether a discrete-point method can ever define and assess the construct of ESL writing

remains unclear. The extent to which the discrete-point method in a CDA model can


explain the act of human writing encountered in real-life situations is also uncertain. If

these problems continue to threaten the knowledge representation and the authenticity of

assessments, further research could explore alternative ways of diagnosing higher-order

global writing ability.

A new diagnostic ESL writing test created in collaboration with teachers and test

developers would also be worth pursuing. The current practice of retrofitting CDA models to

existing non-diagnostic tests is problematic (DiBello et al., 2007; Jang, 2009a; Lee &

Sawaki, 2009b); indeed, the ESL essays used to develop the EDD checklist were

originally a part of a non-diagnostic test, so it is not known whether the checklist would

take a different form if essays written for diagnostic purposes were used. A computer

technology-assisted assessment system could be a promising resource for effectively

delivering new diagnostic tests and feedback in this regard. In an ESL writing context, it

could mean that students would be asked to complete a writing task online, and would

receive immediate feedback tailored to their performance. The authenticity of such a test

would be greater if it could present the target language features found in real-life situations,

such as sending an email or posting comments on a web site.

The final recommendation concerns the EDD checklist's use in real

classroom teaching and learning settings. This study examined the checklist's

effectiveness with limited use and was not able to explore how the checklist would be

used in a real ESL academic writing class due to logistical problems. The next logical

step would thus be to observe how teachers and students (who did not participate in the

current study) might use the checklist in actual practice, in order to interpret assessment

outcomes from a longitudinal perspective. Teachers might want to use the checklist to

track students' writing performance over time, so that students receive both short- and

long-term feedback. This continued investigation would be particularly important,

considering that current research in ESL writing focuses on process-oriented writing

instruction in which students revise and resubmit multiple drafts of their work. As

Watanabe (2004) noted, claims about long-term washback may be challenged without examining

the substantial continued effects of an assessment.


REFERENCES

Alderson, J. C. (1990a). Testing reading comprehension skills (Part one). Reading in a

Foreign Language, 6, 425-438.

Alderson, J. C. (1990b). Testing reading comprehension skills (Part two): Getting

students to talk about taking a reading test (A pilot study). Reading in a Foreign

Language, 7, 465-503.

Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Ed.), Language

testing in the 1990s (pp. 71-86). London: Macmillan.

Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between

learning and assessment. London: Continuum.

Alderson, J. C. (2007). The challenge of (diagnostic) testing: Do we know what we are

measuring? In J. Fox, M. Wesche, D. Bayliss, L. Cheng, C. Turner, & C. Doe

(Ed.), Language testing reconsidered (pp. 21-39). Ottawa: University of Ottawa

Press.

Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as

embodied in test questions. Reading in a Foreign Language, 5, 253-270.

American Council on the Teaching of Foreign Languages (ACTFL). (2001). ACTFL

proficiency guidelines. Hastings-on-Hudson, NY: ACTFL.

Arnaud, P. J. L. (1992). Objective lexical and grammatical characteristics of L2 written

composition and the validity of separate-component tests. In P. J. L. Arnaud & H.

Béjoint (Ed.), Vocabulary and applied linguistics (pp. 133-145). London:

Macmillan.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford

University Press.

Bachman, L. F. (2003). Constructing an assessment use argument and supporting claims

about test taker-assessment task interactions in evidence-centered assessment

design. Measurement: Interdisciplinary Research and Perspectives, 1, 63-65.

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment

Quarterly, 2, 1-34.

Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of

communicative proficiency. TESOL Quarterly, 16, 449-464.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford

University Press.

Bachman, L. F., & Savignon, S. J. (1986). The evaluation of communicative language


proficiency: A critique of the ACTFL Oral Interview. The Modern Language

Journal, 70, 380-390.

Bardovi-Harlig, K. (1992). A second look at T-unit analysis: Reconsidering the sentence.

TESOL Quarterly, 26, 390-395.

Bardovi-Harlig, K., & Bofman, T. (1989). Attainment of syntactic and morphological

accuracy by advanced language learners. Studies in Second Language

Acquisition, 11, 17-34.

Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating

processes and outcomes. Unpublished doctoral dissertation. University of

Toronto. Canada.

Beaman, K. (1984). Coordination and subordination revisited: Syntactic complexity in

spoken and written narrative discourse. In L. Hamp-Lyons (Ed.), Assessing ESL

writing in academic contexts (pp. 37-49). Norwood, NJ: Ablex.

Bereiter, C., & Scardamalia, M. (1987). The psychology of written composition. Hillsdale,

NJ: Lawrence Erlbaum Associates.

Bernhardt, E. B. (1984). Toward an information processing perspective in foreign

language reading. The Modern Language Journal, 68, 322-331.

Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University

Press.

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in

Education, 5, 7-74.

Boiarsky, C. (1984). What the authorities tell us about teaching writing. Journal of

Teaching Writing, 3, 213-223.

Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F.

Bachman & A. D. Cohen (Ed.), Interfaces between second language acquisition

and language testing research (pp. 112-140). Cambridge: Cambridge University

Press.

Brown, J. D., & Bailey, K. (1984). A categorical instrument for scoring second language

writing skills. Language Learning, 34, 21-42.

Buck, G., & Tatsuoka, K. (1998). Application of rule-space methodology to listening test

data. Language Testing, 15, 118-142.

Canale, M. (1983). From communicative competence to communicative language

pedagogy. In J. C. Richards & R. W. Schmidt (Ed.), Language and

communication (pp. 2-27). London: Longman.


Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to

second language teaching and testing. Applied Linguistics, 1, 1-47.

Casanave, C. P. (1994). Language development in students‟ journals. Journal of Second

Language Writing, 3, 179-201.

Celce-Murcia, M., & Larsen-Freeman, D. (1999). The grammar book: An ESL/EFL

teachers‟ course. Boston: Heinle & Heinle.

Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test

construction. Language Testing, 14, 16-33.

Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical

overview. Research in the Teaching of English, 18, 65-81.

Cobb, T. (2006). Classic VP English version 3.0. Retrieved February 03, 2009, from

http://www.lextutor.ca/vp/.

Cohen, A. D., & Cavalcanti, M. (1990). Feedback on compositions: Teacher and student

verbal reports. In B. Kroll (Ed.), Second language writing: Research insights for

the classroom (pp. 155-177). Cambridge: Cambridge University Press.

Connor, U., & Carrell, P. (1993). The interpretation of tasks by writers and readers in

holistically rated direct assessments of writing. In J. Carson & I. Leki (Ed.),

Reading in the composition classroom (pp. 141-160). Boston: Heinle.

Cooper, C. R. (1977). Holistic evaluation of writing. In C. R. Cooper & L. Odell (Ed.),

Evaluating writing: Describing, measuring, judging (pp. 3-31). Urbana, IL:

National Council of Teachers of English.

Cooper, T. C. (1976). Measuring written syntactic patterns of second language learners of

German. The Journal of Educational Research, 69, 176-183.

Cooper, T. C. (1981). Sentence combining: An experiment in teaching writing. The

Modern Language Journal, 65, 158-165.

Council of Europe. (2001). The Common European Framework of Reference for

Languages: learning, teaching and assessment. Cambridge: Cambridge

University Press.

Creswell, J.W. (2003). Research design: Qualitative, quantitative and mixed methods

approaches. Thousand Oaks, California: Sage.

Crismore, A., Markkanen, R., & Steffensen, M. S. (1993). Metadiscourse in persuasive

writing: A study of texts written by American and Finnish university students.

Written Communication, 10, 39-71.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational


measurement (2nd

ed.) (pp. 443-507). Washington, DC: American Council on

Education.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San

Francisco: Jossey-Bass.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer (Ed.), Test

validity (pp. 3-17). Hillsdale, NJ: Erlbaum.

Crooks, T., Kane, M., & Cohen, A. (1996). Threats to the valid use of assessment.

Assessment in Education, 3, 265-285.

Crowhurst, M. (1987). Cohesion in argument and narration at three grade levels.

Research in the Teaching of English, 21, 185-201.

Cumming, A. (1990). Expertise in evaluating second language compositions. Language

Testing, 7, 31-51.

Cumming, A. (1997). The testing of writing in a second language. In C. Clapham & D.

Corson (Ed.), Encyclopedia of language and education: Volume 7 Language

testing and assessment (pp. 51-63). Dordrecht, Netherlands: Kluwer.

Cumming, A. (1998). Theoretical perspectives on writing. Annual Review of Applied

Linguistics, 18, 61-78.

Cumming, A. (2001). The difficulty of standards, for example in L2 writing. In T. Silva

& P. Matsuda (Ed.) On second language writing (pp. 209-229). Mahwah, NJ:

Lawrence Erlbaum.

Cumming, A. (2002). Assessing L2 writing: Alternative constructs and ethical dilemmas.

Assessing Writing, 8, 73-83.

Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL

2000 prototype writing tasks: An investigation into raters' decision making and

development of a preliminary analytic framework. TOEFL Monograph Series 22.

Princeton, New Jersey: Educational Testing Service.

Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating

ESL/EFL writing tasks: A descriptive framework. The Modern Language

Journal, 86, 67-96.

Cumming, A., Kantor, R., Powers, D. Santos, T., & Taylor, C. (2000). TOEFL 2000

writing framework: A working paper. TOEFL Monograph Series, Report No. 18.

Princeton, NJ: Educational Testing Service.

Cumming, A., & Riazi, A. M. (2000). Building models of adult second language writing

instruction. Learning and Instruction 10, 55-71.


Dandonoli, P., & Henning, G. (1990). An investigation of the construct validity of the

ACTFL proficiency guidelines and oral interview procedure. Foreign Language

Annals, 23, 11-22.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999).

Dictionary of language testing. Cambridge: Cambridge University Press.

de Jong, J. (1988). Rating scales and listening comprehension. Australian Review of

Applied Linguistics, 11, 73-87.

DiBello, L. V., Roussos, L. A., & Stout, W. (2007). Review of cognitively diagnostic

assessment and a summary of psychometric models. In C. R. Rao & S. Sinharay

(Ed.), Handbook of statistics, Volume 26, Psychometrics (pp. 979-1030).

Amsterdam, The Netherlands: Elsevier.

DiBello, L. V., & Stout, W. (2007). Guest editor‟s introduction and overview: IRT-based

cognitive diagnostic models and related methods. Journal of Educational

Measurement, 44, 285-291.

DiBello, L. V., & Stout, W. (2008). Arpeggio documentation and analyst manual (Ver.

3.1.001) [Computer software]. St. Paul, MN: Assessment Systems Corporation.

DiBello, L. V., Stout, W., & Roussos, L. A. (1995). Unified cognitive/psychometric

diagnostic assessment likelihood-based classification techniques. In P. D. Nichols,

S. F. Chipman, & R. L. Brennan (Ed.), Cognitively diagnostic assessment (pp.

361-389). Mahwah, NJ: Erlbaum.

Douglas, J., Kim, H-R., Roussos, L., Stout, W., & Zhang, J. (1999). LSAT dimensionality

analysis for December 1991, June 1992, and October 1992 administrations

(Law School Admission Council Statistical Report 95-05). Newton, PA: LSAT.

Dulay, H., Burt, M., & Krashen, S. (1982). Language Two. New York: Oxford

University Press.

Educational Testing Service (ETS). (2007). TOEFL iBT Tips: How to prepare for the

TOEFL iBT. Princeton, NJ: Educational Testing Service. Retrieved July 3, 2008,

from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Tips.pdf

Engber, C. (1995). The relationship of lexical proficiency to the quality of ESL

compositions. Journal of Second Language Writing, 4, 139-155.

Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data.

Cambridge, MA: MIT Press.

Evola, J., Mamer, E., & Lentz, B. (1980). Discrete point versus global scoring for

cohesive devices. In J. W. Oller & K. Perkins (Ed.), Research in language testing

(pp. 177-181). Rowley, MA: Newbury House.


Fathman, A. K., & Whalley, E. (1990). Teacher response to student writing: Focus on

form versus content. In B. Kroll (Ed.), Second language writing: Research

insights for the classroom (pp. 178-190). Cambridge, UK: Cambridge University

Press.

Ferris, D. (1995). Student reactions to teacher response in multiple-draft composition

classrooms. TESOL Quarterly, 29, 33-53.

Ferris, D. (2003). Responding to writing. In. B. Kroll (Ed.), Exploring the dynamics of

second language writing (pp. 119-140). New York: Cambridge University Press.

Ferris, D., & B. Roberts (2001). Error feedback in L2 writing classes: How explicit does

it need to be? Journal of Second Language Writing, 10, 161-184.

Figueras, N., North, B., Takala, S., Verhelst, N., & Avermaet, P. (2005). Relating

examinations to the common European framework: A manual. Language Testing,

22, 261-279.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational

research. Acta Psychologia, 37, 359–374.

Fischer, R. A. (1984). Testing written communicative competence in French. The Modern

Language Journal, 68, 13-20.

Fitzgerald, J., & Spiegel, D. L. (1986). Textual cohesion and coherence in children‟s

writing. Research in the Teaching of English, 20, 263-280.

Flower, L., & Hayes, J. (1981). A cognitive process theory of writing. College

Composition and Communication, 32, 365-387.

Foster, P., & Skehan, P. (1996). The influence of planning and task type on second

language performance. Studies in Second Language Acquisition, 18, 299-323.

Fournier, P. (2003). Blueprints: A guide to correct writing. Saint-Laurent, Quebec:

Pearson Longman.

Friedlander, A. (1990). Composing in English: Effects of a first language on writing in

English as a second language. In B. Kroll (Ed.), Second language writing:

Research insight for the classroom (pp. 109-125). Cambridge: Cambridge

University Press.

Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. English

Language Teaching Journal, 41, 287-291.

Fulcher, G. (1993). The construction and validation of rating scales for oral tests in

English as a foreign language. Unpublished doctoral dissertation, University of

Lancaster, UK.


Fulcher, G. (1996a). Invalidating validity claims for the ACTFL oral rating scale. System,

24, 163-172.

Fulcher, G. (1996b). Does thick description lead to smart tests? A data-based approach to

rating scale construction. Language Testing, 13, 208-238.

Fulcher, G. (1997). The testing of L2 speaking. In C. Clapham & D. Corson (Ed.),

Encyclopedia of language and education: Volume 7 Language testing and

assessment (pp. 75-85). London: Kluwer.

Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.

Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced

resource book. London & New York: Routledge.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis.

London: Chapman and Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple

sequences. Statistical Science, 7, 457-511.

Glaser, B., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for

qualitative research. Chicago; IL: Aldine.

Goulden, N. R. (1992). Theory and vocabulary for communication assessments.

Communication Education, 41, 258-269.

Goulden, N. R. (1994). Relationship of analytic and holistic methods to rater‟s scores for

speeches. The Journal of Research and Development in Education, 27, 73-82.

Grabe, W. (2001). Notes toward a theory of second language writing. In T. Silva & P.

Matsuda (Ed.) On second language writing (pp. 39-57). Mahwah, NJ: Lawrence

Erlbaum.

Grabe, W., & Kaplan, R. (1996). Theory and practice of writing. New York: Longman.

Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework

for mixed-method evaluation design. Educational Evaluation and Policy

Analysis, 11, 255-74.

Grove, E., & Brown, A. (2001). Tasks and criteria in a test of oral communication skills

for first-year health science students. Melbourne Papers in Language Testing, 10,

37-47.

Haertel, E. H. (1989). Using restricted latent class models to map skill structure of

achievement items. Journal of Educational Measurement, 26, 301-321.

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.


Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.),

Assessing second language writing in academic contexts (pp. 241- 276).

Norwood, NJ: Ablex.

Hamp-Lyons, L. (1995). Rating nonnative writing: The trouble with holistic scoring.

TESOL Quarterly, 29, 759-762.

Hamp-Lyons, L., & Henning, G. (1991). Communicative writing profiles: An

investigation of the transferability of a multiple-trait scoring instrument across

ESL writing assessment contexts. Language Learning, 41, 337-373.

Hamp-Lyons, L., & Kroll, B. (1997). TOEFL 2000 writing: Composition, community,

and assessment. TOEFL Monograph Series Report No. 5. Princeton, NJ:

Educational Testing Service.

Harley, B., & King, M. L. (1989). Verb lexis in the written composition of young L2

learners. Studies in Second Language Acquisition, 11, 415-440.

Hartz, S. M. (2002). A Bayesian framework for the Unified Model for assessing cognitive

abilities: Blending theory with practicality. Unpublished doctoral dissertation.

University of Illinois at Urbana Champaign.

Hartz, S. M., Roussos, L., & Stout, W. (2002). Skills diagnosis: Theory and practice.

Unpublished manuscript. University of Illinois at Urbana Champaign.

Hayes, J. R. (1996). A new framework for understanding cognition and affect in writing.

In C. M. Levy & S. Ransdell (Ed.), The science of writing (pp. 1-27). Mahwah,

NJ: Lawrence Erlbaum Associates.

Hedgcock, J., & Lefkowitz, N. (1994). Feedback on feedback: Assessing learner

receptivity to teacher response in L2 composing. Journal of Second Language

Writing, 3, 141-163.

Hendrickson, J. M. (1980). Error correction in foreign language teaching: Recent theory,

research, and practice. In K. Croft (Ed.), Readings on English as a second

language. Cambridge, Mass: Winthrop Publishers.

Hinkel, E. (2003). Simplicity without elegance: Features of sentences in L1 and L2

academic texts. TESOL Quarterly, 37, 275-300.

Hinkel, E. (2004). Teaching academic writing: Practical techniques in vocabulary and

grammar. Mahwah: Lawrence Erlbaum Associates.

Homburg, T. (1984). Holistic evaluation of ESL compositions: Can it be validated

objectively? TESOL Quarterly, 18, 87-107.

House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage.


Hughey, J. B., Wormuth, D. R., Hartfiel, V. F., & Jacobs, H. L. (1983). Teaching ESL

composition: Principles and techniques. Rowley, MA: Newbury House.

Hunt, K. W. (1970). Recent measures in syntactic development. In M. Lester (Ed.),

Readings in applied transformation grammar (pp. 179-192). New York: Holt,

Rinehart and Winston.

Huot, B. (1996). Toward a new theory of writing assessment. College Composition and

Communication, 47, 549-566.

Hyland, F. (1998). The impact of teacher written feedback on individual writers. Journal

of Second Language Writing, 7, 255-286.

Hyland, K. & Hyland, F. (2006). Feedback on second language students‟ writing.

Language Teaching, 3, 83-101.

Ingram, D. E. (1984). Introduction to the ASLPR. In Commonwealth of Australia,

Department of Immigration and Ethnic Affairs, Australian Second Language

Proficiency Ratings (pp. 1-29). Canberra: Australia Government Publishing

Service.

Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor

ESL essays. Journal of Second Language Writing, 4, 253-272.

Ishikawa, S. (1995). Objective measurement of low-proficiency EFL narrative writing.

Journal of Second Language Writing, 4, 51-70.

Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL

composition: A practical approach. Rowley, MA: Newbury House.

Jafarpur, A. (1991). Cohesiveness as a basis for evaluating compositions. System, 19,

459-465.

Jang, E. E. (2005). A validity narrative: the effects of cognitive reading skills diagnosis

on ESL adult learners‟ reading comprehension ability in the context of Next

Generation TOEFL. Unpublished doctoral dissertation. University of Illinois at

Urbana Champaign.

Jang, E. E. (2008). A framework for cognitive diagnostic assessment. In C. A. Chapelle,

Y.‐R. Chung, & J. Xu (Ed.), Towards adaptive CALL: Natural language

processing for diagnostic language assessment (pp. 117‐131). Ames, IA: Iowa

State University.

Jang, E. E. (2009a). Cognitive diagnostic assessment of L2 reading comprehension

ability: Validity arguments for Fusion Model application to LanguEdge

assessment. Language Testing, 26, 31-73.

Jang, E. E. (2009b). Demystifying a Q-Matrix for making diagnostic inferences about L2


reading skills. Language Assessment Quarterly, 6, 210-238.

Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions,

and connections with nonparametric item response theory. Applied

Psychological Measurement, 25, 258–272.

Kameen, P. (1979). Syntactic skill and ESL writing quality. In C. Yorio, K. Perkins, & J.

Schachter (Ed.), On TESOL 79: The learner in focus (pp. 343-364). Washington,

DC: TESOL.

Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112,

527-535.

Kane, M. (1994). Validating interpretive arguments for licensure and certification

examinations. Evaluation and the Health Professions, 17, 133-159.

Kane, M. (2001). Current concerns in validity theory. Journal of Educational

Measurement, 38, 319-342.

Kane, M. (2002). Validating high-stakes testing programs. Educational Measurement:

Issues and Practice, 21, 31-41.

Kane, M. (2004). Certification testing as an illustration of argument-based validation.

Measurement: Interdisciplinary Research and Perspectives, 2, 135-170.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance.

Educational Measurement: Issues and Practice, 18, 5-17.

Kasai, M. (1997). Application of the rule space model to the reading comprehension

section of the test of English as a foreign language (TOEFL). Unpublished

doctoral dissertation. University of Illinois at Urbana Champaign.

Kellogg, R. (1996). A model of working memory in writing. In C. M. Levy & S. Ransdell

(Ed.), The science of writing (pp. 57-71). Mahwah, NJ: Lawrence Erlbaum

Associates.

Kepner, C. (1991). An experiment in the relationship of types of written feedback to the

development of second-language writing skills. The Modern Language Journal,

75, 305-313.

Knoch, U. (2007). Diagnostic writing assessment: The development and validation of a

rating scale. Unpublished doctoral dissertation. University of Auckland. New

Zealand.

Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Ed.), European

language testing in a global context (pp. 27-48). Cambridge, UK: Cambridge

University Press.


Kunnan, A. J., & Jang, E. E. (2009). Diagnostic feedback in language assessment. In M.

Long & C. Doughty (Eds.), Handbook of second and foreign language teaching

(pp. 610–625). Walden, MA: Wiley-Blackwell.

Lantolf, J. P., & Frawley, W. (1985). Oral proficiency testing: A critical analysis. The

Modern Language Journal, 69, 337-345.

Larsen-Freeman, D. (1978). An ESL index of development. TESOL Quarterly, 12, 439-

448.

Larsen-Freeman, D. (1983). Assessing global second language proficiency. In H. W.

Seliger & M. Long (Ed.), Classroom-oriented research in second language

acquisition (pp. 287-304). Rowley, MA: Newbury House.

Larsen-Freeman, D., & Strom, V. (1977). The construction of a second language

acquisition index of development. Language Learning, 27, 123-134.

Laufer, B. (1991). The development of L2 lexis in the expression of the advanced learner.

The Modern Language Journal, 75, 440-448.

Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written

production. Applied Linguistics, 16, 307-322.

Lautamatti, L. (1978). Observations on the development of the topic in simplified

discourse. In V. Kohonen & N. E. Enkvist (Ed.), Text linguistics, Cognitive

learning, and language teaching (pp. 71-104). Turku, Finland: AFinLA.

Lautamatti, L. (1987). Observations on the development of the topic of simplified

discourse. In U. Connor & R. B. Kaplan (Ed.), Writing across languages:

Analysis of L2 text. Reading. MA: Addison-Wesley.

Lee, I. (2004). Error correction in L2 secondary writing classrooms: The case of Hong

Kong. Journal of Second Language Writing, 13, 285-312.

Lee, J., & Musumeci, D. (1988). On hierarchies of reading skills and text types. The

Modern Language Journal, 72, 173-187.

Lee, Y-W., & Sawaki, Y. (2009a). Application of three cognitive diagnosis models to

ESL reading and listening assessments. Language Assessment Quarterly, 6, 239-

263.

Lee, Y-W., & Sawaki, Y. (2009b). Cognitive diagnosis approaches to language

assessment: An overview. Language Assessment Quarterly, 6, 172-189.

Leighton, J. P., & Gierl, M. J. (Ed.). (2007). Cognitive diagnostic assessment for

education: Theory and practices. Cambridge: Cambridge University Press.

Leki, I. (1991). The preferences of ESL students for error correction in college level


writing classes. Foreign Language Annals, 24, 203-218.

Leki, I. (2006). “You cannot ignore”: L2 graduate students‟ response to discipline-based

written feedback. In K. Hyland & F. Hyland (Ed.), Feedback in second

language writing: Contexts and issues (pp. 266-285). New York: Cambridge.

Leki, I., & Carson, J. G. (1994). Students‟ perceptions of EAP writing instruction and

writing needs across the disciplines. TESOL Quarterly, 28, 81-101.

Leki, I., Cumming, A., & Silva, T. (2008). A synthesis of research on second language

writing in English. New York, NY: Routledge.

Leśniewska, J. (2006). Collocations and second language use. Studia Linguistica, 123,

95-105.

Linacre, J. M. (2009). A user‟s guide to Facets: Rasch-model computer programs.

Version3.66.0 [Computer software and manual]. Retrieved October 21, 2009,

from www.winsteps.com.

Linnarud, M. (1986). Lexis in composition: A performance analysis of Swedish learners‟

written English. Malmö: CWK Gleerup.

Liskin-Gasparro, J. (1984). The ACTFL guidelines: A historical perspective. In T. V.

Higgs (Ed.), Teaching for proficiency: The organizing principle (pp. 11-42).

Lincolnwood, IL: National Textbook.

Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper & L. Odell (Ed.),

Evaluating writing (pp. 33-66). New York: National Council of Teachers of

English.

Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really

mean to the raters? Language Testing, 19, 246-276.

Lumley, T. (2005). Assessing second language writing: The raters‟ perspective. Frankfurt:

Peter Lang.

Lunz, M. E., & Stahl, J. A. (1990). Judge severity and consistency across grading periods.

Evaluation and the Health Professions, 13, 425-444.

Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.

Lynch, B. K. (2001). Rethinking assessment from a critical perspective. Language

Testing, 18, 351-372.

Matthews, M. (1990). The measurement of productive skills: Doubts concerning the

assessment criteria of certain public examinations. ELT Journal, 44, 117-121.

McCarthy, M. (1990). Vocabulary. Oxford: Oxford University Press.


McClure, E. (1991). A comparison of lexical strategies in L1 and L2 written English

narratives. Pragmatics and Language Learning, 2, 141-154.

McCulley, G. A. (1985). Writing quality, coherence, and cohesion. Research in the

Teaching of English, 19, 269-282.

McNamara, T. F. (1996). Measuring second language performance. London: Longman.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd

ed.) (pp. 13-

103). New York: American Council on Education and Macmillan.

Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision making

behavior of composition markers. In M. Milanovic & N. Saville (Ed.), Studies in

language testing 3: Performance testing, cognition and assessment (pp. 92-111).

Cambridge: Cambridge University Press.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-

based language assessment, Language Testing, 19, 477-496.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational

assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.

Monroe, J. H. (1975). Measuring and enhancing syntactic fluency in French. The French

Review, 48, 1023-1031.

Mullen, K.A. (1977). Using rater judgements in the evaluation of writing proficiency for

nonnative speakers of English. In H. D. Brown, C. A. Yorio., & R. H. Crymes

(Ed.), On TESOL 77: Teaching and learning English as a second language:

Trends in research and practice (pp. 309-320). Washington, D.C.: TESOL.

Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability within the test

of spoken English Assessment System (Research Report No. 00-06). Princeton,

NJ: Educational Testing Service, Center for Performance Assessment.

Myford, C. M., & Wolfe, E. W. (2004a). Detecting and measuring rater effects using

many-facet Rasch measurement: Part I. In Smith, Jr., E. V. & Smith, R. M. (Ed.),

Introduction to Rasch measurement (pp. 460-517). Maple Grove, MN: JAM

Press.

Myford, C. M., & Wolfe, E. W. (2004b). Detecting and measuring rater effects using

many-facet Rasch measurement: Part II. In Smith, Jr., E. V. & Smith, R. M. (Ed.),

Introduction to Rasch measurement (pp. 518-574). Maple Grove, MN: JAM

Press.

Nas, G. (1975). Determining the communicative value of written discourse produced by

L2 learners. Utrecht, The Netherlands: Institute of Applied Linguistics.

Neuner, J. L. (1987). Cohesive ties and chains in good and poor freshman essays.


Research in the Teaching of English, 21, 92-105.

Nichols, P. D. (1994). A framework for developing cognitively diagnostic assessments.

Review of Educational Research, 64, 575-603.

Nichols, P. D., Chipman, S. F., & Brennan, R. L. (Ed.). (1995). Cognitively diagnostic

assessment. NJ: Lawrence Erlbaum.

North, B. (1993). The development of descriptors on scales of language proficiency.

Washington, DC: National Foreign Language Center.

North, B. (1994). Scales of language proficiency: A survey of some existing systems.

Strasbourge: Council of Europe.

North, B. (1995). The development of a common framework scale of descriptors of

language proficiency based on a theory of measurement. System, 23, 445-465.

North, B. (1996). The development of a common framework scale of descriptors of

language proficiency based on a theory of measurement. Unpublished doctoral

dissertation, Thames Valley University.

North, B. (2000). The development of a common framework scale of language

proficiency. Oxford: Peter Lang.

North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales.

Language Testing, 15, 217-263.

Omaggio Hadley, A. (1993). Teaching language in context (2nd

Ed.). Boston, Mass.:

Heinle & Heinle.

Pellegrino, J. W., & Chudowsky, N. (2003). The foundations of assessment.

Measurement: Interdisciplinary Research and perspectives, 1, 103-148.

Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The

science and design of educational assessment. Washington, DC: National

Academy Press.

Perkins, K. (1980). Using objective methods of attained writing proficiency to

discriminating among holistic evaluations. TESOL Quarterly, 14, 61-69.

Perkins, K. (1983). On the use of composition scoring techniques, objective measures,

and objective tests to evaluate ESL writing ability. TESOL Quarterly, 17, 651-

671.

Péry-Woodley, M-P. (1991). Writing in L1 and L2: Analysing and evaluating learners'

texts. Language Teaching. 24, 69-83.

Pienemann, M., Johnson, M., & Brindley, G. (1988). Constructing an acquisition-based


procedure for second language assessment. Studies in Second Language

Acquisition, 10, 217-243.

Polio, C. (1997). Measures of linguistic accuracy in second language writing research.

Language Learning, 47, 101-143.

Polio, C. (2001). Research methodology in second language writing: The case of text-

based studies. In T. Silva and P. Matsuda. (Ed.) On second language writing (p.

91-116). Mahwah, NJ: Erlbaum.

Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to? In M. Milanovic,

& N. Saville (Ed.), Studies in language testing 3: Performance testing, cognition

and assessment (pp. 74-91). Cambridge: Cambridge University Press.

QSR. (2008). NVivo 8: Getting started. QSR International.

Raimes, A. (1985). What unskilled ESL students do as they write: A classroom study of

composing. TESOL Quarterly, 19, 229-258.

Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.

Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that

measure more than one dimension. Applied Psychological Measurement, 15,

361-373.

Reid, J. (1992). A computer text analysis of four cohesion devices in English discourse

by native and nonnative writers. Journal of Second Language Writing, 1, 79-107.

Roussos, L. A., Stout, W., & Marden, J. (1998). Using new proximity measures with

hierarchical cluster analysis to detect multidimensionality. Journal of

Educational Measurement, 35, 1-30.

Roussos, L. A., Templin, J. L., & Henson, R. A. (2007a). Skills diagnosis using IRT-

based latent class models. Journal of Educational Measurement, 44, 293-311.

Roussos, L. A., DiBello, L. V., Stout, W., Hartz, S. M., Henson, R. A., & Templin, J. L.

(2007b). The fusion model skills diagnosis system. In J. P. Leighton & M. J.

Gierl. (Ed.), Cognitive diagnostic assessment for education: Theory and practice

(pp. 275-318). Cambridge: Cambridge University Press.

Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How

raters evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in

language assessment: Selected papers from the 19th Language Testing Research

Colloquium (pp. 129-152), Orlando, Florida. Cambridge: Cambridge University

Press.

Sawaki, Y., Kim, H-J., & Gentile, C. (2009). Q-Matrix construction: Defining the link

between constructs and test items in large-scale reading and listening


comprehension assessments. Language Assessment Quarterly, 6, 190-209.

Scardamalia, M., & Bereiter, C. (1987). Knowledge telling and knowledge transforming

in written composition. In S. Rosenberg (Ed.), Advances in applied

psycholinguistics, Volume 2: Reading, writing, and language learning (pp. 142-

175). Cambridge: Cambridge University Press.

Schneider, M., & Connor, U. (1990). Analyzing topical structure in ESL essays. Studies

in Second Language Acquisition, 12, 411-427.

Shaw, P., & Liu, E. (1998). What develops in the development of second-language

writing? Applied Linguistics, 19, 225-254.

Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of

research in education (pp. 405-450). Washington, DC: American Educational Research

Association.

Shohamy, E. (1992). Beyond proficiency testing: A diagnostic feedback testing model for

assessing foreign language learning. The Modern Language Journal, 76, 513-

521.

Silva, T. (1990). Second language composition instruction: Developments, issues, and

directions in ESL. In B. Kroll (Ed.), Second language writing (pp. 11-23).

Cambridge: Cambridge University Press.

Silva, T. (1992). L1 vs L2 writing: ESL graduate students‟ perceptions. TESL Canada

Journal, 10, 27-47.

Smith, D. (2000). Rater judgments in the direct assessment of competency-based second

language writing ability. In G. Brindley (Ed.), Studies in immigrant English

language assessment (pp. 159-189). Sydney, Australia: National Centre for

English Language Teaching and Research, Macquarie University.

Smith, P. C., & Kendall, J. M. (1963). Retranslation of expectations: An approach to the

construction of unambiguous anchors for rating scales. Journal of Applied

Psychology, 47, 149-155.

Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for

educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-332). New York: Macmillan.

Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic

relationships. University of Kansas Science Bulletin, 38, 1409-1438.

Sperber, D., & Wilson, D. (1986). Relevance. Oxford: Harvard University Press.

Sperling, M. (1996). Revisiting the writing-speaking connection: Challenges for research

on writing and writing instruction. Review of Educational Research, 66, 53-86.


Spolsky, B. (1990). Social aspects of individual assessment. In J. de Jong & D. K.

Stevenson (Ed.), Individualizing the assessment of language abilities (pp. 3-15).

Avon: Multilingual Matters.

Spolsky, B. (1992). The gentle art of diagnostic testing revisited. In E. Shohamy & A. R.

Walton (Ed.), Language assessment for feedback: Testing and other strategies

(pp. 29-41). Dubuque, IA: Kendall/Hunt.

Stout, W., Froelich, A., & Gao, F. (2001). Using resampling methods to produce an

improved DIMTEST procedure. In A. Boomsma, M. A. J. van Duijn, & T. A. B.

Snijders (Ed.), Essays on item response theory (pp. 357-376). New York:

Springer-Verlag.

Straub, R. (1997). Students‟ reactions to teacher comments: An exploratory study.

Research in the Teaching of English, 31, 91-119.

Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge:

Cambridge University Press.

Sympson, J. B. (1977). A model for testing with multidimensional items. In Weiss, D. J.

(Ed.), Proceedings of the 1977 computerized adaptive testing conference (pp. 82-

88). University of Minnesota, Department of Psychology, Psychometric Methods

Program, Minneapolis.

Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based

on item response theory. Journal of Educational Measurement, 20, 345-354.

Tatsuoka, K. K. (1990). Toward an integration of item-response theory and cognitive

error diagnosis. In N. Fredrickson, R. L. Glaser, A. M. Lesgold, & M. G. Shafto

(Ed.), Diagnostic monitoring of skills and knowledge acquisition (pp. 453-488).

Hillsdale, NJ: Erlbaum.

Tatsuoka, K. K. (1993). Item construction and psychometric models appropriate for

constructed responses. In R. E. Bennett & W. C. Ward (Ed.), Construction

versus choice in cognitive measurement (pp. 107-133). Hillsdale, NJ: Erlbaum.

Tatsuoka, K. K. (1995). Architecture of knowledge structures and cognitive diagnosis: A

statistical pattern recognition and classification approach. In P. D. Nichols, S. F.

Chipman, & R. L. Brennan (Ed.), Cognitively diagnostic assessment (pp. 327-

359). Hillsdale, NJ: Erlbaum.

Taylor, J. (1993). Prepositions: Patterns of polysemization and strategies of

disambiguation. In C. Zelinsky-Wibbelt (Ed.), Natural language processing (pp.

151-175). The Hague: Mouton de Gruyter.

Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using

cognitive diagnosis models. Psychological Methods, 11, 287-305.


Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago

Press.

Tierney, R., & Mosenthal, J. (1983). Cohesion and textual coherence. Research in the

Teaching of English, 17, 215-229.

Toulmin, S. E. (2003). The uses of argument (Updated ed.). Cambridge: Cambridge

University Press.

Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying

salient features for second language performance assessment. The Canadian

Modern Language Review, 56, 555-584.

Turner, C. E., & Upshur, J. A. (1996). Developing rating scales for the assessment of

second language performance. In G. Wigglesworth & C. Elder (Ed.), The

language testing cycle: From inception to washback (pp. 55-79). Melbourne:

Australian Review of Applied Linguistics.

Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples:

Effects of the scale maker and the student sample on scale content and student

scores. TESOL Quarterly, 36, 49-70.

Underhill, N. (1987). Testing spoken language: A handbook of oral testing techniques.

Cambridge: Cambridge University Press.

University of Cambridge, British Council, & IELTS Australia. (2007). IELTS Handbook

2007. University of Cambridge, British Council, and IELTS Australia: UK.

University of Michigan. (2003). Michigan English Language Assessment Battery:

Technical Manual 2003. Testing and Certificate Division, English Language

Institute, University of Michigan, Ann Arbor.

Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language

tests. ELT Journal, 49, 3-12.

Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language

speaking ability: Test method and learner discourse. Language Testing, 16, 82-

111.

Vande Kopple, W. J. (1985). Some exploratory discourse on metadiscourse. College

Composition and Communication, 36, 82-93.

Vann, R. J. (1979). Oral and written syntactic relationships in second language learning.

In C. Yorio, K. Perkins, & J. Schachter (Eds.), On TESOL 79: The learner in

focus (pp. 322-329). Washington, DC: TESOL.

Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-

Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111-


125). Norwood, New Jersey: Ablex Publishing Corporation.

von Davier, M. (2005). A General diagnostic model applied to language testing data.

ETS Research Report RR-05-16. Princeton, NJ: ETS.

Waller, T. (1993). Characteristics of near-native proficiency in writing. In H. Ringbom

(Ed.), Near-native proficiency in English (pp. 183-293). Åbo: Åbo Akademi

University.

Watanabe, Y. (2004). Methodology in washback studies. In L. Cheng, Y. Watanabe, & A.

Curtis (Ed.), Washback in language testing: Research contexts and methods (pp.

19-36). Mahwah, NJ: Lawrence Erlbaum Associates.

Watson Todd, R. (1998). Topic-based analysis of classroom discourse. System, 26, 303-

318.

Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring the coherence

of writing using topic-based analysis. Assessing Writing, 9, 85-104.

Weigle, S. C. (2002). Assessing writing. New York: Cambridge University Press.

White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-Bass.

Witte, S. (1983a). Topical structure analysis and revision: An exploratory study. College

Composition and Communication, 34, 313-341.

Witte, S. (1983b). Topical structure and writing quality: Some possible text-based

explanation of readers‟ judgments of student writing. Visible Language, 17, 177-

205.

Witte, S., & Faigley, L. (1981). Cohesion, coherence and writing quality. College

Composition and Communication, 32, 189-204.

Wolfe, E. W., Chiu, C. W. T., & Myford, C. M. (1999). The manifestation of common

rater effects in multi-faceted Rasch analyses (Monograph Series No. 97-20).

Princeton, NJ: Educational Testing Service, Center for Performance Assessment.

Wolfe-Quintero, K., Inagaki, S., & Kim, H-Y. (1998). Second language development in

writing: Measures of fluency, accuracy and complexity. Technical Report No. 17.

Honolulu, HI: University of Hawai'i Press.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch

Measurement: Transactions of the Rasch Measurement SIG, 8, 370.

Yamamoto, K., & Gitomer, D. (1993). Application of a HYBRID model to a test of

cognitive skill representation. In N. Frederiksen, R. Mislevy, & I. Bejar (Eds.),

Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.


Yule, G. (1985). The study of language. Cambridge: Cambridge University Press.

Zhang, S. (1995). Re-examining the affective advantages of peer feedback in the ESL

writing class. Journal of Second Language Writing, 4, 209-222.

Zhang, J., & Stout, W. (1999). The theoretical DETECT index of dimensionality and its

application to approximate simple structure. Psychometrika, 64, 213-249.


APPENDIX A

DEFINITIONS OF KEY TERMS

Analytic scoring: a type of marking procedure in which raters award separate

subscores to diverse features of test performance.

Classical test theory (CTT): a measurement theory that assumes an examinee's true test score would be obtained if no measurement error existed.

Cognitive Diagnostic Assessment (CDA): a type of assessment that measures the

specific cognitive knowledge structures of examinees in order to provide detailed

diagnostic information about their strengths and weaknesses.

Coherence: an aspect of discourse competence associated with organizing ideas

in spoken or written text.

Cohesion: an aspect of discourse competence associated with explicit linguistic

cues of semantic relationships in spoken or written text.

Construct: the ability or trait to be measured by a test.

DIALANG: an online diagnostic language assessment system that assesses five

aspects of language knowledge (reading, listening, writing, grammar and vocabulary) in

14 European languages.

Dimensionality: the number and nature of the latent dimensions underlying the item responses derived from a test.

Discourse analysis: a linguistic approach to analyzing spoken or written text.

Formative assessment: a type of assessment intended to provide students and

teachers with immediate feedback that can improve teaching and learning during the

period of instruction.

Holistic scoring: a type of marking procedure in which raters award a single

composite score to the overall quality of test performance.

Item response theory (IRT): a measurement theory that estimates examinees' latent ability from their item responses, modeling the probability of a response as a function of ability and item properties.
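To make the contrast between these two measurement theories concrete, the following minimal Python sketch is an illustration only (it is not part of the study's analyses); the score, ability, and difficulty values are hypothetical.

```python
import math

# Classical test theory: an observed score is modeled as true score plus error;
# the true score would be recovered only if the error term were zero.
def observed_score(true_score: float, error: float) -> float:
    return true_score + error

# Item response theory (Rasch model): the probability of a correct response is a
# function of the gap between examinee ability (theta) and item difficulty (b).
def rasch_probability(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(observed_score(true_score=28.0, error=-1.5))     # 26.5
print(round(rasch_probability(theta=0.5, b=-0.2), 3))  # about 0.668
```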

Objective measures: a means of quantifying observable characteristics or

qualities of speaking or writing performance by tallying the frequencies or calculating the

ratios of certain linguistic features that occur in a spoken or written corpus.
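As an illustration only (the study's actual text-analysis tools are not specified in this appendix), the short sketch below tallies a few such frequency-based measures for a sample sentence: word tokens, word types, and lexical density computed as the proportion of content words. The list of function words is abbreviated and hypothetical.

```python
import re

# Abbreviated, illustrative list of function (non-content) words.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
                  "on", "is", "are", "was", "were", "it", "that", "this",
                  "with", "for", "as", "we", "you"}

def objective_measures(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        "words": len(tokens),
        "word types": len(types),
        "type-token ratio": round(len(types) / len(tokens), 2),
        "lexical density": round(len(content_words) / len(tokens), 2),
    }

print(objective_measures("We study to acquire the skills that help us work well."))
```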

Organizational knowledge: an ability to control the structure of spoken or written

text using grammatical and textual knowledge.

Primary trait scoring: a type of marking procedure in which raters focus on the

particular writing traits in a writing task that are considered important within a specific


context.

Profiling: a way of reporting test results in the form of comprehensive and

accessible descriptions.

Q-matrix: an incidence matrix that represents the relationship between skills and

items in a test.
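As a concrete illustration of the idea, a Q-matrix can be stored as a binary array whose rows are items and whose columns are skills, with a 1 marking that an item requires a skill. The item-skill assignments below are invented for the example and are not the study's specification.

```python
# Columns = skills; rows = items. A 1 means the item requires that skill.
skills = ["content", "organization", "grammar", "vocabulary", "mechanics"]
q_matrix = [
    [1, 1, 0, 0, 0],  # item 1 taps content and organization
    [0, 0, 1, 0, 0],  # item 2 taps grammar only
    [0, 0, 0, 1, 0],  # item 3 taps vocabulary only
    [0, 0, 1, 0, 1],  # item 4 taps grammar and mechanics
]

def skills_for_item(item_index: int) -> list:
    """Names of the skills required by a given item (0-based index)."""
    return [s for s, flag in zip(skills, q_matrix[item_index]) if flag]

print(skills_for_item(3))  # ['grammar', 'mechanics']
```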

Skill: an aspect of underlying ability characterizing the construct to be measured.

Sociolinguistic knowledge: the ability to produce and interpret spoken and

written text that is appropriate in a particular language use context.

Summative assessment: a type of assessment intended to provide assessment

outcomes to internal and external stakeholders at the end of a period of instruction, for

accountability purposes.

Test of English as a Foreign Language (TOEFL): a standardized English

proficiency test intended to assess examinees' ability to communicate effectively in

English in an academic context.

Think-aloud protocol: a research method involving participants thinking aloud

while they are performing a given task.

Washback: the positive or negative effect of testing on instruction.


APPENDIX B

ESL TEACHER PROFILE

Name | Age | Gender | Postgrad studies | Prof. certificate | ESL teaching (yrs) | ESL writing teaching (yrs) | Familiar with ESL writing | Training in ESL writing assessment | Competent in assessing ESL writing | Assessment experience (yrs) | Study participation (phase)

Angelina 30-39 F None TESL 7 6 Extremely No Very 5 2 (P, M)*

Ann above 50 F MA TESL 25 25 Extremely Yes Very 9 1 & 2 (P, M)

Beth 40-49 F None TESL 15 15 Extremely Yes Very 4 1 & 2 (P, M)

Brad 30-39 M MA None 10 7 Very Yes Very 3 2 (P, M)

Erin 30-39 F MA TESL 10 10 Extremely Yes Extremely 5 2 (M)

Esther 40-49 F MA TESL 7 5 Extremely Yes Very 2 1 & 2 (P)

George 40-49 M PhD TESL 14 14 Very No Very 10 1

Greg 30-39 M MA TESL 3 2 Extremely No Extremely 1 2 (M)

James 30-39 M MA None 5 2 Very No Very 1 1

Judy 40-49 F MA None 10 8 Extremely Yes Very 8 1

Kara 40-49 F None TESL 11 3 Extremely Yes Very 3 2 (M)

Sarah 40-49 F MA TESL 12 5 Extremely Yes Very 2 1 & 2 (M)

Shelley above 50 F MA TESL 7 2 Extremely Yes Extremely 2 1

Susan 40-49 F MA TESL 12 10 Extremely Yes Very 6 2 (P, M)

Tim above 50 M None TESL 6 6 Extremely Yes Extremely 4 1

Tom 30-39 M None TESL 9 9 Extremely No Very 2 2 (P, M)

Note. “P” refers to the pilot study, while “M” refers to the main study.


APPENDIX C

GUIDELINES FOR A THINK-ALOUD SESSION

Warm-up

Thank you for your interest in this study. I am conducting a study that examines

an effective way to provide diagnostic feedback to ESL writers on a timed essay test. Due

to the complex nature of second language writing, ESL learners need to be well informed

about the strengths and weaknesses in their writing. Despite the interest in and need for a

diagnostic approach in ESL writing instruction and assessment, little is known about

what kinds of cognitive and linguistic skills or strategies must be diagnosed, and in what

ways. It is thus critical to have a good understanding of what ESL teachers think while

providing diagnostic feedback on students' ESL writing. This information will ultimately

enable detailed diagnostic description to be tailored to and made available to individual

ESL writers.

In this session, I would like to gather information about the ways in which you

provide diagnostic feedback on ESL timed essays. In particular, I am interested in the

writing skills and strategies that you attend to while assessing and providing feedback on

ESL essays. I will explain in greater detail what I would like you to do during this

session. I will give you a package of 10 essays and a copy of essay prompts. These

essays were written by adult ESL learners with a wide range of English proficiency in a

large-scale testing setting within 30 minutes. The two essay prompts were:

(a) Do you agree or disagree with the following statement? It is more important to

choose to study subjects you are interested in than to choose subjects to prepare

for a job or career. Use specific reasons and examples to support your answer.

(b) Do you agree or disagree with the following statement? In today's world, the

ability to cooperate well with others is far more important than it was in the past.

Use specific reasons and examples to support your answer.

Five essays were written on each prompt. Once you have a general understanding

of the essays and prompts, I would like to ask you to say aloud what you are thinking

about as you are providing diagnostic feedback on each essay. To facilitate thinking

aloud, you may want to write something down (e.g., comments or corrections). If you

do so, say aloud what you are writing. You may read the essays silently or aloud,


according to what best suits you. If you are reading silently, indicate which part of the

essay you are reading. I would also like to ask you to assign each essay a mark from 1 to 5,

where 5 is the most proficient and 1 is the least proficient. Please say in as much detail as

possible what you are thinking about while you provide feedback on the essays. Do you

understand what I want you to do?

Great! The most important thing in doing this task is to think aloud constantly

while you are reading and providing feedback on the essay. I don't want you to plan out

what you are going to say or explain to me what you are saying. Just act as if you are

alone in the room speaking to yourself. I would like to emphasize that it is important for

you to keep talking without a long interval of silence. If you are silent for any length of

time, I will remind you to keep thinking aloud. Is this clear to you?

Before we proceed with the main task, I will give you a practice exercise to help

you familiarize yourself with the think-aloud procedure. I would like you to multiply

these two numbers and tell me what you are thinking to produce an answer.

“What is the result of 12 x 21?”

Good! Now, I am going to give you this package. In it, you will find 10 essays

and a copy of the two essay prompts. (Give the material to the teacher.) Do you have any

questions about the procedure before you begin?

Intervention Prompts

What are some of the issues that come to your mind regarding this essay?

Okay, now tell me what you are thinking as you are reading and providing feedback

on the essay.

What other things are you thinking? (Without any intervention, let the teacher finish

thinking aloud.)

Keep talking.

While you were thinking aloud, you said XXX. Can you elaborate a bit more on your

thought process?


Follow-up Interview Questions

Did you have any problems with thinking aloud your thoughts?

What skills or strategies do you think are important in ESL writing?

How have you provided your students with feedback on their ESL writing?

What skills or strategies do you teach to improve students' ESL writing proficiency?

Background Questionnaire

Before closing this session, I would like to collect your background information.

Your answers to this questionnaire will help me better understand your teaching and

evaluation methods in ESL academic writing. All information will remain confidential,

and will be used for research purposes only. Do you have any questions? (Give a

questionnaire to the teacher).

I. Personal Profile

1. Age: 20 – 29 30 – 39 40 – 49 above 50

2. Gender: Male Female

3. First language(s):

If your first language is not English, please specify the other language(s) you speak at

home and at the workplace:

(a) at home: (b) at the workplace:

4. Educational background: (Please specify subject areas)

B.A. in

M.A. in

Ph.D. in

Professional Certificate in

Other training related to assessment and ESL writing

5. Current professional position:


II. Professional Teaching Experience

6. How many years have you taught ESL to non-native English speakers?

7. In what type of language institute have you taught ESL?

Private language institute

College/University-bound language institute

College/University

Other, Specify:

8. Please specify course titles you have taught in the past or that you currently teach:

9. How many years have you taught ESL writing or ESL academic writing to non-native

English speakers?

(a) ESL writing: (b) ESL academic writing:

10. Do you have professional writing experience other than teaching ESL (academic)

writing courses?

If yes, please specify:

III. Evaluation of ESL Academic Writing

11. How familiar are you with the written English of non-native English speakers?

A little Quite Very Extremely

12. How competent are you in assessing the academic compositions of non-native English

speakers?

A little Quite Very Extremely

13. Have you ever been trained as an assessor of ESL academic writing?

Yes No

If yes, specify the year(s) that you received training and the number of training hours


completed (i.e., dates):

14. How many years and in what context have you assessed ESL academic writing?

15. If you have assessment experiences that might have influenced your assessment in

this study, please specify:

Closing Statement

Your think-aloud reports and answers to the interview/questionnaire have provided

valuable information about what writing skills or strategies need to be diagnosed in ESL

academic writing. Thank you so much for your interest and participation in this study.


APPENDIX D

TEACHER QUESTIONNAIRE

APPENDIX D-1

TEACHER QUESTIONNAIRE I (FOR THE PILOT STUDY)

Your answers to the following questions will help me better understand your evaluations

of the EDD checklist. All information will remain confidential, and will be used for

research purposes only.

I. Personal Profile

1. Age: 20 – 29 30 – 39 40 – 49 above 50

2. Gender: Male Female

3. First language(s):

If your first language is not English, please specify the other language(s) you speak at

home and at the workplace:

(a) at home: (b) at the workplace:

4. Educational background: (Please specify subject areas)

B.A. in

M.A. in

Ph.D. in

Professional Certificate in

Other training related to assessment and ESL writing

5. Current professional position:

II. Professional Teaching Experience

6. How many years have you taught ESL to non-native English speakers?


7. In what type of language institute have you taught ESL?

Private language institute

College/University-bound language institute

College/University

Other, Specify:

8. Please specify course titles you have taught in the past or that you currently teach:

9. How many years have you taught ESL writing or ESL academic writing to non-native

English speakers?

(a) ESL writing: (b) ESL academic writing:

10. Do you have professional writing experience other than teaching ESL (academic)

writing courses?

If yes, please specify:

III. Evaluation of ESL Academic Writing

11. How familiar are you with the written English of non-native English speakers?

A little Quite Very Extremely

12. How competent are you in assessing the academic compositions of non-native

English speakers?

A little Quite Very Extremely

13. Have you ever been trained as an assessor of ESL academic writing?

Yes No

If yes, specify the year(s) that you received training and the number of training hours

completed (i.e., dates):


14. How many years and in what context have you assessed ESL academic writing?

IV. Evaluation of the EDD Checklist

15. When you marked the given essays, how many times did you read them, on average?

Once Twice Three times More than three times

16. How much did you like the EDD checklist when marking the essays?

A little Quite Very Extremely

17. Were the EDD descriptors clearly understood?

Yes No

If no, specify descriptors that were ambiguous or not clearly understood:

18. Were the EDD descriptors redundant?

Yes No

If yes, specify descriptors that were redundant:

19. Were the EDD descriptors useful?

Yes No

If no, specify descriptors that were useless:


20. Were the EDD descriptors relevant to ESL academic writing?

Yes No

If no, specify descriptors that were irrelevant to ESL academic writing:

21. Do you think that the EDD checklist is comprehensive enough to capture all

instances of ESL academic writing?

Yes No

If no, specify areas that the EDD checklist does not describe:

22. Was the EDD checklist conducive to making a binary choice?

Yes No

If no, explain why the EDD checklist did not lend itself to making a binary choice:

23. Were there particular descriptors that you think are most or least important in developing

students' ESL academic writing?


24. What do you think are the EDD checklist's strengths?

25. What do you think are the EDD checklist's weaknesses?

26. Do you think that the EDD checklist provides useful diagnostic information about the

strengths and weaknesses of students' ESL academic writing?

Thank you for your time!


APPENDIX D-2

TEACHER QUESTIONNAIRE II (FOR THE MAIN STUDY)

Your answers to the following questions will help me better understand your evaluations

of the EDD checklist. All information will remain confidential, and will be used for

research purposes only.

I. Personal Profile

1. Age: 20 – 29 30 – 39 40 – 49 above 50

2. Gender: Male Female

3. First language(s):

If your first language is not English, please specify the other language(s) you speak at

home and at the workplace:

(a) at home: (b) at the workplace:

4. Educational background: (Please specify subject areas)

B.A. in

M.A. in

Ph.D. in

Professional Certificate in

Other training related to assessment and ESL writing

5. Current professional position:

II. Professional Teaching Experience

6. How many years have you taught ESL to non-native English speakers?

7. In what type of language institute have you taught ESL?

Private language institute

College/University-bound language institute


College/University

Other, Specify:

8. Please specify course titles you have taught in the past or that you currently teach:

9. How many years have you taught ESL writing or ESL academic writing to non-native

English speakers?

(a) ESL writing: (b) ESL academic writing:

10. Do you have professional writing experience other than teaching ESL (academic)

writing courses?

If yes, please specify:

III. Evaluation of ESL Academic Writing

11. How familiar are you with the written English of non-native English speakers?

A little Quite Very Extremely

12. How competent are you in assessing the academic compositions of non-native

English speakers?

A little Quite Very Extremely

13. Have you ever been trained as an assessor of ESL academic writing?

Yes No

If yes, specify the year(s) that you received training and the number of training hours

completed (i.e., dates):

14. How many years and in what context have you assessed ESL academic writing?


15. What tools do you usually use to evaluate ESL academic writing?

Anecdotal notes (use word descriptions)

Checklists

Rating scales

Marks, scores (use numbers)

Other, specify:

16. Have you ever used a rating scale to evaluate ESL academic writing in your

classroom evaluation?

Yes No

If yes, what kind of rating scale have you used?

Holistic rating scales

Analytic rating scales

Empirical rating scales

Other, specify:

If yes, how did you like the rating scale?

17. What evaluation methods do you usually use to diagnose your students' ESL

academic writing?

18. How often have you diagnosed your students' progress in ESL academic writing?

Once a week Once in two weeks


Once a month Once in a term

19. Do you consider diagnostic results when you teach?

Yes No

If no, explain why you do not consider diagnostic results when you teach.

IV. Evaluation of the EDD Checklist

20. When you marked the given essays, how many times did you read them, on average?

Once Twice Three times More than three times

21. How much did you like the EDD checklist when marking the essays?

A little Quite Very Extremely

22. Were the EDD descriptors clearly understood?

A little Quite Very Extremely

If there were descriptors that were ambiguous or not clearly understood, please specify:

23. Were the EDD descriptors redundant?

No A little Quite Very

If there were descriptors that were redundant, please specify:


24. Were the EDD descriptors useful?

A little Quite Very Extremely

If there were descriptors that were useless, please specify:

25. Were the EDD descriptors relevant to ESL academic writing?

A little Quite Very Extremely

If there were descriptors that were irrelevant to ESL academic writing, please specify:

26. Do you think that the EDD checklist is comprehensive enough to capture all

instances of ESL academic writing?

A little Quite Very Extremely

If there are areas that the EDD checklist does not describe, please specify:

27. Was the EDD checklist conducive to making a binary choice?

A little Quite Very Extremely

If it was not, explain why the EDD checklist did not lend itself to making a binary choice:


28. Were there particular descriptors that you think are most or least important in developing

students' ESL academic writing?

Thank you for your time!


APPENDIX E

GUIDING INTERVIEW QUESTIONS FOR TEACHERS

APPENDIX E-1

GUIDING INTERVIEW QUESTIONS FOR TEACHERS (FOR THE PILOT STUDY)

1. Were the EDD descriptors clearly understood?

2. Were the EDD descriptors redundant?

3. Were the EDD descriptors useful?

4. Were the EDD descriptors relevant to ESL academic writing?

5. Do you think that the EDD checklist is comprehensive enough to capture all instances

of ESL academic writing?

6. Was the EDD checklist conducive to making a binary choice?

7. Were there particular descriptors that you think are most or least important in developing

students' ESL academic writing?

8. What do you think are the EDD checklist's strengths?

9. What do you think are the EDD checklist's weaknesses?

10. Do you think that the EDD checklist provides useful diagnostic information about the

strengths and weaknesses of students' ESL academic writing?

11. Would you elaborate on the reasons why you liked (or did not like) the EDD checklist

when assessing essays?


APPENDIX E-2

GUIDING INTERVIEW QUESTIONS FOR TEACHERS (FOR THE MAIN STUDY)

1. Why did you find the EDD checklist useful (or useless)?

2. Why do you think that the EDD checklist provides (or does not provide) useful

diagnostic information about the strengths and weaknesses of students' ESL academic

writing?

3. In what ways do you think that the diagnostic information provided by the EDD

checklist will (or will not) be useful for classroom instruction and assessment?

4. In what ways do you think that the diagnostic information provided by the EDD

checklist will (or will not) improve the way you teach ESL academic writing?

5. If you have any positive or negative comments about the use of the EDD checklist,

please tell me.


APPENDIX F

TEXTUAL CHARACTERISTICS OF THE THREE ESSAY SETS

Table F-1

Characteristics of Essay Set 1

Essay number | ETS score | No. of paragraphs | No. of words | No. of word types | K1 words (%) | K2 words (%) | AWL words (%) | Lexical density

1015 3 10 344 115 91.57 2.62 2.33 0.39

1025 3 4 396 153 87.88 4.55 5.56 0.49

1034 2 4 156 94 93.59 1.92 1.92 0.39

1128 4 4 383 200 86.42 2.87 4.44 0.50

1176 5 6 448 218 87.05 4.02 6.25 0.52

2013 2 4 332 122 94.88 1.20 3.01 0.44

2045 3 4 322 129 89.13 2.48 5.59 0.50

2063 3 4 306 138 92.16 2.29 3.59 0.38

2124 5 5 586 232 83.62 3.07 8.53 0.52

2236 4 2 553 241 86.26 3.98 6.15 0.47

Mean 3.4 4.7 382.6 164.2 89.26 2.9 4.74 0.46


Table F-2

Characteristics of Essay Set 2

Essay number | ETS score | No. of paragraphs | No. of words | No. of word types | K1 words (%) | K2 words (%) | AWL words (%) | Lexical density

1030 3 5 299 142 94.31 2.34 1.34 0.43

1057 2 4 173 102 84.39 5.20 6.94 0.51

1112 4 6 518 229 84.17 3.86 7.92 0.46

1135 4 4 547 205 89.76 2.38 5.48 0.46

1160 5 3 421 230 85.51 3.56 7.60 0.48

2019 2 11 269 134 84.39 8.18 2.97 0.49

2074 4 8 358 199 73.46 3.07 13.69 0.55

2078 3 5 332 148 91.87 4.22 1.20 0.45

2122 4 4 363 163 87.33 3.86 7.71 0.52

2134 5 5 510 275 76.86 2.75 6.67 0.55

Mean 3.6 5.5 379 182.7 85.21 3.94 6.15 0.49


Table F-3

Characteristics of Essay Set 3

Essay number | ETS score | No. of paragraphs | No. of words | No. of word types | K1 words (%) | K2 words (%) | AWL words (%) | Lexical density

1001 1 4 65 44 87.69 3.08 4.62 0.45

1004 3 3 268 100 92.16 1.49 2.99 0.42

1152 4 1 298 155 88.59 2.35 6.38 0.49

1183 5 5 367 198 88.56 2.18 4.63 0.51

1205 2 5 212 92 87.74 0.47 7.08 0.47

2039 2 3 220 112 92.27 3.64 4.09 0.53

2042 3 4 357 163 89.92 5.04 4.20 0.47

2144 5 5 411 205 82.73 4.38 9.49 0.46

2163 4 4 322 147 86.34 1.86 9.32 0.54

2237 1 2 84 48 90.48 1.19 8.33 0.55

Mean 3.0 3.6 260.4 126.4 88.65 2.57 6.11 0.49
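The figures in Tables F-1 to F-3 are of the kind produced by lexical-profiling tools: each word token is checked against frequency-band lists (the first and second 1,000 word families and the Academic Word List), and the coverage of each band is reported as a percentage of all tokens. The sketch below illustrates the calculation only; the miniature word lists and sample essay are hypothetical, and the study's actual profiling tool is not identified in this appendix.

```python
def vocabulary_profile(tokens, k1_list, k2_list, awl_list):
    """Percentage of word tokens falling in each frequency band."""
    total = len(tokens)
    def coverage(word_list):
        return round(100.0 * sum(t in word_list for t in tokens) / total, 2)
    return {"K1 (%)": coverage(k1_list),
            "K2 (%)": coverage(k2_list),
            "AWL (%)": coverage(awl_list)}

# Miniature, made-up word lists and essay tokens, for illustration only.
k1 = {"we", "study", "to", "a", "work", "in", "the", "good"}
k2 = {"acquire"}
awl = {"environment", "cooperate"}
essay_tokens = ["we", "study", "to", "acquire", "skills", "for", "a",
                "good", "work", "environment"]
print(vocabulary_profile(essay_tokens, k1, k2, awl))
```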


APPENDIX G

ORDER OF ESSAYS IN EACH SET

Sequence of essays | Essay Set 1 (Ann / Shelley / Sarah) | Essay Set 2 (James / Beth / George) | Essay Set 3 (Judy / Tim / Esther)

1 1025 2063 1015 1135 2078 1160 1205 2237 1152

2 1015 2236 1034 1030 2122 1135 1152 2039 1001

3 1128 2013 1025 1057 2134 1057 1001 2042 1004

4 1034 2124 1128 1112 2074 1112 1183 2163 1205

5 1176 2045 1176 1160 2019 1030 1004 2144 1183

6 2013 1034 2124 2074 1057 2019 2144 1183 2039

7 2124 1128 2045 2122 1160 2134 2042 1152 2144

8 2063 1015 2063 2134 1112 2078 2237 1004 2237

9 2045 1176 2236 2078 1030 2074 2039 1205 2042

10 2236 1025 2013 2019 1135 2122 2163 1001 2163


APPENDIX H

EXCERPTS FROM TEACHER THINK-ALOUD VERBAL TRANSCRIPTS

Teacher Name: George

[Essay # 1160]

So I'm just reading it now, and I've noticed a couple of spelling mistakes that just stick

out -- I'm looking at the organization of it which I normally do, and get the layout of the

ideas, because that's always what I look at first. This student is asking a question, I like

that because it hooks the reader, at the beginning of the introductory paragraph. Ok, I'm

looking at the introductory paragraph and see a couple questions which are good, because

I think they hook the reader. I always advise students to do that but the challenge is, you

know because it's a very short essay, it's only three paragraphs, I don't see any sort of

overriding thesis statement or no main, um, statement, outlying his or her argument as to

what he's going to say, so I see that as a bit of a weakness in this introductory paragraph.

There is a spelling mistake with dilemma… Okay, and I see a big run-on sentence in the

middle of the paragraph that is distracting me. This is one thing I find challenging with

writing is form versus… you know, if the reader gets distracted by your writing, then the

message is lost, that's one thing I would work on with this student, is trying to make the

sentences crisp and clear, often I prefer shorter sentences, I mean, obviously mix the

sentence structure but this is losing me so I would make a note of this middle sentence,

which I just find too long and verbose. This is some good writing. I mean good sentence

structure in here.

I am just reading it again. So, I had a look. So this first paragraph this student has some

good ideas. There's a nice, the writing is quite proficient, sort of clear vocabulary that's

used and a lot of concrete sentence structure and sophisticated phrasing, I love the second

sentence, on the one hand we all know that education serves a social purpose: we study

in order to acquire the skills and knowledge that will help us perform well in the future,

in a working environment-- what I probably recommend this student do is reorganize,

move some of the sentences in the second paragraph into the introductory paragraph to

keep it framed, because the framing of the paper is important, the visual layout of it, so

the introductory paragraph has the hook for the reader he's asking the questions but then

the writer needs to answer those questions quickly and make a statement of his or her

opinion. And I think the point in here is just a matter of reorganizing it, social agents,

third paragraph, sophisticated concept… Okay there's some good thoughts in here, I just

read the last paragraph, I'm thinking, there's some good thoughts, it skips around a little

bit and doesn't have a strong focus. So what I'd recommend because the supporting

sentence here about the French students, the last sentence in the second paragraph, large

number of French students, there is pluralization, too, and massive failures of the exam,

making the job much harder, for the drop-outs, so I think what I'd probably recommend

this writer do, the writing is generally quite good, quite proficient, there's not really

many problems, careers on the last page, a spelling error then, have to choose between,


so the to again -- the feedback I'd want to give this writer is the idea of focus, so to have

a stronger focus and to front-load the writing, the focus is in English academic, English

in general, we tend to front-load ideas, so put the most important idea up front, so I liked

the introductory paragraph as it's written but it needs some summary of what his or her

main point is under that, and I think the raw material is in the second paragraph, and the

third paragraph can be synthesized, so we need a strong thesis statement, synthesized in

the first paragraph, and then, um, then making sure that following -- the two remaining

paragraphs, one main body paragraph and a concluding paragraph, that the supporting

paragraph, the body really has enough support to prove the statement that student writes

in the introductory statement, and then the concluding statement, reinforces what his or

her argument is and leads the readers, yeah you know I agree with that, so I think this

paper what I'd really focus on is form and minor editing also, trying to get the writer to

review his or her work, to look for a series of spelling mistakes, a number of spelling

mistakes -- the grammar is great, and just some minor issues along those lines, I mean the

writing is very proficient and powerful, just needs reorganization of ideas.

Follow-up Interview Questions

Researcher: You said the grammar is great. Can you be a bit more specific?

George: Well it is, the sentences, using complex sentence structure, I shouldn't presume

it's a he but the student is using complex sentence structure, certain multiple, um, sort of

subordinate clauses that a lot of nice connecting phrases like in recent years, for example,

with this in mind, and that shows, this person has that cohesiveness in the writing, they

can connect the material quite easily and make it flow quite nicely, what I probably

suggest is that there was that bit of a run-on sentence in that second paragraph, and this is

where editing comes in to keep the focus clear, once the reader starts getting lost in the

writing in the form, then the message disappears. So what I'd recommend, the writer put

a colon here, anywhere, with semi-colons and colons I'll often see if there's a way to

make it into two sentences, make it more concise and efficient, and clear, but what I can

see, looking through it quickly, the grammar seemed quite sophisticated, and the

grammar accuracy seemed quite good.

[Essay # 1135]

I just read the first paragraph, I noticed that there's -- the writing is interesting because I

just finished reading the more sophisticated writing, and this is also quite good from what

I see the grammar doesn‟t have any major issues, but there are some sort of sentence

structure issues that I think will be important to address like this last question, but when

it comes to address I'd say it's the first one that's important, and that's where I

recommend dividing it into two periods, leaving out the but -- saying when it comes to

this question I think it's hard to say which one is important, period, people should

consider both these. Instead of two things because things is so general, state specifically

what you're talking of and use phrases from the prompt. Because it just makes it more

clear, using a word like thing is so vague and make their own choose, here we have sort

of a word form issue, beleif -- spelling mistake -- okay and the subject verb agreement


here, someone totally do'es' not pay, so the second paragraph -- I read the first and

second paragraphs and then I skimmed to see how the paper is organized and looking at

how the paragraphs begin, the first few phrases in the paragraphs, sometimes what I do to

see the organization and layout of the writing, I think the student has some good

techniques, sort of starting out with my dream, it interests the reader, interesting narrative,

personal anecdote, personal, storytelling approach, and then the framing of the second

paragraph to begin with is great in the second place, to sum up, so that shows the reader

the different steps in the argument, first second and the conclusion. So that is a good

sense of the importance of organization but there are some grammatical issues, subject

verb agreement, someone totally does not pay attention… if someone 'is' very interested,

so again someone being singular, seems a bit of a problem, the student has to know that

someone is singular and she/he decides so again subject verb agreement, so we have

grammatical accuracy issues, subject verb agreements, some typos, subjests, from this

case, people can see with interest, people may not do well from subjects one day, he or

she may not go back to his or her interesting things and just the last sentence in the

second paragraph, he or she may go, return, or um, yeah so there's some awkwardness

around that, there's not idiomatic writing -- let me read the second to last paragraph --

since that -- some issue there, although it may appear to be -- so

Okay, there's some problems in the second paragraph, some grammatical structural

problems, so what I want to say is I'm also interested in designing, gerund versus

infinitives, so it's not boring to me, now when I have time to kill, I will play soccer with

my friends and some phrases here, finding for designing -- so this I don't really follow, I

presume she or he -- she probably, the job search… so, issues of word forms, I think,

nouns versus verbs, gerunds versus infinitives, choosing a design in course, and the job

search, vocabulary usage, -- so those people can see the job factors may have an

important effect on your choice -- again another typo… another subject verb agreement,

everything 'has' their flaws and benefits just like a coin… interesting, okay so I think the

organization of… this paper is quite good, I think the form of the um, layout of it, is quite

promising, it's a nice introductory paragraph with a brief thesis statement, when it comes

to this question it's hard to say which is important, people should consider both these two

things and carefully make their own chioce. I think that's a nice clear statement, and the

student has some good strategies using these personal experiences to reinforce points,

makes it memorable for the reader, there is some, major, spelling mistakes, throughout…

and there are also some issues with subject verb agreement and some issues with

sentence structure… um, and um, verb forms, and word forms… but I mean they have a

sense of proper sentence structure. What I would recommend with this student is to focus

on these specific areas, like subject verb agreement, word forms, they have the flow of

the writing it's the accuracy that has to be improved but I think it's quite good it just

needs some fine turning and editing in those areas.


APPENDIX I

THE EDD CHECKLIST

Essay number:

1. This essay answers the question. Yes No

2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.

Yes No

3. This essay is concisely written and contains few redundant ideas or linguistic expressions.

Yes No

4. This essay contains a clear thesis statement. Yes No

5. The main arguments of this essay are strong. Yes No

6. There are enough supporting ideas and examples in this essay. Yes No

7. The supporting ideas and examples in this essay are appropriate and logical.

Yes No

8. The supporting ideas and examples in this essay are specific and detailed.

Yes No

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.

Yes No

10. Each body paragraph has a clear topic sentence tied to supporting sentences.

Yes No

11. Each paragraph presents one distinct and unified idea. Yes No

12. Each paragraph is connected to the rest of the essay. Yes No

13. Ideas are developed or expanded well throughout each paragraph. Yes No

14. Transition devices are used effectively. Yes No

15. This essay demonstrates syntactic variety, including simple, compound, and complex sentence structures.

Yes No

16. This essay demonstrates an understanding of English word order. Yes No

17. This essay contains few sentence fragments. Yes No

18. This essay contains few run-on sentences or comma splices. Yes No

19. Grammatical or linguistic errors in this essay do not impede comprehension.

Yes No

20. Verb tenses are used appropriately. Yes No

21. There is consistent subject-verb agreement. Yes No

22. Singular and plural nouns are used appropriately. Yes No

23. Prepositions are used appropriately. Yes No

24. Articles are used appropriately. Yes No

25. Pronouns agree with referents. Yes No



26. Sophisticated or advanced vocabulary is used. Yes No

27. A wide range of vocabulary is used. Yes No

28. Vocabulary choices are appropriate for conveying the intended meaning.

Yes No

29. This essay demonstrates facility with appropriate collocations. Yes No

30. Word forms (noun, verb, adjective, adverb, etc.) are used appropriately.

Yes No

31. Words are spelled correctly. Yes No

32. Punctuation marks are used appropriately. Yes No

33. Capital letters are used appropriately. Yes No

34. This essay contains appropriate indentation. Yes No

35. Appropriate tone and register are used throughout the essay. Yes No
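For record-keeping, a teacher's judgments on the 35 descriptors can be stored as one binary response vector per essay. The sketch below illustrates that bookkeeping only, not the psychometric model used in the study; the responses shown are invented, and only the first five descriptors are filled in.

```python
# Hypothetical Yes/No judgments for one essay, keyed by descriptor number (1-35).
responses = {1: True, 2: True, 3: False, 4: True, 5: False}  # ...continue to 35

def summarize(responses: dict) -> str:
    yes = sum(responses.values())
    total = len(responses)
    return f"{yes} of {total} descriptors marked Yes ({100 * yes / total:.0f}%)"

print(summarize(responses))  # "3 of 5 descriptors marked Yes (60%)"
```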


APPENDIX J

THE EDD CHECKLIST WITH CONFIDENCE LEVEL

Essay number:

1. This essay answers the question. Yes No ( %)

2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.

Yes No ( %)

3. This essay is concisely written and contains few redundant ideas or linguistic expressions.

Yes No ( %)

4. This essay contains a clear thesis statement. Yes No ( %)

5. The main arguments of this essay are strong. Yes No ( %)

6. There are enough supporting ideas and examples in this essay.

Yes No ( %)

7. The supporting ideas and examples in this essay are appropriate and logical.

Yes No ( %)

8. The supporting ideas and examples in this essay are specific and detailed.

Yes No ( %)

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.

Yes No ( %)

10. Each body paragraph has a clear topic sentence tied to supporting sentences.

Yes No ( %)

11. Each paragraph presents one distinct and unified idea. Yes No ( %)

12. Each paragraph is connected to the rest of the essay. Yes No ( %)

13. Ideas are developed or expanded well throughout each paragraph.

Yes No ( %)

14. Transition devices are used effectively. Yes No ( %)

15. This essay demonstrates syntactic variety, including simple, compound, and complex sentence structures.

Yes No ( %)

16. This essay demonstrates an understanding of English word order.

Yes No ( %)

17. This essay contains few sentence fragments. Yes No ( %)

18. This essay contains few run-on sentences or comma splices. Yes No ( %)

19. Grammatical or linguistic errors in this essay do not impede comprehension.

Yes No ( %)

20. Verb tenses are used appropriately. Yes No ( %)

21. There is consistent subject-verb agreement. Yes No ( %)

22. Singular and plural nouns are used appropriately. Yes No ( %)

23. Prepositions are used appropriately. Yes No ( %)



24. Articles are used appropriately. Yes No ( %)

25. Pronouns agree with referents. Yes No ( %)

26. Sophisticated or advanced vocabulary is used. Yes No ( %)

27. A wide range of vocabulary is used. Yes No ( %)

28. Vocabulary choices are appropriate for conveying the intended meaning. Yes No ( %)

29. This essay demonstrates facility with appropriate collocations. Yes No ( %)

30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately. Yes No ( %)

31. Words are spelled correctly. Yes No ( %)

32. Punctuation marks are used appropriately. Yes No ( %)

33. Capital letters are used appropriately. Yes No ( %)

34. This essay contains appropriate indentation. Yes No ( %)

35. Appropriate tone and register are used throughout the essay. Yes No ( %)


APPENDIX K

ASSESSMENT GUIDELINES I (FOR THE PILOT STUDY)

Dear Teachers,

Thank you so much for your interest in my doctoral dissertation study. I am conducting a study that examines an effective way to provide diagnostic feedback to ESL writers on a timed essay test. Due to the complex nature of second language writing, ESL learners need to be well-informed about the strengths and weaknesses in their writing. Despite the interest in and need for a diagnostic approach in ESL writing instruction and assessment, however, little is known about what kind of linguistic skills or strategies must be diagnosed, and in what ways. It is thus critical to have the opinions of ESL writing teachers as a source of accurate diagnostic feedback to ESL learners. The information they provide will ultimately result in a detailed diagnostic description that can be tailored to and made available to individual ESL writers.

Over the past few months, I have worked with nine ESL writing teachers to develop a diagnostic assessment scheme. The teachers were invited to a think-aloud session in which they verbally reported their thinking processes while providing diagnostic feedback on 10 ESL timed essays. The essays were written within 30 minutes by adult ESL learners with a wide range of English proficiency levels in a large-scale testing setting. The verbal data that the teachers provided were analyzed, and emerging themes were coded. Thirty-nine separate themes were identified, each consisting of one descriptor of ESL academic writing. These 39 descriptors were then reviewed by four PhD students specializing in ESL writing. These experts' review resulted in the deletion of four descriptors. Using the remaining 35 descriptors, I have created a diagnostic assessment scheme, called the “Empirically-derived Descriptor-based Diagnostic (EDD) checklist.”

Now, I would like you to mark the enclosed essays using the EDD checklist. Before marking the essays, please read the EDD checklist carefully and internalize it. You will be asked to answer yes or no to each descriptor in relation to each essay. I understand that it is not easy to determine the cut-off of yes or no. If you think a writer generally meets the criteria of the descriptor, it should be considered a yes. Otherwise, it is considered a no. The term generally indicates the state in which you do not feel distracted or your comprehension is not compromised by a student's mistake on the skill being assessed. When you make this decision for each descriptor, please specify your confidence level in the blank box next to Yes No (please specify your confidence level on 10 essays [i.e., 5 essays × 2 prompts]). If you are extremely confident in using the descriptor, your confidence level will be 100%. On the other hand, if you are not confident at all in answering yes or no to a descriptor, then your confidence level will be 0%. You can specify your confidence level anywhere along the continuum between 0% and 100% (e.g., 30%, 50%, 70%, etc).

Below, I will explain the meaning of some descriptors that might cause confusion. I have selected only a few descriptors to aid your understanding; however, if there is anything you are not sure of, please do not hesitate to let me know.

1. This essay answers the question.
: If a writer addresses a topic that is not relevant to the given question, or does not respond to the specific instructions in the prompt, he or she would not satisfy this descriptor.

6. There are enough supporting ideas and examples in this essay.
: As long as a writer presents a minimum of two supporting ideas and examples in his or her essay, he or she would satisfy this descriptor.

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.
: If a writer does not include an introduction, a body, and (not or) a conclusion in his or her essay, he or she would not satisfy this descriptor.

15. This essay demonstrates syntactic variety.
: If a writer demonstrates the ability to use a variety of syntactic structures including simple, compound, and complex sentences, he or she would satisfy this descriptor.

27. A wide range of vocabulary is used.
: If a writer uses a broad range of vocabulary and varied synonyms, he or she would satisfy this descriptor. If, on the other hand, a writer uses the same words repeatedly, he or she would not satisfy this descriptor.

28. Vocabulary choices are appropriate for conveying the intended meaning.
: If a writer employs inappropriate word choices without knowing the accurate meaning of the words, he or she would not satisfy this descriptor. For example, if an essay reads “study extravagant subjects,” the writer obviously does not know the accurate meaning of ‘extravagant.’

29. This essay demonstrates facility with appropriate collocations.
: If a writer uses collocations inappropriately, he or she would not satisfy this descriptor. For example, if an essay reads “a person does a decision” instead of “a person makes a decision,” the writer would not satisfy this descriptor. In addition, if an essay shows awkward word-for-word translations, the writer would not satisfy this descriptor.


30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.
: If an essay reads “Canada is safety” instead of “Canada is safe,” the writer would not satisfy this descriptor.

32. Punctuation marks are used appropriately.
: If a writer does not use punctuation marks (i.e., commas, full stops, colons, question marks, quotation marks, etc) appropriately, he or she would not satisfy this descriptor. For example, if a writer uses a comma in the wrong place or does not know how to use colons correctly, he or she would not satisfy this descriptor.

35. Appropriate tone and register are used throughout the essay.
: If a writer does not employ appropriate academic tone and register, he or she would not satisfy this descriptor. For example, “in a nutshell” is too colloquial to be used in academic writing.

When you mark the essays using the EDD checklist, please do not forget to write down the number of the essay that you are marking on the EDD checklist.

If you have any questions about the EDD checklist or any other concerns, please do not hesitate to contact me at [email protected]. Thank you again for your support for the study.

Sincerely,

Youn-Hee Kim

Ph.D. candidate, Second Language Education

Department of Curriculum, Teaching and Learning

Ontario Institute for Studies in Education, University of Toronto

Email: [email protected]


APPENDIX L

ASSESSMENT GUIDELINES II (FOR THE MAIN STUDY)

Dear Teachers,

Thank you so much for your interest in my doctoral dissertation study. I am conducting a study that examines an effective way to provide diagnostic feedback to ESL writers on a timed essay test. Due to the complex nature of second language writing, ESL learners need to be well-informed about the strengths and weaknesses in their writing. Despite the interest in and need for a diagnostic approach in ESL writing instruction and assessment, however, little is known about what kind of linguistic skills or strategies must be diagnosed, and in what ways. It is thus critical to have the opinions of ESL writing teachers as a source of accurate diagnostic feedback to ESL learners. The information they provide will ultimately result in a detailed diagnostic description that can be tailored to and made available to individual ESL writers.

Over the past few months, I have worked with nine ESL writing teachers to develop a diagnostic assessment scheme. The teachers were invited to a think-aloud session in which they verbally reported their thinking processes while providing diagnostic feedback on 10 ESL timed essays. The essays were written within 30 minutes by adult ESL learners with a wide range of English proficiency levels in a large-scale testing setting. The verbal data that the teachers provided were analyzed, and emerging themes were coded. Thirty-nine separate themes were identified, each consisting of one descriptor of ESL academic writing. These 39 descriptors were then reviewed by four PhD students specializing in ESL writing. These experts' review resulted in the deletion of four descriptors. Using the remaining 35 descriptors, I have created a diagnostic assessment scheme, called the “Empirically-derived Descriptor-based Diagnostic (EDD) checklist.”

Now, I would like you to mark the enclosed essays using the EDD checklist. Before marking the essays, please read the EDD checklist carefully and internalize it. You will be asked to answer yes or no to each descriptor in relation to each essay. I understand that it is not easy to determine the cut-off of yes or no. If you think a writer generally meets the criteria of the descriptor, it should be considered a yes. Otherwise, it is considered a no. The term generally indicates the state in which you do not feel distracted or your comprehension is not compromised by a student's mistake on the skill being assessed. When you make this decision for each descriptor, please specify your confidence level in the blank box next to Yes No (please specify your confidence level on 10 essays [i.e., 5 essays × 2 prompts]). If you are extremely confident in using the descriptor, your confidence level will be 100%. On the other hand, if you are not confident at all in answering yes or no to a descriptor, then your confidence level will be 0%. You can specify your confidence level anywhere along the continuum between 0% and 100% (e.g., 30%, 50%, 70%, etc).

Below, I will explain the meaning of some descriptors that might cause confusion. I have selected only a few descriptors to aid your understanding; however, if there is anything you are not sure of, please do not hesitate to let me know.

1. This essay answers the question.
: If a writer addresses a topic that is not relevant to the given question, or does not respond to the specific instructions in the prompt, he or she would not satisfy this descriptor.

6. There are enough supporting ideas and examples in this essay.
: As long as a writer presents a minimum of two supporting ideas and examples in his or her essay, he or she would satisfy this descriptor.

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.
: If a writer does not include an introduction, a body, and (not or) a conclusion in his or her essay, he or she would not satisfy this descriptor.

15. This essay demonstrates syntactic variety.
: If a writer demonstrates the ability to use a variety of syntactic structures including simple, compound, and complex sentences, he or she would satisfy this descriptor.

27. A wide range of vocabulary is used.
: If a writer uses a broad range of vocabulary and varied synonyms, he or she would satisfy this descriptor. If, on the other hand, a writer uses the same words repeatedly, he or she would not satisfy this descriptor.

28. Vocabulary choices are appropriate for conveying the intended meaning.
: If a writer employs inappropriate word choices without knowing the accurate meaning of the words, he or she would not satisfy this descriptor. For example, if an essay reads “study extravagant subjects,” the writer obviously does not know the accurate meaning of ‘extravagant.’

29. This essay demonstrates facility with appropriate collocations.
: If a writer uses collocations inappropriately, he or she would not satisfy this descriptor. For example, if an essay reads “a person does a decision” instead of “a person makes a decision,” the writer would not satisfy this descriptor. In addition, if an essay shows awkward word-for-word translations, the writer would not satisfy this descriptor.


30. Word forms (noun, verb, adjective, adverb, etc) are used appropriately.
: If an essay reads “Canada is safety” instead of “Canada is safe,” the writer would not satisfy this descriptor.

32. Punctuation marks are used appropriately.
: If a writer does not use punctuation marks (i.e., commas, full stops, colons, question marks, quotation marks, etc) appropriately, he or she would not satisfy this descriptor. For example, if a writer uses a comma in the wrong place or does not know how to use colons correctly, he or she would not satisfy this descriptor.

34. This essay contains appropriate indentation.
: If a writer does not use approximately five to seven spaces to indent the first sentence of each paragraph, he or she would not satisfy this descriptor.

35. Appropriate tone and register are used throughout the essay.
: If a writer does not employ appropriate academic tone and register, he or she would not satisfy this descriptor. For example, “in a nutshell” is too colloquial to be used in academic writing.

Please take note of the following points:

(1) When you determine what constitutes ‘few’ on descriptors 3, 17, and 18, consider how noticeable the linguistic errors are. For example, if you find that fragmentary sentences draw your attention, the essay would not satisfy descriptor 17.

3. This essay is concisely written and contains few redundant ideas or linguistic expressions.
17. This essay contains few sentence fragments.
18. This essay contains few run-on sentences or comma splices.

(2) There are fundamental differences between descriptors 6, 7, and 8.

6. There are enough supporting ideas and examples in this essay.
7. The supporting ideas and examples in this essay are appropriate and logical.
8. The supporting ideas and examples in this essay are specific and detailed.

(3) There is also a difference between descriptors 2 and 19:

2. This essay is written clearly enough to be read without having to guess what the writer is trying to say.


19. Grammatical or linguistic errors in this essay do not impede comprehension.
: While descriptor 2 indicates that an essay might not be read easily for many reasons (e.g., poor organization, poor content, or linguistic errors), descriptor 19 focuses primarily on grammatical or linguistic errors that impede comprehension.

(4) When determining the degree of ‘vocabulary sophistication’ and ‘vocabulary breadth’ on descriptors 26 and 27, consider the context in which the essays were written. These essays were written by adult ESL students who wish to be admitted to a college/university or a graduate school in English-speaking countries.

26. Sophisticated or advanced vocabulary is used.
27. A wide range of vocabulary is used.

(5) Also, please pay attention to the slight difference between descriptors 9 and 34.

9. The ideas are organized into paragraphs and include an introduction, a body, and a conclusion.
34. This essay contains appropriate indentation.
: While descriptor 9 focuses on whether a writer is able to organize his or her ideas into paragraphs using an appropriate essay structure (i.e., introduction, body, and conclusion), descriptor 34 asks whether a writer has indented the first sentence of each paragraph to make a visual distinction between the paragraphs.

(6) If a writer does not employ the relevant linguistic features, please do not mark the yes or no box.

14. Transition devices are used effectively.
: If a writer does not employ transition devices at all, please do not mark the yes or no box.

29. This essay demonstrates facility with appropriate collocations.
: If a writer does not employ collocations at all, please do not mark the yes or no box.

When you mark the essays using the EDD checklist, please do not forget to write down the number of the essay that you are marking on the EDD checklist.

If you have any questions about the EDD checklist or any other concerns, please do not hesitate to contact me at [email protected]. Thank you again for your support for the study.

Sincerely,

Youn-Hee Kim

Ph.D. candidate, Second Language Education

Department of Curriculum, Teaching and Learning

Ontario Institute for Studies in Education, University of Toronto

Email: [email protected]


APPENDIX M

CORRELATIONS BETWEEN ETS SCORES AND TEACHER SCORES

Table M-1

Correlation Matrix of Essay Set 1

         ETS      Ann      Shelley  Sarah
ETS      1.00
Ann      .82**    1.00
Shelley  .84**    .80**    1.00
Sarah    .89      .90**    .79**    1.00

** indicates p < .01

Table M-2

Correlation Matrix of Essay Set 2

         ETS      James    Beth     George
ETS      1.00
James    .88**    1.00
Beth     .86**    .76*     1.00
George   .77**    .75*     .76*     1.00

** indicates p < .01, * indicates p < .05

Table M-3

Correlation Matrix of Essay Set 3

         ETS      Judy     Tim      Esther
ETS      1.00
Judy     .98**    1.00
Tim      .98**    .96**    1.00
Esther   .95**    .92**    .93**    1.00

** indicates p < .01

Note. The correlation coefficients found in Essay Set 3, which contained shorter essays, were greater than those in Essay Sets 1 and 2. Further research is recommended on the relationship between essay length and the magnitude of correlation coefficients.
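As a purely illustrative aid for readers who wish to reproduce this kind of summary, the short sketch below shows how a matrix of pairwise Pearson correlations can be computed from per-essay holistic scores. It is not part of the original analysis: the rater labels "Teacher1" and "Teacher2" and all score values are hypothetical, and only the procedure is intended to be informative.

    # Minimal sketch (hypothetical data): pairwise Pearson correlations between
    # ETS scores and teacher scores assigned to the same set of essays.
    # Requires Python 3.10+ for statistics.correlation.
    from itertools import combinations
    from statistics import correlation

    scores = {
        "ETS":      [5, 4, 3, 5, 2, 4, 3, 1, 4, 5],   # hypothetical holistic scores
        "Teacher1": [5, 4, 3, 4, 2, 4, 3, 2, 4, 5],
        "Teacher2": [4, 4, 2, 5, 2, 3, 3, 1, 4, 4],
    }

    # Print every pairwise correlation, mirroring the layout of the tables above.
    for a, b in combinations(scores, 2):
        print(f"r({a}, {b}) = {correlation(scores[a], scores[b]):.2f}")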


APPENDIX N

DESCRIPTOR MEASURE STATISTICS

Descriptor   Obsvd Average   Fair-M Average   Measure (logits)   Model S.E.   Infit MnSq   Outfit MnSq   PtBis Corr.

D01 0.6 0.59 0.10 0.15 1.17 1.28 0.11

D02 0.6 0.58 0.14 0.15 0.86 0.84 0.36

D03 0.5 0.46 0.64 0.15 0.90 0.90 0.33

D04 0.6 0.60 0.05 0.15 0.99 1.06 0.27

D05 0.4 0.35 1.06 0.15 0.92 0.88 0.29

D06 0.4 0.41 0.84 0.15 0.94 0.91 0.29

D07 0.5 0.46 0.64 0.15 0.93 0.87 0.31

D08 0.4 0.42 0.79 0.15 0.92 0.89 0.31

D09 0.7 0.75 -0.64 0.17 1.12 1.19 0.16

D10 0.4 0.41 0.84 0.15 0.93 0.93 0.30

D11 0.6 0.59 0.10 0.15 0.95 0.92 0.31

D12 0.7 0.69 -0.33 0.16 0.97 0.93 0.28

D13 0.6 0.61 0.02 0.15 0.99 0.97 0.27

D14 0.4 0.43 0.73 0.15 1.06 1.10 0.21

D15 0.7 0.67 -0.25 0.16 0.85 0.79 0.39

D16 0.9 0.88 -1.55 0.21 0.96 0.89 0.22

D17 0.6 0.64 -0.09 0.15 0.93 0.91 0.32

D18 0.6 0.55 0.25 0.15 1.19 1.28 0.10

D19 0.5 0.47 0.59 0.15 0.89 0.86 0.32

D20 0.6 0.67 -0.23 0.16 1.08 1.09 0.18

D21 0.8 0.82 -1.07 0.18 1.03 0.99 0.21

D22 0.8 0.84 -1.19 0.19 1.12 1.27 0.12

D23 0.7 0.69 -0.35 0.16 1.02 1.02 0.21

D24 0.6 0.66 -0.18 0.16 1.16 1.23 0.12

D25 0.8 0.83 -1.12 0.18 0.94 0.99 0.25

D26 0.3 0.28 1.41 0.16 0.92 0.85 0.30

D27 0.5 0.46 0.64 0.15 0.87 0.83 0.36

D28 0.7 0.71 -0.43 0.16 1.03 1.06 0.20

D29 0.4 0.38 0.93 0.15 0.85 0.79 0.36

D30 0.6 0.58 0.13 0.15 1.04 1.04 0.20

D31 0.5 0.50 0.46 0.15 1.09 1.12 0.17

D32 0.6 0.62 -0.04 0.15 0.98 0.99 0.26

D33 0.8 0.85 -1.23 0.19 1.05 1.21 0.17

D34 0.6 0.57 0.17 0.15 1.27 1.35 0.06

D35 0.9 0.91 -1.82 0.22 1.04 1.29 0.10
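The column labels above (observed and Fair-M averages, measures in logits, model standard errors, infit and outfit mean-squares, point-biserial correlations) are of the kind reported by a many-facet Rasch analysis. As a reading aid only, and assuming a dichotomous many-facet Rasch model of the usual form (an assumption, not a statement of the exact specification used), the descriptor measures can be read through the relation

    \log \frac{P_{nij}(\text{yes})}{P_{nij}(\text{no})} = B_n - C_j - D_i

where B_n is the ability of writer n, C_j the severity of teacher j, and D_i the difficulty of descriptor i, all expressed in logits. Under this reading, a descriptor with a large positive measure (e.g., D26 at 1.41 logits) is relatively hard to satisfy and attracts fewer yes judgments, while a descriptor with a large negative measure (e.g., D35 at -1.82 logits) is relatively easy to satisfy, which is consistent with the observed averages in the table.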


APPENDIX O

THE INITIAL Q-MATRIX

Descriptor CON ORG GRM VOC MCH

D01 1 0 0 0 0

D02 1 1 0 0 0

D03 1 1 0 1 0

D04 1 1 0 0 0

D05 1 1 0 0 0

D06 1 0 0 0 0

D07 1 1 0 0 0

D08 1 0 0 0 0

D09 0 1 0 0 0

D10 0 1 0 0 0

D11 1 1 0 0 0

D12 0 1 0 0 0

D13 1 1 0 0 0

D14 0 1 1 1 0

D15 0 0 1 0 0

D16 0 0 1 0 0

D17 0 0 1 0 1

D18 0 0 1 0 1

D19 0 0 1 0 0

D20 0 0 1 0 0

D21 0 0 1 0 0

D22 0 0 1 0 0

D23 0 0 1 0 0

D24 0 0 1 0 0

D25 0 0 1 0 0

D26 0 0 0 1 0

D27 0 0 0 1 0

D28 0 0 0 1 0

D29 0 0 1 1 0

D30 0 0 1 1 0

D31 0 0 1 0 1

D32 0 0 0 0 1

D33 0 0 1 0 1

D34 0 1 0 0 1

D35 1 1 1 1 1
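To make the structure of the Q-matrix concrete, the sketch below shows one possible way to store it and to roll a yes/no response pattern up into per-skill proportions. The three Q-matrix rows (D01 to D03) and the skill labels are copied from the table above; the response values are hypothetical, and the simple proportion roll-up is only an illustration of how the matrix links descriptors to skills, not the estimation procedure used in the study.

    # Minimal sketch: a binary Q-matrix linking descriptors to skills, plus a
    # per-skill summary of yes/no responses. Rows D01-D03 are taken from the
    # table above; the responses are hypothetical.
    skills = ["CON", "ORG", "GRM", "VOC", "MCH"]

    # q_matrix[descriptor][k] == 1 means the descriptor taps skills[k]
    q_matrix = {
        "D01": [1, 0, 0, 0, 0],
        "D02": [1, 1, 0, 0, 0],
        "D03": [1, 1, 0, 1, 0],
    }

    responses = {"D01": 1, "D02": 0, "D03": 1}   # 1 = "yes", 0 = "no" (hypothetical)

    for k, skill in enumerate(skills):
        relevant = [d for d, row in q_matrix.items() if row[k] == 1]
        if relevant:
            prop = sum(responses[d] for d in relevant) / len(relevant)
            print(f"{skill}: {prop:.2f} of relevant descriptors satisfied")
        else:
            print(f"{skill}: no descriptors in this excerpt tap this skill")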